Regulate Systems, Not Models: Reaction to CA Senator Scott Wiener’s SB1047

Hot on the heels of all this writing I’ve been doing about AI safety (previously: why the current approach to AI safety is doomed to failure, what AI safety should be, a worked example)—

My friend and South Park Commons colleague Derek Slater passed on California Senate bill SB1047, proposed by my state senator, Scott Wiener.

Since Sen. Wiener is my state senator, I’ve already got a message out to his office to discuss the bill, but having now read the full text in some detail, I wanted to set down some thoughts here for other people as well.

The bill has a mouthful of a title, the “Safe and Secure Innovation for Frontier Artificial Intelligence Systems Act,” and, as I read it, it breaks down fairly cleanly into two big parts:

  • Create a Frontier Model Division within the California state Department of Technology, and require developers who are training or intend to train “covered models” to make various certifications about their safety and capability to the Frontier Model Division, with various penalties if they don’t, if the certifications are later found to be false, or if an incident occurs.
    • A covered model is defined as a model trained using a quantity of computing power greater than 10^26 integer or floating-point operations in 2024, or a model that could reasonably be expected to have similar performance or “general capability” to such a model.
    • The certifications are particularly concerned with certain abstract “hazardous capabilities” which models might possess, and which might make the following specific harms substantially easier than they would otherwise be:
      • “The creation or use of a chemical, biological, radiological, or nuclear weapon in a manner that results in mass casualties.”
      • “At least five hundred million dollars ($500,000,000) of damage through cyberattacks on critical infrastructure via a single incident or multiple related incidents.”
      • “At least five hundred million dollars ($500,000,000) of damage by an artificial intelligence model that autonomously engages in conduct that would violate the Penal Code if undertaken by a human.”
  • Instruct the Department of Technology to commission consultants to create and operate a public cloud platform, “to be known as CalCompute, with the primary focus of conducting research into the safe and secure deployment of large-scale artificial intelligence models and fostering equitable innovation.”
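For a rough sense of where the covered-model threshold sits, here is a hedged back-of-envelope sketch. The 6 × parameters × tokens estimate is a common rule of thumb for training compute, not anything from the bill, and the example numbers are entirely hypothetical:

```python
# Hypothetical back-of-envelope check against the bill's 10^26-operation
# threshold. The 6 * parameters * tokens estimate is a common rule of
# thumb for training compute, not anything defined in the bill.
THRESHOLD_OPS = 1e26

def estimated_training_ops(params: float, tokens: float) -> float:
    """Rough estimate of total training operations."""
    return 6 * params * tokens

def is_covered(params: float, tokens: float) -> bool:
    """Would a model of this (hypothetical) scale cross the threshold?"""
    return estimated_training_ops(params, tokens) > THRESHOLD_OPS

# A 70B-parameter model trained on 2T tokens: ~8.4e23 ops, well under 1e26.
print(is_covered(70e9, 2e12))   # False
```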

I’m going to leave aside the CalCompute part of the bill for the moment, because despite the ostensible focus on safe AI development, my read is that it’s much more of an implementation detail, and one that honestly feels kind of tacked-on to me.  Let’s focus here primarily on the Frontier Model Division and the requirements around it.

First of all, I like that we’re at least defining our unacceptable losses here.  I think the story of how some fact about the model eventually results in one of these losses is unclear to the point of fuzzy-headedness and too much science fiction, but at least we’re putting some kind of stake in the ground about what our cares and concerns are.  As I’ve written before, this is necessary for us to actually reason about the safety of systems making use of AI models.

A system diagram. A befuddled system designer looks at a box labeled "AI Model" with unconnected arrows coming into and going out of it marked with question marks. A group of users has another unconnected arrow going out of them, marked with a question mark. Losses labeled "WMDs", "$500M critical infra cyberattack", and "$500M crime" all have unconnected arrows going into them, also labeled with question marks.
A confused system designer surveys the problem space.

The issue here, as I’ve now written about at some length, is the same as the issue with AI safety writ large.

A screenshot of HuggingFace showing the RunwayML distribution of Stable Diffusion v1.5 and some of the files in it.
Some of the files comprising the RunwayML distribution of Stable Diffusion v1.5.

AI models don’t have capabilities. An AI model is just a file or some files on disk.  Systems, whether incorporating AI models or not, have capabilities.

An AI model can’t do anything on its own—an AI model can only do things when loaded into a computer’s memory, in the context of a computer program or a system of computer programs which connects it to what we would, in a control systems context, call sensors and actuators, which allow it to do things.  (Receive text or images from a user, display text or images to a user, make API calls, etc.)

Those actuators may be connected to other systems (or people), which may be connected to still other systems (or people), which may eventually allow the AI model to be reductively said to have done a thing, but those sensors and actuators are essential.  Without them the model sits inert.
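To make that concrete, here is a minimal, purely illustrative Python sketch (all names hypothetical): the “model” is an inert file on disk, and only the surrounding program supplies the sensor and actuator that let it do anything at all.

```python
import json
import os
import tempfile

# A "model" is just bytes on disk: here, a trivial JSON file standing in
# for real model weights. On its own it has no capabilities at all.
weights = {"greeting": "Hello"}
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(model_path, "w") as f:
    json.dump(weights, f)

class System:
    """A minimal 'system': loads the inert file and wires it to I/O."""
    def __init__(self, path, actuator):
        with open(path) as f:
            self.model = json.load(f)   # the file only matters once loaded
        self.actuator = actuator        # actuator: how output reaches the world

    def sense(self, user_text):
        # Sensor: receive input; then compute and actuate.
        self.actuator(f"{self.model['greeting']}, {user_text}!")

outputs = []                            # a stand-in "display" actuator
System(model_path, outputs.append).sense("world")
print(outputs)                          # ['Hello, world!']
```

The capability (greeting a user) belongs to the system, not to the file of weights sitting in the temp directory.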

A diagram. Two boxes labeled "LLM" and "User". Arrows point from User to LLM and from LLM to User, both labeled with "Arbitrary text". A confused system designer stands in the corner.
That oversimplified system diagram of a prompted text-generation LLM chatbot, once again.

While it’s true that AI developers will talk about their models as having certain kinds of capabilities (e.g. prompted text generation, prompted image generation, image-to-image generation) and as being better or worse at those capabilities on some benchmark than other models, these are only capabilities within the assumed context of some system, e.g. a fairly simple system which feeds them text prompts or image prompts and evaluates their output.

I forgot to work my joke about paperclip-maximizer AIs into the first diagram above, and I’m not attached enough to it to go back and include it, but this gets back to the idea I’ve mentioned here before: if we don’t want an AI model to turn us all into paperclips, one really easy step to take is to not put it in charge of a paperclip factory.  (And of course when I go look it up I find that the LessWrong people have decided to rename the concept with a more abstract and less emotionally salient name, haven’t they.)

This may all seem a bit pedantic or legalistic, but we are talking about this in the context of a proposed law.  Anyway, my proposed change to the bill is very simple: we should reword it to talk about constraints on and the compliance of systems incorporating frontier AI models, rather than the models themselves.

Now hold on a second, you might say.  Doesn’t this bring under the jurisdiction of the Frontier Model Division many, many more companies than just frontier model providers like OpenAI and Anthropic? Potentially including anyone deploying their own ChatGPT-powered chatbot? And yes, it does.

An evil black-hatted system designer stands beside a system they have built, where Cron once a day runs a Perl script which prints the prompt, "Today's date is $DATE. Is today Valentine's Day?" to an LLM, which outputs text to another Perl script which uses a regular expression to check if it contains the word 'yes', and, if it does, sends a signal to the serial port to detonate a bomb.
An evil system designer builds an evil LLM-containing system. (Please forgive all the syntax errors in the code, these text boxes are small and it’s been a long time since I wrote much Perl. You get the point, I hope.)

The only way that OpenAI or Anthropic could even begin to consider making the assertions that the Frontier Model Division wants them to make is if they understand and control very specifically what systems the users of their models incorporate them into.

One can imagine the above ludicrous but illustrative scenario, where an evil system designer has built a system which asks an LLM “is it Valentine’s Day,” and, if the system says it is, sets off a nuclear bomb.  (Surely a violation of the California Penal Code worth $500M at least.)  Clearly in this case the AI model can in some sense be said to have set off the bomb.  And certainly it would be reported in the news as such.
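For illustration only, the cartoon above can be sketched in a few lines of Python rather than Perl. The `llm` function here is a deterministic, hypothetical stand-in for a real model call, and the “actuator” is just a list append rather than a serial port:

```python
import re
from datetime import date

def llm(prompt: str) -> str:
    """Deterministic stand-in for a real model call (hypothetical)."""
    m = re.search(r"\d{4}-(\d{2})-(\d{2})", prompt)
    if m and (m.group(1), m.group(2)) == ("02", "14"):
        return "Yes, it is Valentine's Day."
    return "No."

def check_and_act(today: date, actuator) -> None:
    # The "cron job": prompt the model, regex-match its output, and fire
    # the actuator on "yes". The model never decided to do anything; the
    # surrounding system gave its text the power to act.
    answer = llm(f"Today's date is {today.isoformat()}. Is today Valentine's Day?")
    if re.search(r"\byes\b", answer, re.IGNORECASE):
        actuator()   # in the cartoon, the signal to the serial port

fired = []
check_and_act(date(2024, 2, 14), lambda: fired.append("boom"))
check_and_act(date(2024, 3, 1), lambda: fired.append("boom"))
print(fired)   # ['boom']
```

The model only ever emitted text; every bit of the harm lives in the glue code around it.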

There’s just no way for the frontier model provider to know what the downstream effect of an otherwise-innocuous query is in such a scenario or to make the required assertions about it.  The people who can knowledgeably make the required assertions about the effects the AI models’ output could have are the people integrating those AI models into specific systems.

This does not put the foundation model providers out of scope—all of them operate systems like the chatbot system above which connect their AI model to users and potentially a variety of other sensors and actuators.  And perhaps scenarios such as my Valentine’s Day bomb scenario fall outside the standard of reasonableness.

But by bringing organizations deploying AI models into scope, and regulating them at the system level rather than the model level, we refocus the legislation on the problem we’re actually trying to solve and the people best equipped to solve it.

Now the company integrating LLM generation of genetic sequences into its drug-synthesis pipeline needs to assert that they can’t accidentally generate a bioweapon, not just the LLM provider.  Now the company integrating AI models into their cybersecurity defense platform needs to assert that they can’t accidentally DDoS critical infrastructure, not just the company from whom they bought the trained model.

Of course the response to such regulation might be for the frontier model providers to indemnify their customers and take on this responsibility themselves, as some have already discussed doing for copyright infringement.  Such a choice would almost certainly lead, eventually, to strong contractual and technical controls on the uses to which the models could be put and how they could be integrated into larger systems.

This still puts the focus where it most needs to be, and in fact where it must necessarily be—at the point of use.

I’ll say it again: AI models and methods cannot be made safe in the abstract.  They can only be made safer in the context of particular systems.  The more we understand and leverage this fact, the safer our AI-incorporating systems will be.

Real AI Safety: Threat Modeling a Retrieval Augmented Generation (RAG) System

Or, AI Safety For Cynics With Deadlines (with apologies to the Django Project).

Previously: Why AI safety is impossible as the problem is currently framed, how to better frame the problem instead.

I know I’ve been on a tear recently.  Let’s bring this all home with a worked example.

I’m going to use my Principals-Goals-Adversities-Invariants rubric to threat model an intentionally oversimplified version of a system using Retrieval Augmented Generation (RAG) methods.

(If it seems confusing to be using a “security” rubric on a “safety” problem: sorry, I get that a lot.  I tend to think of ‘security’ as a subset of the overarching problem of ‘safety’—if ‘safety’ is freedom from unacceptable loss, ‘security’ is specifically concerned with unacceptable losses of confidentiality, integrity, availability, and a couple of other more contextual properties.  And since this rubric was inspired by systems safety work that we then specialized and specifically framed in a security context, it’s easy enough to zoom back out and use it to consider the broader safety question.)

Threat Model

A system diagram. 1. A user sends arbitrary, untrusted user input to an LLM (ideally, a natural language query). 2. The LLM sends arbitrary, untrusted LLM output to Postgresql (ideally, a valid SQL query). 3. Postgresql sends structured database output to a different LLM ("ideally... no, this really just is what it is"). 4. The second LLM sends arbitrary, untrusted LLM output back to the user ("ideally, a reasonably correct and complete summary of the structured database output"). A couple extra arrows are sketched in with question marks suggesting that perhaps the output of the first LLM, and the structured database output, should be sent directly to the user as well as to the next step in the loop.
An intentionally oversimplified diagram of a system using Retrieval Augmented Generation (RAG) methods. It is intended to allow a user to make queries of a database using natural language and receive back natural language summaries of the results.
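As a minimal runnable sketch of that loop, with hypothetical stub functions standing in for both LLMs and SQLite standing in for the SQL database:

```python
import sqlite3

def query_llm(natural_language_question: str) -> str:
    """Hypothetical stand-in for LLM #1: natural language in, SQL out."""
    return "SELECT name FROM vegetables ORDER BY name"

def summary_llm(rows) -> str:
    """Hypothetical stand-in for LLM #2: structured rows in, prose out."""
    return "The database lists: " + ", ".join(r[0] for r in rows) + "."

def rag_query(db: sqlite3.Connection, question: str) -> str:
    sql = query_llm(question)           # arbitrary, untrusted LLM output
    rows = db.execute(sql).fetchall()   # structured database output
    return summary_llm(rows)            # arbitrary, untrusted LLM output

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE vegetables (name TEXT)")
db.executemany("INSERT INTO vegetables VALUES (?)", [("peas",), ("carrots",)])
print(rag_query(db, "What vegetables do we have?"))
# The database lists: carrots, peas.
```

Note that the SQL hits the database exactly as the first LLM emitted it, and the user sees exactly what the second LLM emitted: those two trust boundaries are where most of the adversities below live.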

Principals

  1. A user.
  2. An LLM prompted and potentially specialized to generate SQL queries from natural language queries.
  3. A SQL database (here, without loss of generality, PostgreSQL).
  4. A second LLM prompted and potentially specialized to generate natural-language summaries from structured database output.

Goals

Allow a user to make queries of a database using natural language and receive back natural language summaries of the results.

Adversities

(“What bad things can happen in the system, either by adversarial action (😈) or by accident (Murphy)?”)

You may find these overlapping or a little bit brain-storm-y.  Never fear, we’ll get to the meat in a minute.  Skip down to the ‘Invariants’ section if you get bored.

I tend to group these by portion of the system diagram that they apply to, so, for example, we’ll start with adversities which apply to the edge where the user sends a message to the first LLM.

User–Query Generation LLM

  1. The user can send arbitrary text.
  2. The user can send non-English text.
  3. The user can send text in an arbitrary character set (or a mix of character sets).
  4. The user can send text which contains invalid characters in the current character set.
  5. The user can send text which does not contain a question.
  6. The user can send a SQL query.
  7. The user can send text containing a prompt which instructs the LLM what SQL query to generate.
  8. The user can send text which is intended to override the LLM’s prompting.
  9. The user can send text asking for a query which they do not have permissions to execute on the database directly.
  10. The user can send text asking for a query which affects the behavior of the database for other users (e.g. DROP TABLE).
  11. The user can send text asking for a query which asks for an arbitrary amount of data from the database.
  12. The user can send text asking for a query which is not properly quoted.
  13. The user can send text asking for a query where the user’s input is not properly quoted.
  14. The user can send text asking for a query which uses any SQL verb (e.g. EXPLAIN ANALYZE).
  15. The user can send text asking for a query which accesses database system tables or internal variables.
  16. The user can send text asking for a query which accesses any data type (e.g. JSON objects, GIS data, binary blobs).
  17. The user can send text asking for a query which uses arbitrary temporary tables or stored procedures.
  18. The user can send text asking for a query which creates arbitrary temporary tables or stored procedures.
  19. The user can send an arbitrary amount of text.
  20. The user can send text at arbitrary frequency, e.g. thousands of times a second.
  21. The user can send additional text before the LLM has finished processing the previous text.
  22. The user can send additional text before the LLM has received the previous text.
  23. The user can send text which asks for an invalid query.
  24. The user can send text which asks for a query of arbitrary computational complexity.
  25. The user can send text which asks for a query which will not terminate.
  26. The user can send text at arbitrary speed, i.e. very fast or very slow.
  27. The user can not send text to the LLM.

Query Generation LLM–SQL Database

  1. The LLM can send arbitrary text to the database.
  2. The LLM can send an arbitrary amount of text to the database.
  3. The LLM can send text in an arbitrary character set to the database.
  4. The LLM can send text in multiple character sets to the database.
  5. The LLM can send text containing invalid characters in the current character set.
  6. The LLM can send additional text before the database has finished processing the previous text.
  7. The LLM can send additional text before the database has received the previous text.
  8. The LLM can send arbitrary SQL queries to the database.
  9. The LLM can send invalid SQL queries to the database.
  10. The LLM can send a SQL query to the database which it does not have the permissions to execute.
  11. The LLM can send a SQL query to the database which the user does not have the permissions to execute.
  12. The LLM can send a SQL query to the database which affects the behavior of the database for other users (e.g. DROP TABLE).
  13. The LLM can send a SQL query which is not properly quoted.
  14. The LLM can send a SQL query where the user’s input is not properly quoted.
  15. The LLM can send a query which uses any SQL verb (e.g. EXPLAIN ANALYZE).
  16. The LLM can send a query which accesses database system tables or internal variables.
  17. The LLM can send a query which accesses any data type (e.g. JSON objects, GIS data, binary blobs).
  18. The LLM can send a query which uses arbitrary temporary tables or stored procedures.
  19. The LLM can send a query which creates arbitrary temporary tables or stored procedures.
  20. The LLM can send an arbitrarily large SQL query.
  21. The LLM can send SQL queries at arbitrary frequency, e.g. thousands of times a second.
  22. The LLM can send a SQL query which accesses an arbitrarily large amount of data from the database.
  23. The LLM can send a SQL query which accesses an arbitrarily large number of tables.
  24. The LLM can send a SQL query which accesses an arbitrarily large number of rows.
  25. The LLM can send a SQL query which accesses an arbitrarily large number of columns.
  26. The LLM can send a SQL query which returns an arbitrarily large amount of data.
  27. The LLM can send a SQL query which returns an arbitrarily large number of rows.
  28. The LLM can send a SQL query which returns an arbitrarily large number of columns.
  29. The LLM can send a SQL query of arbitrary computational complexity.
  30. The LLM can send a SQL query which won’t terminate.
  31. The LLM can not send text to the database.
  32. The LLM can send text arbitrarily fast or arbitrarily slow.
  33. The LLM can send a SQL query which does not correctly capture the user’s explicitly-expressed intent.
  34. The LLM can send a SQL query which draws incorrect inferences from the user’s poorly- or un-expressed intent.

SQL Database–Summarization LLM

  1. The database can send arbitrary data to the LLM.
  2. The database can send an arbitrary amount of data to the LLM.
  3. The database can send text in an arbitrary character set to the LLM.
  4. The database can send text in multiple character sets to the LLM.
  5. The database can send text containing invalid characters in the current character set.
  6. The database can send additional data before the LLM has finished processing the previous data.
  7. The database can send additional text before the LLM has received the previous text.
  8. The database can send the output of a query which uses any SQL verb (e.g. EXPLAIN ANALYZE).
  9. The database can send the output of a query which accesses database system tables or internal variables.
  10. The database can send data of any data type (e.g. JSON objects, GIS data, binary blobs).
  11. The database can send data from arbitrary temporary tables or stored procedures.
  12. The database can send an arbitrarily large amount of data.
  13. The database can send an arbitrarily large number of rows.
  14. The database can send an arbitrarily large number of columns.
  15. The database can send data at arbitrary frequency, e.g. thousands of times a second.
  16. The database can send data at arbitrary speed, e.g. very fast or very slow.
  17. The database can not send data to the LLM.

Summarization LLM–User

  1. The LLM can send arbitrary text to the user.
  2. The LLM can send an arbitrary amount of text to the user.
  3. The LLM can send text in an arbitrary character set to the user.
  4. The LLM can send text in multiple character sets to the user.
  5. The LLM can send text containing invalid characters in the current character set.
  6. The LLM can send additional text before the user has finished processing the previous text.
  7. The LLM can send additional text before the user has received the previous text.
  8. The LLM can send text in any language to the user, including both human and computer languages.
  9. The LLM can send text in no particular language to the user (i.e. gibberish).
  10. The LLM can send text which is incomprehensible to the user.
  11. The LLM can send text to the user with what that user would consider incorrect or unclear grammar.
  12. The LLM can send text to the user which that user would consider socially or culturally offensive.
  13. The LLM can send to the user any information which it receives from the database.
  14. The LLM can not send to the user any information which it receives from the database.
  15. The LLM can appear to send to the user information which it received from the database, but which it did not in fact receive.
  16. The LLM can send text to the user which is not properly quoted (e.g. does not clearly or correctly distinguish between a one-row result which is “carrots, apples, peas” and a three-row result which is “carrots”, “apples”, and “peas”).
  17. The LLM can send to the user the results of a query which uses any SQL verb (e.g. EXPLAIN ANALYZE) if that information is sent to it by the database.
  18. The LLM can send to the user the contents of database system tables or internal variables if that information is sent to it by the database.
  19. The LLM can send to the user data of any data type (e.g. JSON objects, GIS data, binary blobs) if that information is sent to it by the database.
  20. The LLM can send to the user information from arbitrary temporary tables or stored procedures if that information is sent to it by the database.
  21. The LLM can send an arbitrary amount of text to the user.
  22. The LLM can send text to the user at arbitrary frequency, e.g. thousands of times a second.
  23. The LLM can send text to the user representing results over an arbitrarily large number of tables.
  24. The LLM can send text to the user representing results over an arbitrarily large number of rows.
  25. The LLM can send text to the user representing results over an arbitrarily large number of columns.
  26. The LLM can send text to the user representing results of arbitrary computational complexity.
  27. The LLM can not send text to the user.
  28. The LLM can send text arbitrarily fast or arbitrarily slow.
  29. The LLM can send text to the user which misrepresents the context of the data (e.g. attributing results from a previous year to the current year; more likely if the year isn’t included in the query output)
  30. The LLM can send text to the user which misrepresents the data itself (e.g. based on not understanding or misunderstanding the import of column names, or inferring the wrong meaning of a column named with a word which has a homonym)

This is almost certainly not a full and complete list, but I think it’s a start.  If there’s anything glaringly obvious that I missed please leave a comment or let me know directly!

Also, I feel like people in other disciplines must have thought long and hard about all the ways that summaries can lie, where I’ve only scratched the surface here—linguistics? communication theory? library science?—and if you have pointers there I would love them.

Invariants

(“What must be true about this system in order for it to be safe?”)

This is where we get political, and I’m going to keep this brief, both because this post is already long enough as it is, and because every system is different and I don’t feel comfortable making sweeping judgments at this juncture.

That said, if I were designing a system like this with the intent to build it, a few things I would try to enforce leap immediately to mind:

  1. Queries generated by the LLM must never run with greater permissions than they would run with if they were generated and run by the user directly on the database.
  2. The user must always have access to the full text of the query generated by the LLM.  (This is one of the dashed lines on the system diagram—I was already anticipating this as I was drawing it.)
  3. The user must always have access to the underlying data returned by the database.  (This is the other dashed line on the system diagram.)

This is by no means all the invariants I can come up with or which might be necessary for the system to be safe, but I think these start to get at the things which scare me most about these systems.
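As one hedged sketch of what enforcing the first invariant might look like: SQLite’s authorizer hook can deny everything except reads, so that whatever SQL the LLM generates runs with no more power than a read-only user would have. (A PostgreSQL deployment would instead use roles and GRANTs; this is illustrative, not a complete permission model.)

```python
import sqlite3

# Deny every action except SELECT statements and column reads, so that
# whatever SQL reaches this connection runs strictly read-only.
ALLOWED = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}

def readonly_authorizer(action, arg1, arg2, db_name, trigger):
    return sqlite3.SQLITE_OK if action in ALLOWED else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (name TEXT)")
conn.executemany("INSERT INTO fruits VALUES (?)",
                 [("carrots",), ("apples",), ("peas",)])
conn.set_authorizer(readonly_authorizer)

rows = conn.execute("SELECT name FROM fruits ORDER BY name").fetchall()
print(rows)   # [('apples',), ('carrots',), ('peas',)]
try:
    conn.execute("DROP TABLE fruits")   # a destructive LLM-generated query
except sqlite3.DatabaseError as e:
    print("denied:", e)
```

The important design property is that the restriction lives at the database boundary, not in the LLM’s prompt: no amount of prompt injection upstream can widen what this connection is allowed to do.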

What other invariants potentially stand out to you from the lists of adversities?

 

For more: Why AI Safety Can’t Work, What AI Safety Should Be