A Case-Study in Securing LLM Applications From Consumer Reports

This is not my work but some colleagues and friends at Consumer Reports and Include Security have a nice post up about their, uh, security adventures developing an LLM-powered chatbot application. (They independently discovered the vulnerabilities recently published as LLM4Shell at BlackHat Asia.)

Who’s Verifying the Verifier: A Case-Study in Securing LLM Applications

tl;dr:

The code is a little confusing, but it basically executes ( exec() ) all the code except for the last line (as system commands) generated from the LLM and then evaluates ( eval() ) the last line of code (as python). We also notice the sanitize function, which should be doing something to reduce risk; however we found that it only removes spaces and “python” from the beginning of the code, as well as backtick marks around the code:
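The code listing from the original post is not reproduced in this excerpt. As a rough illustration only, here is a hypothetical Python sketch of the pattern the excerpt describes. The names `sanitize` and `run_llm_code` are my own assumptions, not the application's actual code, and the sketch treats every line as Python. The point is that trimming whitespace, backticks, and a leading "python" tag changes nothing about what the code does.

```python
def sanitize(code: str) -> str:
    """Hypothetical sanitizer: trims whitespace, backticks, and a leading 'python' tag."""
    code = code.strip().strip("`").strip()
    if code.startswith("python"):
        code = code[len("python"):]
    return code.strip()


def run_llm_code(llm_output: str):
    """The risky pattern: exec() every line but the last, then eval() the last line."""
    *body, last = sanitize(llm_output).splitlines()
    scope: dict = {}
    exec("\n".join(body), scope)   # arbitrary statements run here
    return eval(last, scope)       # arbitrary expression is evaluated here


# A harmless stand-in for model output; a steered model could emit anything at all.
llm_output = "```python\nimport os\nx = 2\nx + len(os.sep) - 1\n```"

print(sanitize(llm_output))      # the import survives "sanitization" untouched
print(run_llm_code(llm_output))
```

Nothing in this sketch inspects what the generated code imports or calls, which is the gap the excerpt is pointing at.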

And it gets, uh, “better” from there.

Regulate Systems, Not Models: Reaction to CA Senator Scott Wiener’s SB1047

Hot on the heels of all this writing I’ve been doing about AI safety (previously: why the current approach to AI safety is doomed to failure, what AI safety should be, a worked example)—

My friend and South Park Commons colleague Derek Slater passed along California Senate bill SB1047, proposed by my state senator, Scott Wiener.

Since Sen. Wiener is my state senator, I’ve already got a message out to his office to discuss the bill, but having now read the full text in some detail, I wanted to set down some thoughts here for other people as well.

The bill has the mouthful of a title, the “Safe and Secure Innovation for Frontier Artificial Intelligence Systems Act,” and, as I read it, I think it breaks down fairly cleanly into two big parts:

1. Create a Frontier Model Division within the California state Department of Technology, and require developers who are training or intend to train “covered models” to make various certifications about those models’ safety and capabilities to the Frontier Model Division, with various penalties if they don’t, if the certifications are later found to be false, or if an incident occurs.
   - A covered model is defined as a model trained using a quantity of computing power greater than 10^26 integer or floating-point operations in 2024, or a model that could reasonably be expected to have similar performance or “general capability” to such a model.
   - The certifications are particularly concerned with some abstract “hazardous capabilities” which models might possess, and which might make the following specific harms substantially easier to cause than they otherwise would be:
     - “The creation or use of a chemical, biological, radiological, or nuclear weapon in a manner that results in mass casualties.”
     - “At least five hundred million dollars ($500,000,000) of damage through cyberattacks on critical infrastructure via a single incident or multiple related incidents.”
     - “At least five hundred million dollars ($500,000,000) of damage by an artificial intelligence model that autonomously engages in conduct that would violate the Penal Code if undertaken by a human.”
2. Instruct the Department of Technology to commission consultants to create and operate a public cloud platform, “to be known as CalCompute, with the primary focus of conducting research into the safe and secure deployment of large-scale artificial intelligence models and fostering equitable innovation.”
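To get a feel for the scale of the 10^26 threshold, here is a minimal arithmetic sketch. It assumes the common rule of thumb of roughly 6 operations per parameter per training token, which is my assumption and appears nowhere in the bill. The model sizes are hypothetical, and the sketch covers only the compute prong of the definition, not the “similar performance” prong.

```python
COVERED_MODEL_THRESHOLD_OPS = 1e26  # threshold quoted in the bill summary above


def estimated_training_ops(n_parameters: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate (an assumption, not from the bill): ~6 ops per parameter per token."""
    return 6.0 * n_parameters * n_tokens


def exceeds_compute_threshold(n_parameters: float, n_tokens: float) -> bool:
    """True if the estimated training compute is above the bill's numeric threshold."""
    return estimated_training_ops(n_parameters, n_tokens) > COVERED_MODEL_THRESHOLD_OPS


# Hypothetical model sizes, chosen only to land on either side of the line.
print(exceeds_compute_threshold(70e9, 15e12))   # about 6.3e24 estimated ops
print(exceeds_compute_threshold(1e12, 20e12))   # about 1.2e26 estimated ops
```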

I’m going to leave aside the CalCompute part of the bill for the moment, because despite the ostensible focus on safe AI development, my read is that it’s much more of an implementation detail, and one that honestly feels kind of tacked-on to me. Let’s focus here primarily on the Frontier Model Division and the requirements around it.

First of all, I like that we’re at least defining our unacceptable losses here. I find much of the reasoning about how some fact about the model eventually results in one of these losses unclear to the point of fuzzy-headedness, and too indebted to science fiction, but at least we’re putting some kind of stake in the ground about what our cares and concerns are. As I’ve written before, this is necessary for us to actually reason about the safety of systems making use of AI models.

A system diagram. A befuddled system designer looks at a box labeled "AI Model" with unconnected arrows coming into and going out of it marked with question marks. A group of users has another unconnected arrow going out of them, marked with a question mark. Losses labeled "WMDs", "$500M critical infra cyberattack", and "$500M crime" all have unconnected arrows going into them, also labeled with question marks. — A confused system designer surveys the problem space.

The issue here, as I’ve now written about at some length, is the same as the issue with AI safety writ large.

A screenshot of HuggingFace showing the RunwayML distribution of Stable Diffusion v1.5 and some of the files in it. — Some of the files comprising the RunwayML distribution of Stable Diffusion v1.5.

AI models don’t have capabilities. An AI model is just a file or some files on disk. Systems, whether incorporating AI models or not, have capabilities.

An AI model can’t do anything on its own—an AI model can only do things when loaded into a computer’s memory, in the context of a computer program or a system of computer programs which connect it to what we would in a control systems context call sensors and actuators, which allow it to do things. (Receive text or images from a user, display text or images to a user, make API calls, etc.)

Those actuators may be connected to other systems (or people), which may be connected to still other systems (or people), which may eventually allow the AI model to be reductively said to have done a thing, but those sensors and actuators are essential. Without them the model sits inert.
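A minimal sketch of that distinction, under loud assumptions: `toy_model` is a hypothetical stub standing in for a loaded model, and `System`, `send_email`, and the `action: argument` output convention are all invented for illustration. The same model is wired into two systems, and only the one with an actuator attached can be said to do anything.

```python
from typing import Callable, Dict, List

# Stand-in for a loaded model: a pure function from text to text.
Model = Callable[[str], str]


def toy_model(prompt: str) -> str:
    """Hypothetical stub that always returns the same canned text."""
    return "send_email: hello"


class System:
    """Connects a model to a sensor (input text) and a set of named actuators."""

    def __init__(self, model: Model, actuators: Dict[str, Callable[[str], str]]):
        self.model = model
        self.actuators = actuators

    def step(self, sensor_input: str) -> str:
        output = self.model(sensor_input)
        action, _, argument = output.partition(": ")
        handler = self.actuators.get(action)
        if handler is None:
            return "no actuator for %r; nothing happens" % action
        return handler(argument)


outbox: List[str] = []


def send_email(body: str) -> str:
    outbox.append(body)  # stand-in for a real side effect
    return "email queued"


inert = System(toy_model, actuators={})
wired = System(toy_model, actuators={"send_email": send_email})

print(inert.step("hi"))
print(wired.step("hi"))
```

The model's output is identical in both cases; what differs is what the surrounding system lets that output reach.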

A diagram. Two boxes labeled "LLM" and "User". Arrows point from User to LLM and from LLM to User, both labeled with "Arbitrary text". A confused system designer stands in the corner. — That oversimplified system diagram of a prompted text-generation LLM chatbot, once again.

While it’s true that AI developers will talk about their models as having certain kinds of capabilities (e.g. prompted text generation, prompted image generation, image-to-image generation) and as being better or worse at those capabilities on some benchmark than other models, these are only capabilities within the assumed context of some system, e.g. a fairly simple system which feeds them text prompts or image prompts and evaluates their output.

I forgot to work my joke about paperclip-maximizer AIs into the first diagram above, and am not attached enough to it to go back and include it, but this gets back to the idea I’ve mentioned here before that if we don’t want an AI model to turn us all into paperclips, one really easy step to take is to not put it in charge of a paperclip factory. (And of course when I go look it up I find that the LessWrong people have decided to rename the concept with a more abstract and less emotionally salient name, haven’t they.)

This may all seem a bit pedantic or legalistic, but we are talking about this in the context of a proposed law. Anyway, my proposed change to the bill is very simple: We should reword it to talk about constraints on, and the compliance of, systems incorporating frontier AI models, rather than the models themselves.

Now hold on a second, you might say. Doesn’t this bring under the jurisdiction of the Frontier Model Division many, many more companies than just frontier model providers like OpenAI and Anthropic? Potentially including anyone deploying their own ChatGPT-powered chatbot? And yes, it does.

An evil black-hatted system designer stands beside a system they have built, where Cron once a day runs a Perl script which prints the prompt, "Today's date is $DATE. Is today Valentine's Day?" to an LLM, which outputs text to another Perl script which uses a regular expression to check if it contains the word 'yes', and, if it does, sends a signal to the serial port to detonate a bomb. — An evil system designer builds an evil LLM-containing system. (Please forgive all the syntax errors in the code, these text boxes are small and it’s been a long time since I wrote much Perl. You get the point, I hope.)

The only way that OpenAI or Anthropic could even begin to consider making the assertions that the Frontier Model Division wants them to make is if they understand and control very specifically what systems the users of their models incorporate them into.

One can imagine the above ludicrous but illustrative scenario, where an evil system designer has built a system which asks an LLM “is it Valentine’s Day,” and, if the LLM says it is, sets off a nuclear bomb. (Surely a violation of the California Penal Code worth $500M at least.) Clearly in this case the AI model can in some sense be said to have set off the bomb. And certainly it would be reported in the news as such.

There’s just no way for the frontier model provider to know what the downstream effect of an otherwise-innocuous query is in such a scenario or to make the required assertions about it. The people who can knowledgeably make the required assertions about the effects the AI models’ output could have are the people integrating those AI models into specific systems.
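The Valentine's Day scenario can be sketched in a few lines of Python. Everything here is hypothetical: `stub_llm` stands in for a hosted model, `provider_view` for what the provider can observe, and the actuator is a harmless callback. The benign widget and the evil system send the provider exactly the same prompt.

```python
import datetime
import re


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a hosted LLM; answers by simple string matching."""
    return "Yes, it is." if "-02-14" in prompt else "No, it is not."


def provider_view(prompt: str) -> str:
    """All the model provider observes: a prompt comes in, text goes out."""
    return stub_llm(prompt)


def make_prompt(today: datetime.date) -> str:
    return f"Today's date is {today.isoformat()}. Is today Valentine's Day?"


def calendar_widget(today: datetime.date) -> str:
    """A benign downstream system: decides whether to decorate a page."""
    answer = provider_view(make_prompt(today))
    return "show hearts" if re.search(r"\byes\b", answer, re.IGNORECASE) else "show nothing"


def evil_system(today: datetime.date, actuator) -> None:
    """A malicious downstream system: same prompt, very different consequence."""
    answer = provider_view(make_prompt(today))
    if re.search(r"\byes\b", answer, re.IGNORECASE):
        actuator()  # the consequence lives here, outside the provider's view


fired = []
valentines = datetime.date(2024, 2, 14)
evil_system(valentines, actuator=lambda: fired.append("signal sent"))
print(calendar_widget(valentines), fired)
```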

This does not put the foundation model providers out of scope—all of them operate systems like the chatbot system above which connect their AI model to users and potentially a variety of other sensors and actuators. And perhaps scenarios such as my Valentine’s Day bomb scenario fall outside the standard of reasonableness.

But by bringing organizations deploying AI models into scope, and regulating them at the system level rather than the model level, we refocus the legislation on the problem we’re actually trying to solve and the people best equipped to solve it.

Now the company integrating LLM generation of genetic sequences into its drug-synthesis pipeline needs to assert that they can’t accidentally generate a bioweapon, not just the LLM provider. Now the company integrating AI models into their cybersecurity defense platform needs to assert that they can’t accidentally DDoS critical infrastructure, not just the company from whom they bought the trained model.

Of course the response to such regulation might be for the frontier model providers to indemnify their customers and take on this responsibility themselves, as some have already discussed doing for copyright infringement. Such a choice would almost certainly lead, eventually, to strong contractual and technical controls on what uses the models could be put to and how they could be integrated into larger systems.

This still puts the focus where it most needs to be, and in fact where it must necessarily be—at the point of use.

I’ll say it again: AI models and methods cannot be made safe in the abstract. They can only be made safer in the context of particular systems. The more we understand and leverage this fact, the safer our AI-incorporating systems will be.

Real AI Safety: Threat Modeling a Retrieval Augmented Generation (RAG) System

Or, AI Safety For Cynics With Deadlines (with apologies to the Django Project).

Previously: Why AI safety is impossible as the problem is currently framed, how to better frame the problem instead.

I know I’ve been on a tear recently. Let’s bring this all home with a worked example.

I’m going to use my Principals-Goals-Adversities-Invariants rubric to threat model an intentionally oversimplified version of a system using Retrieval Augmented Generation (RAG) methods.

(If it seems confusing to be using a “security” rubric on a “safety” problem: sorry, I get that a lot. I tend to think of ‘security’ as a subset of the overarching problem of ‘safety’—if ‘safety’ is freedom from unacceptable loss, ‘security’ is specifically concerned with unacceptable losses of confidentiality, integrity, and availability and a couple other more contextual properties. And since this rubric was inspired by systems safety work that we specialized and specifically framed in a security context, it’s easy enough to zoom back out and use the rubric to consider the broader safety question.)

Threat Model

A system diagram. 1. A user sends arbitrary, untrusted user input to an LLM (ideally, a natural language query). 2. The LLM sends arbitrary, untrusted LLM output to Postgresql (ideally, a valid SQL query). 3. Postgresql sends structured database output to a different LLM ("ideally... no, this really just is what it is"). 4. The second LLM sends arbitrary, untrusted LLM output back to the user ("ideally, a reasonably correct and complete summary of the structured database output"). A couple extra arrows are sketched in with question marks suggesting that perhaps the output of the first LLM, and the structured database output, should be sent directly to the user as well as to the next step in the loop. — An intentionally oversimplified diagram of a system using Retrieval Augmented Generation (RAG) methods. It is intended to allow a user to make queries of a database using natural language and receive back natural language summaries of the results.

Principals

A user.
An LLM prompted and potentially specialized to generate SQL queries from natural language queries.
A SQL database (here, without loss of generality, Postgresql)
A second LLM prompted and potentially specialized to generate natural-language summaries from structured database output.

Goals

Allow a user to make queries of a database using natural language and receive back natural language summaries of the results.
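To make the four principals concrete, here is a minimal sketch of the loop in the diagram. It makes several assumptions: Python's built-in `sqlite3` stands in for PostgreSQL, `query_llm` and `summary_llm` are hypothetical stubs returning canned output rather than real model calls, and the `produce` table is invented for illustration.

```python
import sqlite3
from typing import List, Tuple


def query_llm(question: str) -> str:
    """Hypothetical stand-in for the query-generation LLM."""
    return "SELECT name, qty FROM produce ORDER BY qty DESC"


def summary_llm(rows: List[Tuple]) -> str:
    """Hypothetical stand-in for the summarization LLM."""
    return f"{len(rows)} rows; the largest is {rows[0][0]} ({rows[0][1]})."


def answer(conn: sqlite3.Connection, question: str) -> str:
    sql = query_llm(question)            # arbitrary, untrusted LLM output
    rows = conn.execute(sql).fetchall()  # structured database output
    return summary_llm(rows)             # arbitrary, untrusted LLM output


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE produce (name TEXT, qty INTEGER)")
conn.executemany("INSERT INTO produce VALUES (?, ?)",
                 [("carrots", 3), ("apples", 7), ("peas", 5)])

print(answer(conn, "What do we have the most of?"))
```

Note that nothing in `answer` checks anything at any edge. Each hop simply trusts the previous one, which is exactly the oversimplification the adversities below are meant to probe.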

Adversities

(“What bad things can happen in the system, either by adversarial action (😈) or accident (Murphy).”)

You may find these overlapping or a little bit brain-storm-y. Never fear, we’ll get to the meat in a minute. Skip down to the ‘Invariants’ section if you get bored.

I tend to group these by portion of the system diagram that they apply to, so, for example, we’ll start with adversities which apply to the edge where the user sends a message to the first LLM.

User–Query Generation LLM

The user can send arbitrary text.
The user can send non-English text.
The user can send text in an arbitrary character set (or a mix of character sets).
The user can send text which contains invalid characters in the current character set.
The user can send text which does not contain a question.
The user can send a SQL query.
The user can send text containing a prompt which instructs the LLM what SQL query to generate.
The user can send text which is intended to override the LLM’s prompting.
The user can send text asking for a query which they do not have permissions to execute on the database directly.
The user can send text asking for a query which affects the behavior of the database for other users (e.g. DROP TABLE)
The user can send text asking for a query which asks for an arbitrary amount of data from the database.
The user can send text asking for a query which is not properly quoted.
The user can send text asking for a query where the user’s input is not properly quoted.
The user can send text asking for a query which uses any SQL verb (e.g. EXPLAIN ANALYZE)
The user can send text asking for a query which accesses database system tables or internal variables.
The user can send text asking for a query which accesses any data type (e.g. JSON objects, GIS data, binary blobs)
The user can send text asking for a query which uses arbitrary temporary tables or stored procedures.
The user can send text asking for a query which creates arbitrary temporary tables or stored procedures.
The user can send an arbitrary amount of text.
The user can send text at arbitrary frequency, e.g. thousands of times a second.
The user can send additional text before the LLM has finished processing the previous text.
The user can send additional text before the LLM has received the previous text.
The user can send text which asks for an invalid query.
The user can send text which asks for a query of arbitrary computational complexity.
The user can send text which asks for a query which will not terminate.
The user can send text at arbitrary speed, i.e. very fast or very slow.
The user can fail to send text to the LLM (i.e. send nothing at all).
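A few of the adversities on this edge (arbitrary length, arbitrary frequency, invalid characters, empty input) lend themselves to mechanical checks. Here is a minimal sketch, assuming UTF-8 input and a single user. The limits and the `InputGate` class are hypothetical, and nothing here addresses prompt injection or permissions, which have to be handled further along the loop.

```python
import time
from collections import deque
from typing import Deque, Optional

MAX_CHARS = 2000        # hypothetical length limit
MAX_REQUESTS = 5        # hypothetical: allowed requests per window
WINDOW_SECONDS = 1.0    # hypothetical sliding-window size


class InputGate:
    """Rejects input that is too frequent, undecodable, empty, or too long."""

    def __init__(self) -> None:
        self.recent: Deque[float] = deque()

    def check(self, raw: bytes, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        # Forget requests that have aged out of the sliding window.
        while self.recent and now - self.recent[0] > WINDOW_SECONDS:
            self.recent.popleft()
        if len(self.recent) >= MAX_REQUESTS:
            raise ValueError("too many requests")
        self.recent.append(now)
        text = raw.decode("utf-8")  # raises UnicodeDecodeError (a ValueError) on bad bytes
        if not text.strip():
            raise ValueError("empty input")
        if len(text) > MAX_CHARS:
            raise ValueError("input too long")
        return text


gate = InputGate()
print(gate.check("How many carrots do we have?".encode("utf-8")))
```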

Query Generation LLM–SQL Database

The LLM can send arbitrary text to the database.
The LLM can send an arbitrary amount of text to the database.
The LLM can send text in an arbitrary character set to the database
The LLM can send text in multiple character sets to the database
The LLM can send text containing invalid characters in the current character set.
The LLM can send additional text before the database has finished processing the previous text.
The LLM can send additional text before the database has received the previous text.
The LLM can send arbitrary SQL queries to the database.
The LLM can send invalid SQL queries to the database.
The LLM can send a SQL query to the database which it does not have the permissions to execute.
The LLM can send a SQL query to the database which the user does not have the permissions to execute.
The LLM can send a SQL query to the database which affects the behavior of the database for other users (e.g. DROP TABLE)
The LLM can send a SQL query which is not properly quoted.
The LLM can send a SQL query where the user’s input is not properly quoted.
The LLM can send a query which uses any SQL verb (e.g. EXPLAIN ANALYZE)
The LLM can send a query which accesses database system tables or internal variables.
The LLM can send a query which accesses any data type (e.g. JSON objects, GIS data, binary blobs)
The LLM can send a query which uses arbitrary temporary tables or stored procedures
The LLM can send a query which creates arbitrary temporary tables or stored procedures
The LLM can send an arbitrarily large SQL query.
The LLM can send SQL queries at arbitrary frequency, e.g. thousands of times a second.
The LLM can send a SQL query which accesses an arbitrarily large amount of data from the database.
The LLM can send a SQL query which accesses an arbitrarily large number of tables.
The LLM can send a SQL query which accesses an arbitrarily large number of rows.
The LLM can send a SQL query which accesses an arbitrarily large number of columns.
The LLM can send a SQL query which returns an arbitrarily large amount of data.
The LLM can send a SQL query which returns an arbitrarily large number of rows.
The LLM can send a SQL query which returns an arbitrarily large number of columns.
The LLM can send a SQL query of arbitrary computational complexity.
The LLM can send a SQL query which won’t terminate.
The LLM can fail to send text to the database (i.e. send nothing at all).
The LLM can send text arbitrarily fast or arbitrarily slow.
The LLM can send a SQL query which does not correctly capture the user’s explicitly-expressed intent.
The LLM can send a SQL query which draws incorrect inferences from the user’s poorly- or un-expressed intent.
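Several adversities on this edge (destructive statements, non-terminating queries, unbounded result sizes) can be constrained at the database connection rather than by inspecting the SQL text. The sketch below again uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, where the rough equivalents would be a role with limited grants and a `statement_timeout` setting. The limits, the table, and the helper names are hypothetical.

```python
import sqlite3
import time

MAX_ROWS = 100      # hypothetical cap on rows handed to the next stage
MAX_SECONDS = 0.2   # hypothetical time budget per query


def select_only(action, arg1, arg2, db_name, source):
    """Authorizer: permit plain SELECTs and column reads, deny everything else."""
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def run_untrusted(conn, sql):
    """Run LLM-generated SQL under a time budget and a row cap."""
    deadline = time.monotonic() + MAX_SECONDS
    # A nonzero return value from the progress handler aborts the running query.
    conn.set_progress_handler(lambda: int(time.monotonic() > deadline), 1000)
    try:
        return conn.execute(sql).fetchmany(MAX_ROWS)
    finally:
        conn.set_progress_handler(None, 1000)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(200)])
conn.commit()
conn.set_authorizer(select_only)  # installed only after the trusted setup is done

print(len(run_untrusted(conn, "SELECT x FROM t")))  # capped at MAX_ROWS
for bad in ("DROP TABLE t",
            "SELECT a.x FROM t a, t b, t c, t d, t e "
            "WHERE a.x + b.x + c.x + d.x + e.x < 0"):
    try:
        run_untrusted(conn, bad)
    except sqlite3.DatabaseError as exc:
        print("rejected:", exc)
```

This does nothing for the last two adversities in the list: a query can be perfectly safe to run and still not be the query the user meant.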

SQL Database–Summarization LLM

The database can send arbitrary data to the LLM.
The database can send an arbitrary amount of data to the LLM.
The database can send text in an arbitrary character set to the LLM
The database can send text in multiple character sets to the LLM
The database can send text containing invalid characters in the current character set.
The database can send additional data before the LLM has finished processing the previous data.
The database can send additional data before the LLM has received the previous data.
The database can send the output of a query which uses any SQL verb (e.g. EXPLAIN ANALYZE)
The database can send the output of a query which accesses database system tables or internal variables.
The database can send data of any data type (e.g. JSON objects, GIS data, binary blobs)
The database can send data from arbitrary temporary tables or stored procedures
The database can send an arbitrarily large amount of data.
The database can send an arbitrarily large number of rows.
The database can send an arbitrarily large number of columns.
The database can send data at arbitrary frequency, e.g. thousands of times a second.
The database can send data at arbitrary speed, e.g. very fast or very slow.
The database can fail to send data to the LLM (i.e. send nothing at all).

Summarization LLM–User

The LLM can send arbitrary text to the user.
The LLM can send an arbitrary amount of text to the user.
The LLM can send text in an arbitrary character set to the user
The LLM can send text in multiple character sets to the user
The LLM can send text containing invalid characters in the current character set.
The LLM can send additional text before the user has finished processing the previous text.
The LLM can send additional text before the user has received the previous text.
The LLM can send text in any language to the user including both human and computer languages
The LLM can send text in no particular language to the user (i.e. gibberish)
The LLM can send text which is incomprehensible to the user
The LLM can send text to the user with what that user would consider incorrect or unclear grammar
The LLM can send text to the user which that user would consider socially or culturally offensive
The LLM can send to the user any information which it receives from the database.
The LLM can withhold from the user any information which it receives from the database.
The LLM can appear to send to the user information which it received from the database, but which it did not in fact receive.
The LLM can send text to the user which is not properly quoted (e.g. does not clearly or correctly distinguish between a one-row result which is “carrots, apples, peas” and a three-row result which is “carrots”, “apples”, and “peas”).
The LLM can send to the user the results of a query which uses any SQL verb (e.g. EXPLAIN ANALYZE) if that information is sent to it by the database.
The LLM can send to the user the contents of database system tables or internal variables if that information is sent to it by the database.
The LLM can send to the user data of any data type (e.g. JSON objects, GIS data, binary blobs) if that information is sent to it by the database.
The LLM can send to the user information from arbitrary temporary tables or stored procedures if that information is sent to it by the database.
The LLM can send an arbitrary amount of text to the user.
The LLM can send text to the user at arbitrary frequency, e.g. thousands of times a second.
The LLM can send text to the user representing results over an arbitrarily large number of tables.
The LLM can send text to the user representing results over an arbitrarily large number of rows.
The LLM can send text to the user representing results over an arbitrarily large number of columns.
The LLM can send text to the user representing results of arbitrary computational complexity.
The LLM can fail to send text to the user (i.e. send nothing at all).
The LLM can send text arbitrarily fast or arbitrarily slow.
The LLM can send text to the user which misrepresents the context of the data (e.g. attributing results from a previous year to the current year; more likely if the year isn’t included in the query output)
The LLM can send text to the user which misrepresents the data itself (e.g. based on not understanding or misunderstanding the import of column names, or inferring the wrong meaning of a column named with a word which has a homonym)
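The quoting adversity above (one row of “carrots, apples, peas” versus three separate rows) is easy to show concretely. A minimal sketch: `flatten` is a hypothetical stand-in for the kind of lossy rendering a prose summary can produce, and `structured` uses JSON as one possible lossless rendering.

```python
import json

one_row = [("carrots, apples, peas",)]
three_rows = [("carrots",), ("apples",), ("peas",)]


def flatten(rows):
    """Lossy rendering: joins every value with commas, losing row boundaries."""
    return ", ".join(str(value) for row in rows for value in row)


def structured(rows):
    """Lossless rendering: JSON keeps row and field boundaries explicit."""
    return json.dumps([list(row) for row in rows])


print(flatten(one_row) == flatten(three_rows))  # the two different results collide
print(structured(one_row))
print(structured(three_rows))
```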

This is almost certainly not a full and complete list, but I think it’s a start. If there’s anything glaringly obvious that I missed please leave a comment or let me know directly!

Also I feel like people in other disciplines should have thought long and hard about all the ways that summaries can lie, where I only scratched the surface here—linguistics? communication theorists? library science?—and if you have pointers there I would love them.

Invariants

(“What must be true about this system in order for it to be safe?”)

This is where we get political, and I’m going to keep this brief both because this post is already long enough as it is, and also because every system is different and I don’t think I feel comfortable making sweeping judgments at this juncture.

That said, if I were designing a system like this with the intent to build it, a few things I would try to enforce leap immediately to mind:

Queries generated by the LLM must never run with greater permissions than they would run with if they were generated and run by the user directly on the database.
The user must always have access to the full text of the query generated by the LLM. (This is one of the dashed lines on the system diagram—I was already anticipating this as I was drawing it.)
The user must always have access to the underlying data returned by the database. (This is the other dashed line on the system diagram.)
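One way these three invariants might show up in code is in the shape of the response itself. The sketch below is an assumption-laden illustration, not a complete design: `Answer` and `answer_with_evidence` are hypothetical names, the two LLMs are passed in as plain callables, and `sqlite3` stands in for PostgreSQL. The connection is assumed to have been opened with the user's own database credentials.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Answer:
    sql: str            # second invariant: the full text of the generated query
    rows: List[Tuple]   # third invariant: the underlying data, unsummarized
    summary: str        # the LLM's summary, to be read against the two fields above


def answer_with_evidence(user_conn: sqlite3.Connection,
                         question: str,
                         generate_sql: Callable[[str], str],
                         summarize: Callable[[List[Tuple]], str]) -> Answer:
    # First invariant: the query runs on the user's own connection, so it can
    # do no more than the user could do directly (assumes the connection was
    # opened with that user's credentials).
    sql = generate_sql(question)
    rows = user_conn.execute(sql).fetchall()
    return Answer(sql=sql, rows=rows, summary=summarize(rows))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE produce (name TEXT)")
conn.executemany("INSERT INTO produce VALUES (?)", [("carrots",), ("apples",), ("peas",)])

result = answer_with_evidence(
    conn,
    "How many kinds of produce are there?",
    generate_sql=lambda q: "SELECT count(*) FROM produce",   # hypothetical stub LLM
    summarize=lambda rows: f"There are {rows[0][0]} kinds.",  # hypothetical stub LLM
)
print(result.sql)
print(result.rows)
print(result.summary)
```

Returning all three together means the user can always check the summary against the query and the raw rows, rather than having to take the summary on faith.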

This is by no means all the invariants I can come up with or which might be necessary for the system to be safe, but I think these start to get at the things which scare me most about these systems.

What other invariants potentially stand out to you from the lists of adversities?

For more: Why AI Safety Can’t Work, What AI Safety Should Be

Why AI Safety Can’t Work

At least AI safety the way that people talk about it with me on social media and in person.

I’ve had this conversation often enough online and in person that I believe that I am not in a meaningful sense attributing a strawman framing of the problem to people who are generally interested in the space. (I.e. I believe this is the steelman or best possible framing of the problem commonly in use.)

If there are better framings out there, please let me know! I would love to be wrong about this. In particular I hope that there are specialists out there who are further along than this.

But:

The standard way that people on social media frame the problem goes something like, “we believe that, in the future, we may create an AI system that is so far advanced beyond what we are currently capable of that it may have it within its power to destroy all of humanity; how do we ensure that it can’t?”

And the answer is: ???

(I am honestly not aware of a good, concise summary of what the best current thinking about answers is that I could point you to as the steelman here. I am, however, aware that many, many pixels have been spilled discussing this online and increasingly in more respectable places. That even a rough consensus is not more evident is, in fact, I believe, a symptom of what I’m about to describe. Again, pointers to others’ work in this space welcomed.)

I submit to you that the reason the answers are so hard to come by and so unsatisfying when they do come is that the question assumes its own conclusion. To ask the question, in this way, is to answer it. And the answer is: we can’t. But this is not the right question.

Let’s unpack this.

In Western Christian philosophy, the Christian God is often defined as a unique being who is omniscient, omnipotent, and omnibenevolent. (Often called the “three-O” conception of God; sometimes ‘omnipresent’ is substituted for ‘omnibenevolent’. I link to the site for illustrative purposes only; no endorsement is expressed or implied.)

This leads to a variety of paradoxes which Christian philosophers and people who like to argue with Christian philosophers have debated over the millennia.

The Problem of Evil is the most obvious one, to a liberal Christian, or at least the one that bothered me most growing up a liberal Christian. Very roughly, if God wants only good things to happen (omnibenevolent), and God has the full knowledge of everything (omniscient) and the power to do anything (omnipotent), why do not-good things happen in the world?

People have explored a variety of answers over the years, and there’s almost certainly a Master of Divinity degree with your name on it if you want to answer it yourself. I am not going for an M.Div., and I’m certainly not going to solve it here; I offer it to illustrate what all the different words mean.

Getting back to our conception of our AI, and comparing it to our conception of the Christian God, I assert that, without substantially constraining what “so far advanced beyond what we are currently capable of” means, current AI safety proponents are, in fact, arguing that this future AI will be effectively omniscient and omnipotent.

So, the good news, there’s no Problem of Evil here, the AI isn’t omnibenevolent! (That is, in fact, kind of the problem—but I’m getting ahead of myself.)

Unfortunately there are other contradictions inherent in ‘omniscience’ and ‘omnipotence’. One is literally called the Omnipotence Paradox—the old, “Can God create a stone so big that He can’t lift it?”

Fortunately our concerns are more practical. Given that we might create an omnipotent and omniscient entity, how do we ensure that it can’t destroy us?

And again, the answer is, we can’t. It’s right there in the framing. If this entity is omnipotent, then it, definitionally, can destroy us, and there’s no way to ensure that it can’t, because, definitionally, it can.

(It doesn’t even need to be omniscient to do so!)

Now, there are two ways out of this thicket.

One is to relent and put some pretty significant, but realistic, constraints on the potential future capabilities of this AI… let’s call it an AI system, rather than an entity. What are its inputs, what are its outputs, who built it and is therefore responsible for it, who pays its power bills, how many backhoes does it take to sever its connection to the public internet.

The other is to relax our goal. What if, instead of ensuring that our omnipotent AI entity can’t destroy us, we merely try to ensure that our omnipotent AI entity doesn’t want to destroy us?

And the answer, again, is that we can’t ensure that our omnipotent AI entity wants any particular thing, because, again, it is omnipotent, and an omnipotent entity is capable of wanting whatever it wants, definitionally.

For some outside force, like humanity as a whole, or at least AI safety practitioners, to constrain its power like that, even optionally, is definitionally impossible. That’s why the definition of the Christian God needs the additional, explicit omnibenevolence constraint.

We might as well make burnt sacrifices on Mount Moriah. (Notwithstanding that there’s another God who lays claim to that particular ritual spot.) It would certainly be more emotionally satisfying.

And, because omnipotence is inherently contradictory, given that we start with it as a premise, we can prove anything we want. We can also disappear up our own navels.

So. We have to significantly constrain our understanding of the capabilities of our future AI system before we can think, or act, meaningfully to ensure its safety, and the safety of anything or anyone that it interacts with.

We are not going to create God, or an omnipotent AI, because God, and an omnipotent AI, cannot exist—God, as here defined, and an omnipotent AI, again definitionally.

And if there ever comes to be convincing evidence that we have created an omnipotent AI, or a God, well. Go, fetch a ram from the flock as offering, without defect and of the proper value. It’s the only way.

(Inb4: “But what about a finite entity which is so much more advanced than us as to be effectively omnipotent relative to us?” You just said ‘omnipotent’ but using more words. It rounds to the same thing. We need to define constraints.)

For more: What AI Safety Should Be, Real AI Safety: Threat Modeling a Retrieval Augmented Generation (RAG) System

What AI Safety Should Be

To return to the topic of my drunken ramble of a blog post with much less swearing:

(It’s not particularly essential reading for this, I wouldn’t bother if you haven’t read it already.)

After I posted that, my good friend and former colleague Rachel Silber (who is looking for work, you should hire her) pointed me to this paper titled “System Safety and Artificial Intelligence”, which, I think, of everything I’ve seen so far, feels like it is most obviously headed in the right direction.

It is drawing on some work which folks who read my blog regularly will know that I’m always on about, that of Prof. Nancy Leveson of MIT in the field of systems safety, which applies principles of feedback systems and control theory to the problem of safety in complex systems. And while that sounds kind of heady and probably too mathematical, one of its core insights is that humans are an essential part of any safety control system, and the most adaptable part.

Prof. Leveson has done the best job of anybody in the field of collecting and condensing its insights and expressing them for a lay audience, most recently in her new book, aptly titled An Introduction to System Safety Engineering. (And where that is the deep dive, for a quick tasting course by way of high-level overview, check out “How Complex Systems Fail” by Dr. Richard Cook.)

Prof. Leveson actually has a computer science and software engineering background, and her early work largely concerned the role of software in safety-critical systems. Among other accomplishments, Prof. Leveson served on the board investigating the loss of the Space Shuttle Columbia; wrote the seminal paper about the Therac-25 medical radiation device losses, some of which were fatal, and in which software played a critical role; and with her lab did the spec for the TCAS-II system, the Traffic Collision Avoidance System mark 2, which is responsible for airplanes not crashing into each other in midair. She also has a wicked dry sense of humor. I cannot recommend her work highly enough.

In terms of the application of all this theory to the safety of systems involving “AI”, here are my initial takeaways:

* The safety of a system cannot be considered in isolation from its intended use. In particular, we can’t consider the safety of the “AI” parts of our system separate from the safety of the system as a whole, and this is a mistake that I think everybody I know in the public conversation around AI safety is making right now.

(Internally I know that the conversation at the AI companies I have any significant acquaintance with is further along than the conversation on e.g. Twitter. But I do usually feel like the conversation is stuck in the weeds a bit and could substantially benefit from a systems level perspective, even if it likely feels naïve at first to folks who spend all day hands-on with these systems. Please bear with us—we have to build the complexity gradually.)

To put it a little fancifully, we have so far been trying to make an AI safe from turning us all into paperclips in the abstract, rather than in the context of an AI which has been put in charge of a paperclip factory. (Which raises the question of why an AI has been put in charge of a paperclip factory in the first place.)

The way that the paper phrases this is “Shift Focus from Component Reliability to System Hazard Elimination”.

So, here, Prof. Leveson and Dr. Cook and all the others would say, and I agree with and believe, that we have to consider the safety of any particular use of AI in the context of the specific chatbot or whatever that we are trying to build, rather than in the abstract.

And I think that as more people come into “AI safety” in the context of particular projects at particular companies we are already headed in this direction. But I want to call this out specifically, and point it out as a place where the public “AI safety” conversation is headed in a very different and much less constructive direction. And conversely the Retrieval Augmented Generation folks are well on their way in the right direction, using an LLM as one component of the system but constraining its behavior significantly.

* Systems incorporating AI are fundamentally amenable to all the same kinds of systems understanding and safety analysis that we already do on systems involving humans and computers. I don’t feel like I can quite justify this yet, and it’s another point of departure from the public AI safety conversation (no wonder they’re so stem-wound), but even if it is kind of something we’re going to have to take as a point of doctrine or maybe even faith, I do strongly believe it to be true. The paper authors certainly believe it too—the whole existence of their paper is predicated on it.

If all this is true, then it all comes down to something much like the systems-informed threat modeling methods we might use to understand, and design to protect ourselves against, any other kind of unacceptable loss in any other kind of system.

* First we have to crisply define what “the system” is and how all its parts interact, or at least how we’ve designed them to interact.

* The second step is to crisply define what our goals are for the system, as its builders.

* Systems safety defines ‘safety’ as “freedom from unacceptable loss,” and so the third step of the safety process is to crisply define what our unacceptable losses are. (And, probably, first, to whom they’re unacceptable—starting from “us, the builders of the system” and growing outwards.)

(I joked in my rant that currently the only real unacceptable loss we’re worried about with current AI chatbots is the unacceptable loss of them saying something which embarrasses their Northern Californian designers in front of their peers. And it is a bit of an unfair thing to say, I think, but also I do believe that it captures something more true than not about where we are at with our thinking on unacceptable losses in AI and “to whom?” the losses are unacceptable. We can do better.)

* From here we enumerate what bad things can happen to the system, which could potentially lead to an unacceptable loss.

* Doing any analysis here is going to force us to grapple with what the capabilities of our “AI” system are, and how they interact with and affect the rest of the system. Folks have been getting hung up because we don’t know what all the capabilities of LLMs are, for example, and people keep turning up new ones.

Here, I think, is particularly where zooming out to the systems perspective pays dividends. If we black-box the LLM as a system which takes arbitrary text from a user as input and produces arbitrary text to a user as output, in the context of certain unacceptable losses we’re concerned about (e.g. the user perceives the output text as racist), then we quickly see that we lack the kinds of sensors and actuators we might need to constrain these interactions.

Also, however, it just as quickly becomes obvious where some places to intervene in the system might be (e.g. something on the output, or, just as importantly, something on the input—basically the only two places it’s possible to intervene in a system this simple, without adding additional feedback loops to e.g. influence the user’s perception, i.e. their internal state, or e.g. the internal state of the LLM).
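To make those two intervention points concrete, here is a minimal Python sketch of the black-boxed system. Everything in it is made up for illustration: `GuardedChat`, `echo_model`, and the two checks are hypothetical names, the "model" is just a function from text to text, and the checks are toy stand-ins for whatever sensors a real system would need.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: the model is an opaque text -> text function (the black box).
Model = Callable[[str], str]
# Assumption: a check returns True when the text is acceptable.
Check = Callable[[str], bool]


@dataclass
class GuardedChat:
    model: Model
    input_checks: List[Check]
    output_checks: List[Check]
    refusal: str = "Sorry, I can't help with that."

    def respond(self, user_text: str) -> str:
        # Intervention point 1: inspect the input before the model sees it.
        if not all(check(user_text) for check in self.input_checks):
            return self.refusal
        candidate = self.model(user_text)
        # Intervention point 2: inspect the output before the user sees it.
        if not all(check(candidate) for check in self.output_checks):
            return self.refusal
        return candidate


def echo_model(prompt: str) -> str:
    """Toy stand-in for an LLM; a real one would be a remote call."""
    return f"You said: {prompt}"


def short_enough(text: str) -> bool:
    """Toy input check: reject very long prompts."""
    return len(text) <= 200


def no_banned_words(text: str) -> bool:
    """Toy output check: a placeholder for a real classifier."""
    return "paperclips" not in text.lower()


chat = GuardedChat(
    model=echo_model,
    input_checks=[short_enough],
    output_checks=[no_banned_words],
)
```

Note what the sketch cannot do: nothing in it observes the user's perception or the model's internal state. Both checks see only the text crossing the boundary, which is exactly the limitation described above.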

* Finally, then, we come up with a list of invariants which must be true in order for the system to be safe, and design to enforce them. (The paper phrases this as “Shift from Event-based to Constraint-based Accident Models.”)

One of the key insights of systems safety is that playing whack-a-mole with bad events is a waste of time, energy, and money. Most really unacceptable losses don’t happen because of a single bad event or what we might call a component failure, but because two subsystems, both operating correctly according to their designers’ intent, interact in a way that their designers didn’t foresee.

This can be as simple as one system outputting values in metric and a second system expecting input in imperial. However, we can, with careful understanding and design, eliminate whole classes of bad events, by constraining or enforcing invariants on the kinds of interaction which take place (e.g. by requiring and checking that all output is in metric).
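As a rough sketch of what "requiring and checking that all output is in metric" could look like in code, here is a toy Python example. The message format, the `Length` type, and the `report_altitude` / `read_altitude` functions are all hypothetical, invented for illustration; the point is only that the consumer enforces the invariant at the boundary instead of trusting the producer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Length:
    """A length that is always stored in meters (the invariant)."""
    meters: float

    @classmethod
    def from_feet(cls, feet: float) -> "Length":
        # Conversion happens once, explicitly, at the edge of the system.
        return cls(meters=feet * 0.3048)


def report_altitude(altitude_m: float) -> dict:
    """Hypothetical producer: tags every value with its unit."""
    return {"value": altitude_m, "unit": "m"}


def read_altitude(message: dict) -> Length:
    """Hypothetical consumer: refuses anything not declared as meters."""
    unit = message.get("unit")
    if unit != "m":
        # Fail loudly rather than silently misreading the number.
        raise ValueError(f"expected unit 'm', got {unit!r}")
    return Length(meters=float(message["value"]))


altitude = read_altitude(report_altitude(1200.0))
```

A message tagged `"ft"`, or one with no unit at all, is rejected outright, so the whole class of "right number, wrong unit" interactions is ruled out rather than caught one incident at a time.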

I do believe that, if we are willing and able to apply standard safety engineering techniques, and specifically systems safety understandings and techniques, to the questions of safety in particular systems incorporating AI components, we will in most cases be able to build safe systems.

They may look and act very differently than current AI chatbots, but they will actually work, and work safely—because we have designed them to work, and work safely.

For more: Why AI Safety Can’t Work, Real AI Safety: Threat Modeling a Retrieval Augmented Generation (RAG) System