To return, with much less swearing, to the topic of my drunken ramble of a blog post:
(It’s not particularly essential reading for this; I wouldn’t bother if you haven’t read it already.)
After I posted that, my good friend and former colleague Rachel Silber (who is looking for work; you should hire her) pointed me to this paper titled “System Safety and Artificial Intelligence”, which, of everything I’ve seen so far, feels most obviously headed in the right direction.
It draws on work that regular readers of this blog will know I’m always on about: that of Prof. Nancy Leveson of MIT in the field of systems safety, which applies principles of feedback systems and control theory to the problem of safety in complex systems. And while that sounds kind of heady and probably too mathematical, one of its core insights is that humans are an essential part of any safety control system, and the most adaptable part.
Prof. Leveson has done the best job of anybody in the field of collecting and condensing its insights and expressing them for a lay audience, most recently in her new book aptly titled An Introduction to Systems Safety Engineering. (And where that is the deep-dive, for an essential tasting course of a high-level overview, check out “How Complex Systems Fail” by Dr. Richard Cook.)
Prof. Leveson actually has a computer science and software engineering background, and her early work largely concerned the role of software in safety-critical systems. Among other accomplishments, Prof. Leveson served on the board investigating the loss of the Space Shuttle Columbia; wrote the seminal paper on the Therac-25 medical radiation device accidents, some of which were fatal and in which software played a critical role; and with her lab wrote the spec for the TCAS II system, the Traffic Alert and Collision Avoidance System, mark 2, which is responsible for airplanes not crashing into each other in midair. She also has a wicked dry sense of humor. I cannot recommend her work highly enough.
In terms of the application of all this theory to the safety of systems involving “AI”, here are my initial takeaways:
* The safety of a system cannot be considered in isolation from its intended use. In particular, we can’t consider the safety of the “AI” parts of our system separately from the safety of the system as a whole, and this is a mistake that I think everybody I know in the public conversation around AI safety is making right now.
(Internally I know that the conversation at the AI companies I have any significant acquaintance with is further along than the conversation on e.g. Twitter. But I do usually feel like the conversation is stuck in the weeds a bit and could substantially benefit from a systems level perspective, even if it likely feels naïve at first to folks who spend all day hands-on with these systems. Please bear with us—we have to build the complexity gradually.)
To put it a little fancifully, we have so far been trying to make an AI safe from turning us all into paperclips in the abstract, rather than in the context of an AI which has been put in charge of a paperclip factory. (Which raises the question of why an AI has been put in charge of a paperclip factory in the first place.)
The way that the paper phrases this is “Shift Focus from Component Reliability to System Hazard Elimination”.
So, here, Prof. Leveson and Dr. Cook and all the others would say (and I agree) that we have to consider the safety of any particular use of AI in the context of the specific chatbot or whatever it is we are trying to build, rather than in the abstract.
And I think that as more people come into “AI safety” in the context of particular projects at particular companies we are already headed in this direction. But I want to call this out specifically, and point it out as a place where the public “AI safety” conversation is headed in a very different and much less constructive direction. And conversely the Retrieval Augmented Generation folks are well on their way in the right direction, using LLMs as one component of the system but constraining their behavior significantly.
* Systems incorporating AI are fundamentally amenable to all the same kinds of systems understanding and safety analysis that we already do on systems involving humans and computers. I don’t feel like I can quite justify this yet, and it’s another point of departure from the public AI safety conversation (no wonder they’re so wound up), but even if it is something we’re going to have to take as a point of doctrine or maybe even faith, I do strongly believe it to be true. The paper authors certainly believe it too—the whole existence of their paper is predicated on it.
If all this is true, then, this all comes down to something much like the systems-informed threat modeling methods we might use to understand, and design to protect ourselves against, any other kind of unacceptable loss in any other kind of system.
* First we have to crisply define what “the system” is and how all its parts interact, or at least how we’ve designed them to interact.
* The second step is to crisply define what our goals are for the system, as its builders.
* Systems safety defines ‘safety’ as “freedom from unacceptable loss,” and so the third step of the safety process is to crisply define what our unacceptable losses are. (And, probably, first, to whom they’re unacceptable—starting from “us, the builders of the system” and growing outwards.)
(I joked in my rant that currently the only real unacceptable loss we’re worried about with current AI chatbots is the unacceptable loss of them saying something which embarrasses their Northern Californian designers in front of their peers. And it is a bit of an unfair thing to say, I think, but also I do believe that it captures something more true than not about where we are at with our thinking on unacceptable losses in AI and “to whom?” the losses are unacceptable. We can do better.)
* From here we enumerate what bad things can happen to the system, which could potentially lead to an unacceptable loss.
* Doing any analysis here is going to force us to grapple with what the capabilities of our “AI” system are, and how they interact with and affect the rest of the system. Folks have been getting hung up because we don’t know what all the capabilities of LLMs are, for example, and people keep turning up new ones.
Here I think is particularly where zooming out to the systems perspective pays dividends, because if we black-box the LLM as a system which takes arbitrary text from a user as input and produces arbitrary text to a user as output, and consider it in the context of certain unacceptable losses we’re concerned about (e.g. the user perceives the output text as racist), then we quickly see that we lack the kinds of sensors and actuators we might need to constrain these interactions.
Just as quickly, however, it becomes obvious where some places to intervene in the system might be: something on the output, or, just as importantly, something on the input. Those are basically the only two places it’s possible to intervene in a system this simple, without adding additional feedback loops to influence, say, the user’s perception (i.e. their internal state) or the internal state of the LLM itself.
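To make the black-box framing concrete, here is a minimal sketch of those two intervention points in code. Everything here is illustrative—`guarded_chat`, the guard functions, and the fallback text are all made up for this example, not any real API—but it shows how the safety constraints live at the system boundary rather than inside the box:

```python
from typing import Callable

def guarded_chat(
    llm: Callable[[str], str],            # the black-boxed model: arbitrary text in, arbitrary text out
    input_guard: Callable[[str], bool],   # intervention point 1: what is allowed to reach the LLM
    output_guard: Callable[[str], bool],  # intervention point 2: what is allowed to reach the user
    fallback: str = "I can't help with that.",
) -> Callable[[str], str]:
    """Wrap an arbitrary-text-in / arbitrary-text-out component so that
    the system's constraints are enforced at its only two simple
    intervention points: the input and the output."""
    def chat(user_text: str) -> str:
        if not input_guard(user_text):
            return fallback
        reply = llm(user_text)
        if not output_guard(reply):
            return fallback
        return reply
    return chat
```

Note that neither guard needs to know anything about the LLM's internals, or even what its full set of capabilities is—which is exactly why this kind of boundary-level constraint is tractable even when the component itself keeps surprising us.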

* Finally, then, we come up with a list of invariants which must be true in order for the system to be safe, and design to enforce them. (The paper phrases this as “Shift from Event-based to Constraint-based Accident Models.”)
One of the key insights of systems safety is that playing whack-a-mole with bad events is a waste of time, energy, and money. Most really unacceptable losses don’t happen because of a single bad event or what we might call a component failure, but because two subsystems, both operating correctly according to their designers’ intent, interact in a way that their designers didn’t foresee.
This can be as simple as one system outputting values in metric and a second system expecting input in imperial. But with careful understanding and design, we can eliminate whole classes of bad events by constraining, or enforcing invariants on, the kinds of interaction which take place (e.g. by requiring and checking that all output is in metric).
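The units example can itself be sketched in the constraint-based style. This is my own illustration, not from the paper—the `Meters` wrapper and function names are invented—but it shows the difference between checking for the bad event after the fact and making the invariant hold by construction:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Meters:
    """A value that is metric by construction; a bare float never crosses
    a subsystem boundary, so 'wrong units' is no longer a possible event."""
    value: float

def from_feet(feet: float) -> Meters:
    # Unit conversion happens exactly once, at the edge of the system.
    return Meters(feet * 0.3048)

def closing_distance(a: Meters, b: Meters) -> Meters:
    # Subsystem interactions only ever see Meters. Passing a bare float
    # (whatever unit its author intended) is a type error that a checker
    # like mypy flags before the system ever runs.
    return Meters(abs(a.value - b.value))
```

The invariant "all interactions are in metric" is now a property of the design, enforced by the type system, rather than something we hope to catch by watching for bad events at runtime.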
I do believe that, if we are willing and able to apply standard safety engineering techniques, and specifically systems safety understandings and techniques, to the questions of safety in particular systems incorporating AI components, then we will in most cases be able to build safe systems.
They may look and act very differently than current AI chatbots, but they will actually work, and work safely—because we have designed them to work, and work safely.
For more: Why AI Safety Can’t Work, Real AI Safety: Threat Modeling a Retrieval Augmented Generation (RAG) System