Why Is It So Hard To Build Safe Software?

Asking aircraft designers about airplane safety:
Hairbun: Nothing is ever foolproof, but modern airliners are incredibly resilient. Flying is the safest way to travel.

Asking building engineers about elevator safety:
Cueball: Elevators are protected by multiple tried-and-tested failsafe mechanisms. They're nearly incapable of falling.

Asking software engineers about computerized voting:
Megan: That's terrifying.
Ponytail: Wait, really?
Megan: Don't trust voting software and don't listen to anyone who tells you it's safe.
Ponytail: Why?
Megan: I don't quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die.
Ponytail: They say they've fixed it with something called "blockchain."
Megan: AAAAA!!!
Cueball: Whatever they sold you, don't touch it.
Megan: Bury it in the desert.
Cueball: Wear gloves.
XKCD #2030: “Voting Software”; used under the terms of its Creative Commons Attribution-NonCommercial 2.5 License.

Or, “Robert Graham is dead wrong”.

This XKCD comic on voting software security has been going around my computer security Twitter feed today, and a lot of folks have Takes on it.

It gets at something fundamental. What is it that makes software safety so hard?

A couple of years ago, at the March 2016 STAMP Workshop in Cambridge, Massachusetts, I gave a talk titled “Safety Thinking in Cloud Software: Challenges and Opportunities” where I tried to answer that. (As always, I talk about work here but don’t speak on behalf of any former employer.) What follows is based on my notes for that talk.

I would say that responses to the comic have fallen into two big groups:

  1. Software safety is really hard because we have adversaries.
  2. The comic is needlessly nihilistic about software safety.

Robert Graham’s post “That XKCD on voting machine software is wrong” is the glass-case example of the first argument, that software safety is uniquely hard because we have adversaries.

This line of argument is fundamentally wrong, and betrays an ignorance of systems safety in general and its practice in aviation in particular.

First off, it’s just fundamentally incorrect to say that in software we have adversaries whereas in aviation we don’t. Remember, the statement the comic puts in the mouth of an aircraft designer isn’t a qualified statement—even in the presence of adversaries (9/11, MH17, even the infamous so-called “shoe bomber”) flying is still the safest way to travel.

Systems safety defines what we casually call an ‘accident’ formally as an unacceptable loss: any unacceptable loss. It doesn’t distinguish between adversarially-induced losses and non-adversarially-induced losses.

Considered from a systems safety perspective, the aviation system includes organizations like the TSA, the air marshal program, and air traffic control. It includes cockpit door locks, the fence around the airport, even the folks who go out to scare the geese away from the runways, all of which have important anti-adversarial functions.  (Man, geese, now there’s an advanced persistent threat.)

So Rob’s argument is facially, factually wrong. Now, why is it so hard to build safe software systems?

There are five big factors which make it harder to keep modern software systems safe than to keep best-in-class physical systems like airliners and elevators safe. Namely:

  1. Software is leaner.
  2. Software moves faster.
  3. Software is more complex.
  4. The geography & physics of networks are different.
  5. Consequences for adversaries are lower.

Let’s address each of these in turn.

Software is Leaner

At Akamai, we had a team of 4 people who reviewed about 50 incidents a year for a company of 6,000 people, working part-time around their other responsibilities, which included managing the incident process.

This isn’t unusual: at Stripe we had, I think, 2 people part-time reviewing incidents and managing the incident process for a company of around 1,000 people.

By contrast, I had the pleasure of sitting down with a member of the Dutch Safety Board at one of the STAMP workshops, a year or two before I spoke. He told me that they work in teams of 5 or 6, for a total of 30 people, and investigate between 5 and 10 accidents a year.

In software we make do with many fewer people on the problem. To be sure, this is partly an area which software companies could simply resource more heavily, but there’s also a bedrock belief in our industry that we should be able to make do with fewer people.

Software Moves Faster

At Akamai, the Infosec design safety review team of four to eight people reviewed about 50 new systems a year. We were generally given two business days to read about 60 pages of design documentation and provide feedback, which a security representative would take to a broader architectural review board session. And Akamai was notably hard-nosed about safety compared to some of our more agile competitors.

By contrast, the aircraft which became the Boeing 787 Dreamliner was announced in 2003, based on technology which had been in development since the late 1990s. The first production aircraft was delivered eight years later, in 2011. And the planes have an expected operational lifetime of something like 40 years. In software, if code I write is still in production six months from now, I’ll consider it to have real longevity.

Software is More Complex

The core understanding of systems safety is that most accidents, especially the really bad ones, happen not because of component failures (a popped tire or a snapped timing belt) but because of interaction failures—two systems, operating according to spec, interacting in ways the designers didn’t foresee. And software systems provide exponentially more opportunities for interaction failures than physical systems do.
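
To make “interaction failure” concrete, here is a minimal sketch of my own (an illustration, not anything from the talk): two workers, each individually correct against its own spec, that deadlock when run together. Running it will usually hang, which is exactly the point.

```python
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    # Spec: take lock A, then lock B, do the work, release both. Correct.
    with lock_a:
        time.sleep(0.1)  # widen the window so the bad interleaving is easy to hit
        with lock_b:
            print("worker_1 done")

def worker_2():
    # Spec: take lock B, then lock A, do the work, release both. Also correct.
    with lock_b:
        time.sleep(0.1)
        with lock_a:
            print("worker_2 done")

# Each worker passes its own tests in isolation. Run together, each grabs
# its first lock and then waits forever for the other's: the failure lives
# in the interaction, not in either component.
threading.Thread(target=worker_1).start()
threading.Thread(target=worker_2).start()
```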

There are lots of ways to measure complexity, but the proper way in a systems safety context is to count the number of feedback loops, and software systems have a truly enormous number. Any state in a program, including its stack, provides the opportunity for a feedback loop. Any connection over a network is inherently a feedback loop. And any modern web server can support thousands of them a second.
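
As an illustration of how easily those loops form (again my sketch, not the talk's), take the most ordinary code in the world, a retry loop: when a server slows down, timeouts trigger retries, retries add load, and the load slows the server further. Exponential backoff with jitter is the standard way to damp the loop.

```python
import random
import time
import requests  # any HTTP client would do; requests is just for illustration

def fetch_naive(url, attempts=5):
    # Every failure immediately becomes another request. Under load this
    # multiplies traffic to an already-struggling server by up to 5x:
    # slowness -> timeouts -> retries -> more load -> more slowness.
    for _ in range(attempts):
        try:
            return requests.get(url, timeout=1.0)
        except requests.RequestException:
            continue
    raise RuntimeError("gave up")

def fetch_with_backoff(url, attempts=5):
    # Exponential backoff with jitter damps the loop: each retry waits
    # longer, and the randomness spreads clients out instead of
    # synchronizing them into a thundering herd.
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=1.0)
        except requests.RequestException:
            time.sleep(random.uniform(0, min(30, 2 ** attempt)))
    raise RuntimeError("gave up")
```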

The Geography & Physics of Networks are Different

The geography and physics of networks are very weird compared to the geography and physics of the physical world. Time and distance limit the interactions which can occur between planes. Three-dimensional space limits the interactions which can occur in an elevator shaft.

On a network, on the other hand, things which are separated by miles or continents can be more or less adjacent. In fact, it’s very hard to make things not be effectively adjacent on the Internet. We go to a lot of effort to erect barriers.

It’s so easy to connect everything to everything else in software that often we do so completely by accident, and it’s frankly a wonder that things short out as rarely as they do.
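
Here is a small sketch of that default adjacency; example.com stands in for any public host, and the firewall rule in the comments is illustrative:

```python
import socket

# Reaching a machine that may be on another continent takes three lines
# and a few hundred milliseconds; the network gives you adjacency for free.
conn = socket.create_connection(("example.com", 443), timeout=5)
print("connected to", conn.getpeername())
conn.close()

# Making things *not* adjacent is the part that takes deliberate work:
# firewalls, private subnets, network policies. For instance, a typical
# (illustrative) iptables deny rule:
#
#   iptables -A INPUT -p tcp --dport 443 -j DROP
#
# In the physical world, distance and walls do that work for you.
```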

Consequences for Adversaries are Lower

In order to successfully hijack a plane, an attacker needs to run a very real risk of death or being arrested. In a suicide attack like the 9/11 hijackings, the attacker dying is even part of the plan!  This weeds out all but the most committed, ideological adversaries.  Even an attack like the MH17 shoot-down, where the adversaries weren’t themselves directly risking death, could have resulted in sanctions against their entire country.

By contrast—because of the weird geography of networks—there are very few cyberattacks where the attacker is directly risking death. While countries have been sanctioned over cyberattacks, that’s a relatively new phenomenon, and it’s harder work for law enforcement to track down the perpetrators in the first place, since cyberattacks are so much more likely to cross jurisdictional boundaries.

What, then, are we to make of all this? Are the nihilists right? Is software doomed to always be unsafe?

I don’t think so, and of course through my work I hope to make software safer.

Software Needs a Safety Culture

The Wright brothers were more or less two guys in their garage, who flew the first airplane at Kitty Hawk in 1903. Less than 25 years later, in 1926, the Air Commerce Act assigned the Commerce Department responsibility for investigating accidents, at a time when air mail was still a pretty neat idea. The first web browser was released in 1991, and over the last 27 years we’ve built something extraordinarily more complicated on the Internet, with far greater access to and effect on people’s daily lives, without the same kind of investment in safety and accident investigation.

Even within large software companies, we’re only beginning to recognize that safety is a discipline and that we need to invest in it, and we’re struggling to identify and pull in knowledge and expertise from older fields like aviation to help us ensure it.

I believe that we can build safer software systems, even in the presence of asymmetric adversaries, in lean, fast-moving organizations, with massive complexity and the weird geography of networks. We have a lot of work ahead of us, but the same principles apply in software as anywhere else, and we can take a lot of inspiration from how fields like aviation have learned to keep us safe.

(P.S. Do I think that the comic is right about the current state of voting software? Absolutely.)