What do you say when the system goes down? My article on how to write an internal incident email is live at GitHub’s The ReadMe Project

tl;dr I have an article about incident management up.

Back in February, Virginia Bryant from GitHub reached out to me.  GitHub was spinning up a new online magazine, The ReadMe Project, on a model similar to Stripe’s Increment.  She’d read my article about threat modeling in Increment and liked it, and wanted to know whether I’d be interested in writing an article on a security topic for her magazine.

I really felt like I had said what I had to say about threat modeling for the time being, but, after chatting a bit about who her audience was and what their needs were, we settled on the topic of incident management.

Because I have so much to say on the topic from my years helping to run the incident management process at Akamai, but only a relatively short article in which to say it, I decided to focus tightly on composing the incident email—although a great deal about structuring the overall process turned out to be latent in that.

As I discuss in the article, at the highest level, an incident email needs to include six things—

  1. What we are perceiving which causes us to believe that something bad may be happening;
  2. Our best guess right now of how bad it is;
  3. How far along we are in our response to it;
  4. Which one person is directly responsible for coordinating the response;
  5. Where we’re coordinating;
  6. Who else is involved and in what capacity.

—but so much emerges from that.
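
To make that concrete, here is a minimal sketch of those six fields as a fill-in template, expressed as a little Python helper.  The field names, the rendering, and the example values are all my own illustration, not a format the article prescribes.

    from dataclasses import dataclass

    @dataclass
    class IncidentEmail:
        observations: str           # 1. what we're perceiving that suggests something bad
        severity_guess: str         # 2. our best guess right now of how bad it is
        response_stage: str         # 3. how far along we are in our response
        incident_lead: str          # 4. the one person coordinating the response
        coordination_channel: str   # 5. where we're coordinating
        responders: str             # 6. who else is involved, and in what capacity

        def render(self) -> str:
            # One short line per field; someone skimming this on a phone at 3 a.m.
            # should be able to find each answer without hunting for it.
            return (
                f"What we're seeing: {self.observations}\n"
                f"How bad we think it is: {self.severity_guess}\n"
                f"Where we are in the response: {self.response_stage}\n"
                f"Incident lead: {self.incident_lead}\n"
                f"Coordinating in: {self.coordination_channel}\n"
                f"Also involved: {self.responders}\n"
            )

    # Hypothetical example values, purely for illustration.
    print(IncidentEmail(
        observations="elevated 5xx rates on the public API since 14:20 UTC",
        severity_guess="customer-visible degradation, scope not yet clear",
        response_stage="investigating; no mitigation identified yet",
        incident_lead="A. Example (on-call SRE)",
        coordination_channel="#incident-1234 in chat",
        responders="API on-call (diagnosing), networking on-call (consulted)",
    ).render())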

Working with Virginia and the ReadMe Project folks was a great experience (highly recommended), and many thanks to them for providing me this venue to talk about something I’ve wanted to talk about for a long time.

It turns out that I have a lot more to say about incident management, so I’m working to find more places to write about it in the future.  (One is already in the works, on incident action items, so watch this space. 🙂 )

In the meantime, go check out the article!

A quick fun thing: Ever wanted to run a nuclear power plant?

Over the last couple of weeks, and partly in honor of the late Dan Kaminsky, who as far as I knew never met a weird machine he didn’t like, I’ve finished debugging my Inform 7 text-adventure port of Stephen R. Berggren’s 1980 “Apple Nuclear Power Plant” simulation and released it in a form you can play online! (Or offline, if you have a Glulxe interpreter, e.g. Gargoyle.)

(You can also play Berggren’s original Applesoft BASIC version online via Joshua Bell’s Applesoft BASIC in Javascript project—go to that link and select “Nuclear Power Plant” under “Other” in the “Select a sample…” dropdown. But I suspect the text-adventure port’s advances, like modern fonts and improved graphics, will make it a friendlier experience for a lot of people.)

Edit 2021-05-10: Here’s a screenshot.

A screenshot of the game. It reads: "> sit down; You sit down in the heavy padded chair.  The screen reads:" and then the default screen output at the very beginning of the game showing the resting state of the nuclear power plant.

Just a Little Thing I’ve Been Working On

As part of my grand plans, which I haven’t talked about here much at all, I’m working on doing more video stuff. So far, so good! (More info to come.)

To start off, I wanted to do a video version of something that I’ve done often, live and extemporaneously, so I used my post about redundancy and reliability as a jumping-off point.

Tune in if you want to watch me get very worked up about why you can’t just add a backup component and expect your system to fail less often!

Some of the Problems with Root Cause Analysis

So this tweet blew up! If I could anticipate which of my systems safety tweets would go big, I would have follow-up material prepped ahead of time, but instead you get this quick post.

I’m going to take this in two parts: the first, on the problems with root cause analysis, and the second, on Tarot and other methods of understanding complex systems (which I will link here once I publish it).

My basic frustration with Root Cause Analysis (both formal RCA as part of a Total Quality Engineering program and the informal, ad-hoc root cause analysis that we perform in the software industry every day) is that it fundamentally misunderstands how accidents happen.

As I’ve said here before (and drawing on a great body of literature), most really bad accidents happen not because of a component failure (a bolt shears, a belt snaps, a == is swapped for a != in a single line of code) but because multiple systems, which are behaving correctly (according to their designers’ understanding and intent), interact in a way that their designers didn’t foresee.  Which can be as simple as subsystem A being specified to output values in metric units, and subsystem B being specified to take input in US units.  If those systems ever talk to one another, without their designers realizing, bad things happen.
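
To make the units example concrete, here’s a toy sketch in Python.  Everything in it is invented for illustration; the point is that each function is correct against its own spec, and the bug lives entirely in the unexamined connection between them.

    def subsystem_a_speed() -> float:
        """Subsystem A's spec: report vehicle speed in kilometres per hour (metric)."""
        return 100.0  # exactly what A is specified to do

    def subsystem_b_should_brake(speed_mph: float) -> bool:
        """Subsystem B's spec: engage braking above 65 miles per hour (US units)."""
        return speed_mph > 65.0  # also exactly what B is specified to do

    # The designers never realized these two would talk to each other directly,
    # so A's km/h value is handed to B as if it were mph.
    speed = subsystem_a_speed()
    print(subsystem_b_should_brake(speed))
    # Prints True: 100 km/h is only about 62 mph, so B brakes when it shouldn't.
    # (With the mismatch running the other way, mph fed into a km/h threshold,
    # the failure would be the quieter, scarier one: not braking when it should.)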

The mistake of root cause analysis (and many similar methods like “Five Whys”, fault tree analysis, and the “Swiss Cheese model”) is the belief that there is one, singular and necessary cause of any accident, when in fact there are many, varied and contingent instigating conditions, none of which need necessarily have caused an accident on their own.

Imagine a typical root cause analysis of the scenario I mention above.  If the fault was first noticed in subsystem A, the root cause analysis (performed by the subsystem A team) would almost certainly find that “subsystem B failed to correctly accept metric units.”  Whereas if the fault was first noticed in subsystem B, the root cause would be identified by the B team as “subsystem A erroneously output metric units.”  Opposite findings!  They can’t both be true.  It’s only when we take a step back and look at the system as a whole that we have a hope of understanding what’s really going on.

This is especially true when there are relevant conditions which are outside the scope of our analysis, for example in the case of a common-mode failure.  A root-cause analysis which doesn’t consider the operating environment of the components will produce incorrect results.

Should the root cause in our example then be simply identified as “subsystem A was specified to output values in metric units, subsystem B was specified to take input in US units, and they interacted”?  After all, that seems correct and complete based on what we know so far.  But why did they interact?  It’s quite possible that under most normal scenarios they never would.  Under what exceptional scenarios is their interaction possible?  And then what causes those scenarios to occur?  Etc., etc.

The five whys, you see, is all about asking “Why?” until you get an answer you like. Ideally one that shifts the blame to someone who cannot defend themselves and/or is very cheap to remediate.

— Richard Tibbetts (@tibbetts) December 30, 2018

All of these methods focus too much on the specific components and the path or paths which lead up to an accident.  The strength of modern methods, and especially STAMP-derived methods, is that they approach safety as avoiding a set of states or conditions of a system, rather than attempting to plug all the Swiss cheese holes one by one.

For the same reason that the STAMP-inspired security threat modeling approach I use and advocate completely ignores the questions of “why does an adversary want to interfere with this system” and “how does an adversary gain access to this system,” and focuses only on “what can an adversary do to this system given access,” the specific components involved and the path by which the accident-enabling conditions arise are irrelevant to the safety of the system; all that matters is whether those conditions obtain or not.
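
As a rough illustration of what “safety as avoiding a set of states” can look like in practice (this is my own toy sketch, not STPA notation, and the numbers are invented), you can write the hazardous conditions down as predicates over the system’s state, with no reference at all to which component failed or which path got you there:

    from dataclasses import dataclass

    @dataclass
    class PlantState:
        reactor_temp_c: float
        coolant_flow_lps: float
        control_rods_inserted: bool

    # Hazards are conditions of the system as a whole. Nothing here cares which
    # component misbehaved or what sequence of events produced the state.
    HAZARDS = {
        "high temperature with low coolant flow":
            lambda s: s.reactor_temp_c > 350 and s.coolant_flow_lps < 10,
        "high temperature with control rods withdrawn":
            lambda s: s.reactor_temp_c > 400 and not s.control_rods_inserted,
    }

    def violated_constraints(state: PlantState) -> list[str]:
        """List every safety constraint the current state violates."""
        return [name for name, predicate in HAZARDS.items() if predicate(state)]

    print(violated_constraints(
        PlantState(reactor_temp_c=420.0, coolant_flow_lps=5.0, control_rods_inserted=False)))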

Some folks on Twitter have told me that they use a form of root cause analysis which looks not for a singular root cause but for root causes, plural, which is at least a step forward from the singular-cause form of analysis.  Still, there’s nothing in that analysis method that forces you to step back from the details far enough to see the proverbial forest for the trees.

That is, in brief, some of the problems with root cause analysis.  In part two, we consider the strengths and weaknesses of Tarot as a method of understanding complex systems (which I will link here once it’s up).

Redundancy is Not Reliability; or, The Parable of the Two Timing Belts

Lately when I’m trying to explain systems safety to people I often tell this story. I want to be clear upfront that 1), I completely 100% invented this story, even though I present it as a thing that actually happened; and 2), the story probably works better if you (like me) are not overburdened with knowledge of internal combustion engines. I will happily accept polite corrections in the comments.

The story I tell is this:

In the very early days of automobiles, when people were still figuring out a lot of the very basic things about them, one of the all-too-common problems was that the timing belts on their engines would break. The timing belt is a rubber belt which keeps the engine’s valves opening and closing in sync with the turning of the engine, and if it breaks the whole engine can be destroyed quite catastrophically. And so some clever engineer quite reasonably said to themself, “Well, why don’t we put a second timing belt on to keep the engine going if the first timing belt breaks?”

So they did, and they shipped it to their customers as a new and improved model that was less likely to fail catastrophically, and those people bought the cars and drove away happy… and then a short while later those new-model cars started to come back with broken timing belts at the same rate.

Now the engineers are really tearing their hair out, trying to figure out what’s going on. If the first timing belt has a mean time to failure of 1000 hours and the second timing belt also has a mean time to failure of 1000 hours, then surely the system with two timing belts should have a mean time to failure of 2000 hours, it just stands to reason, right? Why were the cars still breaking just as fast?

Well, it turns out that there are a lot of reasons this could potentially be, but in this particular instance some clever engineer eventually noticed that the timing belts were more likely to fail in the winter and the summer and less likely to fail in spring and autumn. Now, this was the early days of rubber vulcanization, and the material they were making these belts out of wasn’t very good yet. Before long they realized that the extreme heat of summer and the extreme cold of winter were making the belts fragile and prone to breakage, and that both belts were exposed to the same environment, and that was why both belts would break at about the same time.
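
If you want to see the arithmetic of the story play out, here’s a quick Monte Carlo sketch under assumptions I’ve picked purely for illustration: belt lifetimes averaging 1000 hours, and the engine failing only once both belts have failed. With genuinely independent belts the redundancy does help (though it buys less than the intuitive doubling); with a shared environmental cause it buys almost nothing, which is the punchline of the story.

    import random

    N = 100_000
    MEAN_LIFE = 1000.0  # hours, the assumed mean lifetime of one belt

    def independent_failure_time() -> float:
        """Both belts wear out independently; the engine runs until the second one breaks."""
        return max(random.expovariate(1 / MEAN_LIFE),
                   random.expovariate(1 / MEAN_LIFE))

    def common_mode_failure_time() -> float:
        """A shared environmental stress (summer heat, winter cold) dominates,
        so both belts become fragile, and break, at nearly the same time."""
        shared = random.expovariate(1 / MEAN_LIFE)    # the environment's contribution
        return max(0.0,                               # clamp so a lifetime can't go negative
                   shared + random.gauss(0, 25),      # small belt-to-belt variation
                   shared + random.gauss(0, 25))

    print("one belt alone:               mean time to failure ~%.0f hours" % MEAN_LIFE)
    print("two belts, independent wear:  mean time to failure ~%.0f hours"
          % (sum(independent_failure_time() for _ in range(N)) / N))
    print("two belts, common-mode wear:  mean time to failure ~%.0f hours"
          % (sum(common_mode_failure_time() for _ in range(N)) / N))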

I assume that improved materials were a big part of the fix, but I don’t rightly know. It’s not a common problem any more.

One thing to take from this story (which, again, I 100% made up) is that we can’t (necessarily) build more-reliable systems by using more-reliable components. Most really bad accidents happen, not because a bolt sheared or a belt snapped, but because two subsystems, both operating correctly according to spec, interact in a way that their designers didn’t foresee.

And this can be as simple as not realizing, when designing the overall system, that Subsystem A (which is spec’d to output values in metric) is connected to Subsystem B (which is spec’d to take input values in Imperial). (And if you think I’m making this up too, I’m not.) No number of redundant Subsystems A or B will save you from that accident.

There are many techniques for trying to surface such specification errors, both while designing a system and after an accident has happened. Nancy Leveson’s STAMP family of techniques (open access PDF at link) is possibly the best of these so far. Sometimes those techniques may help designers determine that the best path forward is to add redundancy (it works for semi-truck tires!). But naïvely adding redundancy is as likely to hurt as it is to help.