I’ve launched a podcast and a YouTube channel!

The faces of the first four guests on the War Stories podcast

It only took three years since I initially teased it, but I’ve finally launched the Critical Point YouTube channel!  The first thing on it is a podcast I’ve also just launched called “War Stories” in which I’m interviewing software engineers about that time they broke production (we’re also available on most other podcast services).

Critical Point logo and wordmark

The main site with all the links lives at criticalpoint.tv.

I’m having a ton of fun with it—the interviews, the editing, even the promo.  If you find me in person, I even have stickers.  Please check it out!

“What do you say when the system goes down? How to write an internal incident email,” an article now live at GitHub’s The ReadMe Project

tl;dr I have an article about incident management up.

Back in February, Virginia Bryant from GitHub reached out to me.  GitHub was spinning up a new online magazine, The ReadMe Project, on a model similar to Stripe’s Increment.  She’d read my article about threat modeling there and liked it, and asked whether I’d be interested in writing an article on a security topic for her magazine.

I really felt like I had said what I had to say about threat modeling for the time being, but, after chatting a bit about who her audience was and what their needs were, we settled on the topic of incident management.

Because I have so much to say on the topic from my years helping to run the incident management process at Akamai, but only a relatively short article to say it in, I decided to focus tightly on composing the incident email—although a great deal about structuring the overall process turned out to be latent in that.

As I discuss in the article, at the highest level, an incident email needs to include six things—

  1. What we are perceiving which causes us to believe that something bad may be happening;
  2. Our best guess right now of how bad it is;
  3. How far along we are in our response to it;
  4. Which one person is directly responsible for coordinating the response;
  5. Where we’re coordinating;
  6. Who else is involved and in what capacity.

—but so much emerges from that.
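
To make that concrete, here’s a minimal sketch of how those six items might come together in a single email.  This is my own illustration for this post rather than code from the article, and every field name and value in it is hypothetical:

```python
# A rough, hypothetical sketch of assembling the six items into an incident email.
# None of these names or values come from the article; they're for illustration only.
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    observations: str          # 1. what we're perceiving that makes us think something bad may be happening
    severity_guess: str        # 2. our best guess right now of how bad it is
    response_stage: str        # 3. how far along we are in our response
    incident_lead: str         # 4. the one person directly responsible for coordinating the response
    coordination_channel: str  # 5. where we're coordinating
    responders: dict           # 6. who else is involved, and in what capacity

    def to_email_body(self) -> str:
        involved = "\n".join(f"  - {name}: {role}" for name, role in self.responders.items())
        return (
            f"What we're seeing: {self.observations}\n"
            f"Current best guess at impact: {self.severity_guess}\n"
            f"Response status: {self.response_stage}\n"
            f"Incident lead: {self.incident_lead}\n"
            f"Coordinating in: {self.coordination_channel}\n"
            f"Who's involved:\n{involved}\n"
        )


print(IncidentUpdate(
    observations="5xx rate on the API tier jumped from 0.1% to 8% at 14:05 UTC",
    severity_guess="customer-facing; roughly 1 in 12 requests failing",
    response_stage="investigating; mitigation not yet identified",
    incident_lead="A. Example (hypothetical)",
    coordination_channel="#incident-1234 (hypothetical channel)",
    responders={"B. Example": "API on-call", "C. Example": "communications"},
).to_email_body())
```

However you actually format the email, the idea is that a reader should be able to find the answer to each of those six questions without hunting for it.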

Working with Virginia and the ReadMe Project folks was a great experience (highly recommended), and many thanks to them for providing me a venue to talk about something I’ve wanted to talk about for a long time.

It turns out that I have a lot more to say about incident management, so I’m working to find more places to write about it in the future.  (One is already in the works, on incident action items, so watch this space. 🙂 )

In the meantime, go check out the article!

A quick fun thing: Ever wanted to run a nuclear power plant?

Over the last couple of weeks, and partly in honor of the late Dan Kaminsky, who, as far as I knew, never met a weird machine he didn’t like, I’ve finished debugging my Inform 7 text-adventure port of Stephen R. Berggren’s 1980 “Apple Nuclear Power Plant” simulation and released it in a form you can play online! (Or offline, if you have a Glulxe interpreter, e.g. Gargoyle.)

(You can also play Berggren’s original Applesoft BASIC version online via Joshua Bell’s Applesoft BASIC in Javascript project—go to that link and select “Nuclear Power Plant” under “Other” in the “Select a sample…” dropdown. But the text-adventure port has advances like modern fonts and improved graphics, which I suspect will make it a friendlier experience for a lot of people.)

Edit 2021-05-10: Here’s a screenshot.

A screenshot of the game. It reads: "> sit down; You sit down in the heavy padded chair.  The screen reads:" and then the default screen output at the very beginning of the game showing the resting state of the nuclear power plant.

Just a Little Thing I’ve Been Working On

As part of my grand plans, which I haven’t talked about here much at all, I’m working on doing more video stuff. So far, so good! (More info to come.)

To start off, I wanted to do a video version of something that I’ve done often, live and extemporaneously, so I used my post about redundancy and reliability as a jumping-off point.

Tune in if you want to watch me get very worked up about why you can’t just add a backup component and expect your system to fail less often!
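
If you’d rather see the gist in numbers than on video, here’s a back-of-the-envelope sketch with made-up failure probabilities (my illustration, not figures from the post or the video): once the failover machinery and common-mode failures enter the picture, the naive improvement from adding a backup largely disappears.

```python
# Back-of-the-envelope reliability math with assumed, made-up numbers, illustrating
# why a backup only helps as much as the failover machinery and common-mode
# failures allow.  (Approximate arithmetic, not a rigorous probability model.)
p_component = 0.01      # probability a single component fails on demand (assumed)
p_failover = 0.02       # probability the failover mechanism itself fails when needed (assumed)
p_common_mode = 0.005   # probability both copies fail together, e.g. shared power or a shared bug (assumed)

# Naive view: two independent copies and perfect failover.
p_naive = p_component ** 2

# Less naive view: the system fails if a common-mode event takes out both copies,
# or if the primary fails and then either the failover doesn't work or the backup fails too.
p_realistic = p_common_mode + p_component * (p_failover + p_component)

print(f"single component:            {p_component:.4f}")
print(f"naive redundant pair:        {p_naive:.6f}")
print(f"with failover + common mode: {p_realistic:.6f}")
```

With these assumed numbers the redundant pair is only about twice as reliable as a single component, not the hundredfold improvement the naive calculation promises, because the common-mode term dominates.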

Some of the Problems with Root Cause Analysis

So this tweet blew up! If I could anticipate which of my systems safety tweets would go big, I would have follow-up material prepped ahead of time; instead, you get this quick post.

I’m going to take this in two parts: the first, on the problems with root cause analysis, and the second, on Tarot and other methods of understanding complex systems (which I will link here once I publish it).

My basic frustration with Root Cause Analysis (both formal RCA as part of a Total Quality Engineering program and the informal, ad-hoc root cause analysis that we perform in the software industry every day) is that it fundamentally misunderstands how accidents happen.

As I’ve said here before (and drawing on a great body of literature), most really bad accidents happen not because of a component failure (a bolt shears, a belt snaps, a == is swapped for a != in a single line of code) but because multiple systems, each behaving correctly (according to its designers’ understanding and intent), interact in a way that their designers didn’t foresee.  The interaction can be as simple as subsystem A being specified to output values in metric units, and subsystem B being specified to take input in US units.  If those systems ever talk to one another without their designers realizing it, bad things happen.
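
Here’s a toy sketch of that units example (entirely hypothetical code, mine rather than anything from a real system): each subsystem does exactly what its own spec says, and the hazard lives only in the unplanned interaction.

```python
# Toy illustration: each subsystem is correct per its own spec; the failure
# only appears when they interact.  (Hypothetical code, not from any real system.)

def subsystem_a_thrust_reading() -> float:
    """Spec for subsystem A: report impulse in newton-seconds (metric)."""
    return 450.0  # correct per A's spec


def subsystem_b_update_trajectory(impulse_lbf_s: float) -> float:
    """Spec for subsystem B: accept impulse in pound-force seconds (US units)."""
    NEWTON_SECONDS_PER_LBF_S = 4.448  # 1 lbf·s is about 4.448 N·s
    return impulse_lbf_s * NEWTON_SECONDS_PER_LBF_S  # correct per B's spec


# The unplanned interaction: A's metric output is fed straight into B, which
# silently treats it as US units.  No component "failed," but the combined
# result is off by a factor of about 4.4.
reading = subsystem_a_thrust_reading()
print(subsystem_b_update_trajectory(reading))
```

Neither function has a bug you could point to in isolation; a review of either one against its own spec would pass.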

The mistake of root cause analysis (and many similar methods like “Five Whys”, fault tree analysis, and the “Swiss Cheese model”) is the belief that there is one singular, necessary cause of any accident, when in fact there are many varied and contingent instigating conditions, none of which need have caused an accident on its own.

Imagine a typical root cause analysis of the scenario I mention above.  If the fault was first noticed in subsystem A, the root cause analysis (performed by the subsystem A team) would almost certainly find that “subsystem B failed to correctly accept metric units.”  Whereas if the fault was first noticed in subsystem B, the root cause would be identified by the B team as “subsystem A erroneously output metric units.”  Opposite findings!  They can’t both be true.  It’s only when we take a step back and look at the system as a whole that we have a hope of understanding what’s really going on.

This is especially true when there are relevant conditions which are outside the scope of our analysis, for example in the case of a common-mode failure.  A root-cause analysis which doesn’t consider the operating environment of the components will produce incorrect results.

Should the root cause in our example then simply be identified as “subsystem A was specified to output values in metric units, subsystem B was specified to take input in US units, and they interacted”?  After all, that seems correct and complete based on what we know so far.  But why did they interact?  It’s quite possible that under most normal scenarios they never would.  Under what exceptional scenarios is their interaction possible?  And then what causes those scenarios to occur?  Etc., etc.

The five whys, you see, is all about asking “Why?” until you get an answer you like. Ideally one that shifts the blame to someone who cannot defend themselves and/or is very cheap to remediate.

— Richard Tibbetts (@tibbetts) December 30, 2018

All of these methods focus too much on the specific components and the path or paths which lead up to an accident.  The strength of modern methods, and especially STAMP-derived methods, is that they approach safety as avoiding a set of states or conditions of a system, rather than attempting to plug all the Swiss cheese holes one by one.

For the same reason that the STAMP-inspired security threat modeling approach I use and advocate completely ignores the questions of “why does an adversary want to interfere with this system” and “how does an adversary gain access to this system,” focusing only on “what can an adversary do to this system given access,” the components involved and the path by which the accident-enabling conditions come about are irrelevant to the safety of the system; what matters is only whether those conditions obtain or not.

Some folks on Twitter have told me that they use a form of root cause analysis which looks not for a singular root cause but for root causes, which is at least a step forward from the singular-cause form of analysis.  Still, there’s nothing in that analysis method which forces you to step back from the details far enough to see the proverbial forest for the trees.

Those are, in brief, some of the problems with root cause analysis.  In part two, we consider the strengths and weaknesses of Tarot as a method of understanding complex systems (which I will link here once it’s up).