Some of the Problems with Root Cause Analysis

So this tweet blew up! If I could anticipate which of my systems safety tweets would go big, I’d have follow-up material prepped ahead of time, but instead you get this quick post.

I’m going to take this in two parts: the first, on the problems with root cause analysis, and the second, on Tarot and other methods of understanding complex systems (which I will link here once I publish it).

My basic frustration with Root Cause Analysis (both the formal RCA performed as part of a Total Quality Engineering program and the informal, ad-hoc root cause analysis we perform in the software industry every day) is that it fundamentally misunderstands how accidents happen.

As I’ve said here before (and drawing on a great body of literature), most really bad accidents happen not because of a component failure (a bolt shears, a belt snaps, a == is swapped for a != in a single line of code) but because multiple systems, each behaving correctly according to its designers’ understanding and intent, interact in a way that their designers didn’t foresee.  That can be as simple as subsystem A being specified to output values in metric units while subsystem B is specified to take input in US units.  If those systems ever talk to one another without their designers realizing it, bad things happen.
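To make it concrete, here’s a tiny invented sketch (Python, with made-up names and numbers) of how innocent this looks in code: each function does exactly what its own spec says, and the bug lives entirely in the connection between them.

```python
# An invented sketch: each function honors its own spec; the failure is in the
# unexamined interface between them, not in either component.

def thruster_impulse_newton_seconds(burn_seconds: float) -> float:
    """Subsystem A: specified to report impulse in metric units (N*s)."""
    THRUST_NEWTONS = 20.0  # made-up constant
    return THRUST_NEWTONS * burn_seconds

def update_trajectory(impulse_pound_seconds: float) -> float:
    """Subsystem B: specified to accept impulse in US units (lbf*s)."""
    NS_PER_LBF_S = 4.448
    return impulse_pound_seconds * NS_PER_LBF_S  # converts to N*s internally

# The integration: nobody converts, so B silently treats A's newton-seconds
# as pound-seconds, and the result is off by a factor of ~4.45.
impulse = thruster_impulse_newton_seconds(burn_seconds=10.0)
print(update_trajectory(impulse))  # wrong because of the *interaction*, not A or B
```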

The mistake of root cause analysis (and many similar methods like “Five Whys”, fault tree analysis, and the “Swiss Cheese model”) is the belief that there is one, singular and necessary cause of any accident, when in fact there are many, varied and contingent instigating conditions, none of which need necessarily have caused an accident on their own.

Imagine a typical root cause analysis of the scenario I mention above.  If the fault was first noticed in subsystem A, the root cause analysis (performed by the subsystem A team) would almost certainly find that, “subsystem B failed to correctly accept metric units.”  Whereas if the fault was first noticed in subsystem B, the root cause would be identified by the B team as “subsystem A erroneously output metric units.”  Opposite findings!  Both can’t be true.  It’s only when we take a step back and look at the system as a whole that we have a hope of understanding what’s really going on.

This is especially true when there are relevant conditions which are outside the scope of our analysis, for example in the case of a common-mode failure.  A root-cause analysis which doesn’t consider the operating environment of the components will produce incorrect results.

Should the root cause in our example then simply be identified as “subsystem A was specified to output values in metric units, subsystem B was specified to take input in US units, and they interacted”?  After all, that seems correct and complete based on what we know so far.  But why did they interact?  It’s quite possible that under most normal scenarios they never would.  Under what exceptional scenarios is their interaction possible?  And then what causes those scenarios to occur?  And so on.

The five whys, you see, is all about asking “Why?” until you get an answer you like. Ideally one that shifts the blame to someone who cannot defend themselves and/or is very cheap to remediate.

— Richard Tibbetts (@tibbetts) December 30, 2018

All of these methods focus too much on the specific components and the path or paths which lead up to an accident.  The strength of modern methods, and especially STAMP-derived methods, is that they approach safety as avoiding a set of states or conditions of a system, rather than attempting to plug all the Swiss cheese holes one by one.
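To give a flavor of what “avoiding a set of states” can look like in software, here’s a toy sketch of my own (this is not STAMP notation, and the threshold is invented): you write down the unsafe condition directly and check for it, without caring how the system got there.

```python
# A toy illustration (mine, not STAMP's notation): state the unsafe condition
# directly and monitor for it, rather than enumerating the paths that lead to it.
from dataclasses import dataclass

@dataclass
class TankState:
    valve_open: bool
    pressure_kpa: float

MAX_SAFE_PRESSURE_KPA = 800.0  # invented threshold, purely illustrative

def violates_safety_constraint(state: TankState) -> bool:
    """Unsafe state: the relief valve is closed while pressure is above the limit.
    We don't care *how* the system reached this state, only that it must not."""
    return (not state.valve_open) and state.pressure_kpa > MAX_SAFE_PRESSURE_KPA

# The monitor runs against whatever state the system is actually in,
# regardless of which components or interactions produced it.
assert not violates_safety_constraint(TankState(valve_open=True, pressure_kpa=950.0))
assert violates_safety_constraint(TankState(valve_open=False, pressure_kpa=950.0))
```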

The STAMP-inspired security threat modeling approach I use and advocate completely ignores the questions “why does an adversary want to interfere with this system?” and “how does an adversary gain access to this system?”, and focuses only on “what can an adversary do to this system, given access?”  For the same reason, the specific components involved, and the path by which the accident-enabling conditions arise, are irrelevant to the safety of the system; all that matters is whether those conditions obtain or not.

Some folks on Twitter have told me that they use a form of root cause analysis which looks not for a singular root cause but for root causes, which is at least a step forward from the singular-cause form of analysis.  Still, there’s nothing in that method which forces you to step far enough back from the details to see the proverbial forest for the trees.

Those are, in brief, some of the problems with root cause analysis.  In part two, we consider the strengths and weaknesses of Tarot as a method of understanding complex systems (which I will link here once it’s up).

Redundancy is Not Reliability; or, The Parable of the Two Timing Belts

Lately when I’m trying to explain systems safety to people I often tell this story. I want to be clear upfront that 1), I completely 100% invented this story, even though I present it as a thing that actually happened; and 2), the story probably works better if you (like me) are not overburdened with knowledge of internal combustion engines. I will happily accept polite corrections in the comments.

The story I tell is this:

In the very early days of automobiles, when people were still figuring out a lot of the very basic things about them, one of the all-too-common problems was that the timing belts on their engines would break. The timing belt is a rubber belt which keeps the engine’s valves opening and closing in sync with the turning of the engine, and if it breaks the whole engine can be destroyed quite catastrophically. And so some clever engineer quite reasonably said to themself, “Well, why don’t we put a second timing belt on to keep the engine going if the first timing belt breaks?”

So they did, and they shipped it to their customers as a new and improved model that was less likely to fail catastrophically, and those people bought the cars and drove away happy… and then a short while later those new-model cars started to come back with broken timing belts at the same rate.

Now the engineers are really tearing their hair out, trying to figure out what’s going on. If the first timing belt has a mean time to failure of 1000 hours, and the second timing belt also has a mean time to failure of 1000 hours, then surely the system with two timing belts should have a mean time to failure of 2000 hours; it just stands to reason, right? So why were the cars still breaking just as fast?

Well, it turns out there are a lot of reasons this could happen, but in this particular instance some clever engineer eventually noticed that the timing belts were more likely to fail in the winter and the summer and less likely to fail in spring and autumn. Now, this was the early days of rubber vulcanization, and the material they were making these belts out of wasn’t very good yet. Before long they realized that the extreme heat of summer and the extreme cold of winter were making the belts fragile and prone to breakage, and since both belts were exposed to the same environment, both belts would break at about the same time.
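If you want to see how badly the engineers’ intuition fails, here’s a crude Monte Carlo sketch with numbers and a failure model I invented (like the rest of the story): when the two belts fail independently, the second belt buys you roughly an extra 50% of life, but when both belts are worn down by the same seasonal stress it buys you almost nothing.

```python
# A minimal Monte Carlo sketch (invented numbers) comparing the mean time to
# failure of a two-belt system when belt failures are independent versus when
# they are dominated by a shared environmental cause.
import random

TRIALS = 100_000
MTTF_BELT = 1000.0  # assumed mean time to failure of a single belt, in hours

def independent_system() -> float:
    """Two belts that fail independently; the system runs until both have failed."""
    return max(random.expovariate(1 / MTTF_BELT),
               random.expovariate(1 / MTTF_BELT))

def common_mode_system() -> float:
    """Both belts degrade under the same seasonal stress: their failure times
    share a large common term plus a small independent one."""
    shared = random.expovariate(1 / MTTF_BELT)  # heat/cold damage hits both belts
    return max(shared + random.expovariate(1 / (0.05 * MTTF_BELT)),
               shared + random.expovariate(1 / (0.05 * MTTF_BELT)))

print("one belt:           ~%.0f h" % MTTF_BELT)
print("two, independent:   ~%.0f h" %
      (sum(independent_system() for _ in range(TRIALS)) / TRIALS))
print("two, common cause:  ~%.0f h" %
      (sum(common_mode_system() for _ in range(TRIALS)) / TRIALS))
```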

I assume that improved materials were a big part of the fix, but I don’t rightly know. It’s not a common problem any more.

One thing to take from this story (which, again, I 100% made up) is that we can’t (necessarily) build more-reliable systems by using more-reliable components. Most really bad accidents happen, not because a bolt sheared or a belt snapped, but because two subsystems, both operating correctly according to spec, interact in a way that their designers didn’t foresee.

And this can be as simple as not realizing, when designing the overall system, that Subsystem A (which is spec’d to output values in metric) is connected to Subsystem B (which is spec’d to take input values in Imperial). (If you think I’m making this up too: it has really happened, most famously to the Mars Climate Orbiter.) No number of redundant Subsystems A or B will save you from that accident.

There are many techniques for trying to surface such specification errors, both while designing a system and after an accident has happened. Nancy Leveson’s STAMP family of techniques (open access PDF at link) is possibly the best of these so far. Sometimes those techniques may help designers determine that the best path forward is to add redundancy (it works for semi truck tires!), but naïvely adding redundancy is as likely to hurt as it is to help.

What We’re Talking About, When We Talk About Data Destruction

When I wrote my post back in May of last year about Apple’s recycled hardware reuse policy, I found myself frustrated by how hard it was to talk about how thoroughly a storage device had been destroyed, or even about which threats would lead one to want to physically destroy it in the first place.

Early in the work that I did at Akamai on data destruction, we built a very casual sort of threat model, but we never worked it up in any more rigorous fashion, which would have allowed us to talk consistently about the threats we were concerned about. We still managed to deliver a coherent solution, but I think it’s worth formalizing exactly what we were trying to achieve.

It’s very easy to get distracted by the spy-games aspect of data destruction; everybody brings up thermite when I mention the topic. This DEF CON presentation by my friend Zoz from a few years ago suggests the limits of thermite as a practical solution. In reality, your biggest worry is always somebody pulling your data off over a SATA cable because you forgot to wipe the drive before disposing of it.

Here is my attempt at a threat model for information disclosure attacks on storage devices at rest, on the Principals-Goals-Adversities-Invariants rubric I wrote about in Increment.  (“If you’re not talking about an adversary, you aren’t doing security.”)

Before going any further, a disclaimer, as always: although I talk about things that I’ve done for work here, I speak only for myself and not for any current or previous employers.
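Before diving in, here’s a rough sketch of the shape such a model takes; the glosses on each category are my own shorthand, and the entries are generic placeholders rather than the actual model that follows.

```python
# A bare-bones sketch of the Principals-Goals-Adversities-Invariants shape;
# the comments are loose paraphrases and the entries are placeholders,
# not the actual threat model from this post.
from dataclasses import dataclass, field

@dataclass
class ThreatModel:
    principals: list[str] = field(default_factory=list)   # who touches the system
    goals: list[str] = field(default_factory=list)        # what we are trying to guarantee
    adversities: list[str] = field(default_factory=list)  # what an adversary can do to it
    invariants: list[str] = field(default_factory=list)   # properties that must always hold

storage_at_rest = ThreatModel(
    principals=["drive owner", "data-center technician", "downstream recycler"],
    goals=["data on a retired drive is unreadable by anyone outside the company"],
    adversities=["an adversary with physical possession reads the drive over SATA"],
    invariants=["no drive leaves our custody with recoverable customer data on it"],
)
print(storage_at_rest)
```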

Continue reading “What We’re Talking About, When We Talk About Data Destruction”

“Approachable Threat Modeling” in Increment

I can’t believe I haven’t posted about this until now! Straight-up slipped my mind.

I have an article published in Increment, Stripe’s software engineering magazine. The latest issue is themed around Security, and in it I talk about threat modeling, particularly in a software-as-a-service context.  It’s based a lot on the work at Akamai that I talk about here from time to time.

From the article:

Threat modeling is one of the most important parts of the everyday practice of security, at companies large and small. It’s also one of the most commonly misunderstood. Whole books have been written about threat modeling, and there are many different methodologies for doing it, but I’ve seen few of them used in practice. They are usually slow, time-consuming, and require a lot of expertise.

This complexity obscures a simple truth: Threat modeling is just the process of answering a few straightforward questions about any system you’re trying to build or extend.

To read more, go check it out on the Increment site!

(Oddly enough, this is my first paid professional long-form writing ever. It was extremely good to work with Sid Orlando and team at Increment—I had the best first-time author experience I could possibly have hoped for. If you have stuff to write about which is related to their upcoming topics, I can’t recommend pitching them enough.)