Some of the Problems with Root Cause Analysis

So this tweet blew up! If I could anticipate in advance which of my systems safety tweets would go big, I would have follow-up material prepped ahead of time, but instead you get this quick post.

I’m going to take this in two parts: the first, on the problems with root cause analysis, and the second, on Tarot and other methods of understanding complex systems (which I will link here once I publish it).

My basic frustration with Root Cause Analysis (both formal RCA as part of a Total Quality Engineering program and the informal, ad-hoc root cause analysis that we perform in the software industry every day) is that it fundamentally misunderstands how accidents happen.

As I’ve said here before (and drawing on a great body of literature), most really bad accidents happen not because of a component failure (a bolt shears, a belt snaps, a == is swapped for a != in a single line of code) but because multiple systems, which are behaving correctly (according to their designers’ understanding and intent), interact in a way that their designers didn’t foresee.  This can be as simple as subsystem A being specified to output values in metric units and subsystem B being specified to take input in US units.  If those systems ever talk to one another without their designers realizing it, bad things happen.
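To make the units example concrete, here is a minimal sketch in Python.  Everything in it is hypothetical and invented for illustration: the altitude sensor, the parachute controller, the function names and the 1000 ft threshold.  The point is only that each subsystem passes its own checks, and the bad outcome appears when they are wired together.

```python
FEET_PER_METER = 3.28084  # approximate conversion factor

def a_report_altitude(raw_sensor_m: float) -> float:
    """Subsystem A (hypothetical): specified to output altitude in meters."""
    return round(raw_sensor_m, 1)

def b_should_deploy(altitude_ft: float, threshold_ft: float = 1000.0) -> bool:
    """Subsystem B (hypothetical): specified to take altitude in feet."""
    return altitude_ft <= threshold_ft

# Each subsystem is correct according to its own specification.
assert a_report_altitude(500.04) == 500.0
assert b_should_deploy(900.0) is True
assert b_should_deploy(1640.0) is False

def wired_together(raw_sensor_m: float) -> bool:
    """The unforeseen interaction: A's meters are fed straight into B."""
    return b_should_deploy(a_report_altitude(raw_sensor_m))

def wired_with_conversion(raw_sensor_m: float) -> bool:
    """What the combined system would need: an explicit unit conversion."""
    return b_should_deploy(a_report_altitude(raw_sensor_m) * FEET_PER_METER)

# At 500 m (about 1640 ft) the unconverted wiring deploys far too early.
print(wired_together(500.0))         # B reads 500 as feet and deploys
print(wired_with_conversion(500.0))  # B sees about 1640 ft and waits
```

Neither function contains a bug you could find by inspecting it alone, which is why a component-by-component search for the broken part comes up empty.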

The mistake of root cause analysis (and many similar methods like “Five Whys”, fault tree analysis, and the “Swiss Cheese model”) is the belief that there is one, singular and necessary cause of any accident, when in fact there are many, varied and contingent instigating conditions, none of which need necessarily have caused an accident on their own.

Imagine a typical root cause analysis of the scenario I mention above.  If the fault was first noticed in subsystem A, the root cause analysis (performed by the subsystem A team) would almost certainly find that “subsystem B failed to correctly accept metric units.”  Whereas if the fault was first noticed in subsystem B, the root cause would be identified by the B team as “subsystem A erroneously output metric units.”  Opposite findings!  They can’t both be the root cause.  It’s only when we take a step back and look at the system as a whole that we have a hope of understanding what’s really going on.

This is especially true when there are relevant conditions which are outside the scope of our analysis, for example in the case of a common-mode failure.  A root-cause analysis which doesn’t consider the operating environment of the components will produce incorrect results.

Should the root cause in our example then be simply identified as “subsystem A was specified to output values in metric units, and subsystem B was specified to take input in US units, and they interacted”?  After all, that seems correct and complete based on what we know so far.  But why did they interact?  It’s quite possible that under most normal scenarios they never would.  Under what exceptional scenarios is their interaction possible?  And then what causes those scenarios to occur?  Etc., etc.

The five whys, you see, is all about asking “Why?” until you get an answer you like. Ideally one that shifts the blame to someone who cannot defend themselves and/or is very cheap to remediate.

— Richard Tibbetts (@tibbetts) December 30, 2018

All of these methods focus too much on the specific components and the path or paths which lead up to an accident.  The strength of modern methods, and especially STAMP-derived methods, is that they approach safety as avoiding a set of states or conditions of a system, rather than attempting to plug all the Swiss cheese holes one by one.

For the same reason that the STAMP-inspired security threat modeling approach I use and advocate completely ignores the questions “why does an adversary want to interfere with this system?” and “how does an adversary gain access to this system?”, and focuses only on “what can an adversary do to this system given access?”, the components involved and the path by which accident-enabling conditions arise are irrelevant to the safety of the system.  All that matters is whether those conditions obtain or not.
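As a loose illustration of thinking in states rather than paths (a sketch only, and not an implementation of STAMP or any method derived from it), consider a hypothetical parachute controller whose one safety constraint is “the chute must not be deployed above a safe ceiling.”  The `State` class, the constraint wording and the 1000 ft ceiling are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List

SAFE_DEPLOY_CEILING_FT = 1000.0  # assumed threshold, for illustration only

@dataclass(frozen=True)
class State:
    """Hypothetical snapshot of the whole system, not of any one component."""
    true_altitude_ft: float
    chute_deployed: bool

def violated_constraints(state: State) -> List[str]:
    """Return the safety constraints this state violates.

    Deliberately takes no argument describing how the state came about.
    """
    violations = []
    if state.chute_deployed and state.true_altitude_ft > SAFE_DEPLOY_CEILING_FT:
        violations.append("chute deployed above safe ceiling")
    return violations

# Two very different histories end in the same kind of hazardous state...
via_unit_mismatch = State(true_altitude_ft=1640.0, chute_deployed=True)
via_operator_error = State(true_altitude_ft=5000.0, chute_deployed=True)
# ...and one ordinary descent does not.
normal_descent = State(true_altitude_ft=800.0, chute_deployed=True)

for name, state in [
    ("unit mismatch", via_unit_mismatch),
    ("operator error", via_operator_error),
    ("normal descent", normal_descent),
]:
    print(name, violated_constraints(state))
```

The check flags the first two states identically, because it asks only whether the hazardous condition holds, not which component or chain of events produced it.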

Some folks on Twitter have told me that they use a form of root cause analysis which looks not for a singular root cause but for root causes, which is at least a step forward from the singular-cause form of analysis.  Still, there’s nothing in that analysis method which forces you to get out of the details far enough that you can see the proverbial forest for the trees.

That is, in brief, some of the problems with root cause analysis.  In part two, we consider the strengths and weaknesses of Tarot as a method of understanding complex systems (which I will link here once it’s up).