Redundancy is Not Reliability; or, The Parable of the Two Timing Belts

Lately when I’m trying to explain systems safety to people I often tell this story. I want to be clear upfront that 1), I completely 100% invented this story, even though I present it as a thing that actually happened; and 2), the story probably works better if you (like me) are not overburdened with knowledge of internal combustion engines. I will happily accept polite corrections in the comments.

The story I tell is this:

In the very early days of automobiles, when people were still figuring out a lot of the very basic things about them, one of the all-too-common problems was that the timing belts on their engines would break. The timing belt is a rubber belt which keeps the engine’s valves opening and closing in sync with the turning of the engine, and if it breaks the whole engine can be destroyed quite catastrophically. And so some clever engineer quite reasonably said to themself, “Well, why don’t we put a second timing belt on to keep the engine going if the first timing belt breaks?”

So they did, and they shipped it to their customers as a new and improved model that was less likely to fail catastrophically, and those people bought the cars and drove away happy… and then a short while later those new-model cars started to come back with broken timing belts at the same rate.

Now the engineers were really tearing their hair out, trying to figure out what was going on. If the first timing belt has a mean time to failure of 1000 hours, and the second timing belt also has a mean time to failure of 1000 hours, surely the system with two timing belts should have a mean time to failure of 2000 hours. It just stands to reason, right? Why were the cars still breaking just as fast?

Well, it turns out that there are a lot of possible reasons for this, but in this particular instance some clever engineer eventually noticed that the timing belts were more likely to fail in the winter and the summer and less likely to fail in spring and autumn. Now this was the early days of rubber vulcanization, and the material they were making these belts out of wasn’t very good yet. Before long they realized that the extreme heat of summer and the extreme cold of winter were making the belts fragile and prone to breakage. Both belts were exposed to the same environment, and that was why both belts would break at about the same time.
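If you like seeing this sort of thing as numbers, here is a rough simulation sketch. Everything in it is my own assumption rather than part of the story: belt lifetimes are drawn from an exponential distribution with a mean of 1000 hours, the engine is considered dead once both belts have broken, and the function name `mean_time_to_failure` is made up for illustration.

```python
import random

def mean_time_to_failure(trials, shared_environment, seed=0):
    """Estimate the MTTF (hours) of an engine that dies once BOTH belts break."""
    rng = random.Random(seed)
    belt_mttf = 1000.0  # hours, the figure used in the story
    total = 0.0
    for _ in range(trials):
        first = rng.expovariate(1 / belt_mttf)
        if shared_environment:
            # Assumption: the same heat or cold degrades both belts, so the
            # second belt breaks at essentially the same moment as the first.
            second = first
        else:
            # Assumption: the second belt fails independently of the first.
            second = rng.expovariate(1 / belt_mttf)
        total += max(first, second)  # engine lasts until the later failure
    return total / trials

independent = mean_time_to_failure(100_000, shared_environment=False)
common_cause = mean_time_to_failure(100_000, shared_environment=True)
print(f"independent belts:  {independent:.0f} hours")
print(f"shared environment: {common_cause:.0f} hours")
```

Under these assumptions, even fully independent belts come out at around 1500 hours rather than 2000, because both belts are wearing the whole time. Once they share a cause of failure, the pair does no better than a single belt, at around 1000 hours.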

I assume that improved materials were a big part of the fix, but I don’t rightly know. It’s not a common problem any more.

One thing to take from this story (which again, I 100% made up) is that we can’t (necessarily) build more-reliable systems by using more-reliable components. Most really bad accidents happen, not because a bolt sheared or a belt snapped, but because two subsystems, both operating correctly according to spec, interact in a way that their designers didn’t foresee.

And this can be as simple as not realizing when designing the overall system that Subsystem A (which is spec’d to output values in metric) is connected to Subsystem B (which is spec’d to take input values in Imperial). (In case you think I’m making this up too: I’m not.) No number of redundant Subsystems A or B will save you from that accident.
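Here is a toy sketch of that unit mismatch, to show why redundancy doesn't touch it. The two subsystems, the choice of impulse as the quantity being passed, and the median-voting scheme are all hypothetical; I picked them only for illustration.

```python
from statistics import median

LBF_S_PER_N_S = 0.224809  # pound-force seconds in one newton-second

def subsystem_a(true_impulse_n_s):
    """Hypothetical Subsystem A: reports impulse in newton-seconds, per its spec."""
    return true_impulse_n_s

def subsystem_b(impulse_lbf_s):
    """Hypothetical Subsystem B: expects pound-force seconds, per its spec,
    and converts the value to newton-seconds for its own use."""
    return impulse_lbf_s / LBF_S_PER_N_S

def redundant_reading(true_impulse_n_s, copies=3):
    """Run several copies of A and take the median, a simple voting scheme."""
    return median(subsystem_a(true_impulse_n_s) for _ in range(copies))

true_value = 100.0  # newton-seconds
seen_by_b = subsystem_b(redundant_reading(true_value))
error_factor = seen_by_b / true_value
print(f"B believes the impulse is {error_factor:.2f}x what it really is")
```

Every copy of A agrees with every other copy, so the vote passes cleanly, and B is still off by a factor of about 4.45. Each component does exactly what its spec says. The error lives in the connection between them.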

There are many techniques for trying to surface such specification errors, both while designing a system and after an accident has happened. Nancy Leveson’s STAMP family of techniques (open access PDF at link) is possibly the best of these so far. Sometimes those techniques may help designers determine that the best path forward is to add redundancy (it works for semi truck tires!). But naïvely adding redundancy is as likely to hurt as it is to help.