Redundancy is Not Reliability; or, The Parable of the Two Timing Belts

Lately when I’m trying to explain systems safety to people I often tell this story. I want to be clear upfront that (1) I completely, 100% invented this story, even though I present it as a thing that actually happened; and (2) the story probably works better if you (like me) are not overburdened with knowledge of internal combustion engines. I will happily accept polite corrections in the comments.

The story I tell is this:

In the very early days of automobiles, when people were still figuring out a lot of the very basic things about them, one of the all-too-common problems was that the timing belts on their engines would break. The timing belt is a rubber belt which keeps the engine’s valves opening and closing in sync with the turning of the engine, and if it breaks the whole engine can be destroyed quite catastrophically. And so some clever engineer quite reasonably said to themself, “Well, why don’t we put a second timing belt on to keep the engine going if the first timing belt breaks?”

So they did, and they shipped it to their customers as a new and improved model that was less likely to fail catastrophically, and those people bought the cars and drove away happy… and then a short while later those new-model cars started to come back with broken timing belts at the same rate.

Now the engineers were really tearing their hair out, trying to figure out what was going on. If the first timing belt has a mean time to failure of 1000 hours, and the second timing belt also has a mean time to failure of 1000 hours, surely the system with two timing belts should have a mean time to failure of 2000 hours. It just stands to reason, right? Why were the cars still breaking just as fast?

Well, it turns out that there are a lot of reasons this could happen, but in this particular instance some clever engineer eventually noticed that the timing belts were more likely to fail in the winter and the summer and less likely to fail in spring and autumn. Now this was the early days of rubber vulcanization, and the material they were making these belts out of wasn’t very good yet. Before long they realized that the extreme heat of summer and the extreme cold of winter were making the belts fragile and prone to breakage. Both belts were exposed to the same environment, and that was why both belts would break at about the same time.

I assume that improved materials were a big part of the fix, but I don’t rightly know. It’s not a common problem any more.
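To put rough numbers on the parable, here is a minimal simulation sketch. Everything in it beyond the 1000-hour figure is my assumption for illustration: belt lifetimes are drawn from an exponential distribution, both belts run at the same time, the engine dies when the second belt goes, and the "shared environment" case is modeled as one common failure time plus a few hours of jitter per belt. The function names are made up.

```python
import random

MTTF_HOURS = 1000.0  # per-belt mean time to failure, from the story


def independent_pair(rng):
    """Two belts that fail independently; the engine dies when the later one breaks."""
    first = rng.expovariate(1 / MTTF_HOURS)
    second = rng.expovariate(1 / MTTF_HOURS)
    return max(first, second)


def common_cause_pair(rng, jitter_hours=10.0):
    """Assumed model: one shared stress (heat, cold) sets the failure time for both belts.

    Each belt only differs from the shared time by a small random jitter.
    """
    shared = rng.expovariate(1 / MTTF_HOURS)
    first = shared + rng.uniform(0, jitter_hours)
    second = shared + rng.uniform(0, jitter_hours)
    return max(first, second)


def mean_lifetime(model, trials=100_000, seed=1):
    """Average engine lifetime over many simulated cars, with a fixed seed."""
    rng = random.Random(seed)
    return sum(model(rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("independent belts: ", round(mean_lifetime(independent_pair)))
    print("common-cause belts:", round(mean_lifetime(common_cause_pair)))
```

Under these assumptions the independent pair averages roughly 1500 hours, not 2000 (you would only get 2000 if the second belt sat idle until the first one broke), and the common-cause pair averages barely more than the 1000 hours of a single belt. The second belt buys almost nothing when both belts fail for the same reason.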

One thing to take from this story (which again, I 100% made up) is that we can’t (necessarily) build more-reliable systems by using more-reliable components. Most really bad accidents happen, not because a bolt sheared or a belt snapped, but because two subsystems, both operating correctly according to spec, interact in a way that their designers didn’t foresee.

And this can be as simple as not realizing, when designing the overall system, that Subsystem A (which is spec’d to output values in metric) is connected to Subsystem B (which is spec’d to take input values in Imperial). (In case you think I’m making this up too: I’m not.) No number of redundant Subsystems A or B will save you from that accident.
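Here is a small sketch of that kind of mismatch. The subsystems, the altitude reading, and the 5000-foot threshold are all hypothetical, invented purely for illustration. Each function does exactly what its own spec says, and the redundant copies of Subsystem A agree with each other perfectly.

```python
from statistics import median

FEET_PER_METRE = 3.28084


def subsystem_a_altitude():
    """Hypothetical Subsystem A: reports altitude in metres, per its spec."""
    return 2000.0


def subsystem_b_should_brake(altitude_ft):
    """Hypothetical Subsystem B: expects feet and brakes below 5000 ft, per its spec."""
    return altitude_ft < 5000.0


def vote(readings):
    """Redundancy: take the median of several copies of Subsystem A."""
    return median(readings)


# Three redundant copies of A, all healthy, all in agreement.
readings = [subsystem_a_altitude() for _ in range(3)]
agreed = vote(readings)

# Wired together as built: B reads 2000 metres as if it were 2000 feet.
wired_as_built = subsystem_b_should_brake(agreed)

# Wired together with the conversion the system design left out.
wired_correctly = subsystem_b_should_brake(agreed * FEET_PER_METRE)

print("as built:  brake =", wired_as_built)
print("converted: brake =", wired_correctly)
```

As built, Subsystem B brakes at an altitude where it should not, and adding a fourth or fifth copy of Subsystem A changes nothing, because every copy is correct. The error lives in the connection between the two specs, not in either component.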

There are many techniques for trying to surface such specification errors, both while designing a system and after an accident has happened. Nancy Leveson’s STAMP family of techniques (open access PDF at link) is possibly the best of these so far. Sometimes those techniques may help designers determine that the best path forward is to add redundancy (it works for semi-truck tires!). But naïvely adding redundancy is as likely to hurt as it is to help.

What We’re Talking About, When We Talk About Data Destruction

When I wrote my post back in May of last year about Apple’s recycled hardware reuse policy, I found myself frustrated by how hard it was to talk about how well a storage device had been destroyed, or even what threats one might be concerned about, which would lead one to want to physically destroy it.

Early in the work that I did at Akamai on data destruction, we built a very casual sort of threat model, but we never worked it up in any more rigorous fashion, which would have allowed us to talk consistently about the threats we were concerned about. We still managed to deliver a coherent solution, but I think it’s worth formalizing exactly what we were trying to achieve.

It’s very easy to get distracted by the spy-games aspect of data destruction. Everybody brings up thermite when I mention the topic. This DEF CON presentation by my friend Zoz a few years ago suggests the limits of it as a practical solution. In reality, somebody pulling your data off with a SATA cable because you forgot to wipe the drive before disposing of it is always your biggest worry.

Here is my attempt at a threat model for information disclosure attacks on storage devices at rest, on the Principals-Goals-Adversities-Invariants rubric I wrote about in Increment.  (“If you’re not talking about an adversary, you aren’t doing security.”)

Before going any further, a disclaimer, as always: although I talk about things that I’ve done for work here, I speak only for myself and not for any current or previous employers.


“Approachable Threat Modeling” in Increment

I can’t believe I haven’t posted about this until now! Straight-up slipped my mind.

I have an article published in Increment, Stripe’s software engineering magazine. The latest issue is themed around Security, and in it I talk about threat modeling, particularly in a software-as-a-service context.  It’s based a lot on the work at Akamai that I talk about here from time to time.

From the article:

Threat modeling is one of the most important parts of the everyday practice of security, at companies large and small. It’s also one of the most commonly misunderstood. Whole books have been written about threat modeling, and there are many different methodologies for doing it, but I’ve seen few of them used in practice. They are usually slow, time-consuming, and require a lot of expertise.

This complexity obscures a simple truth: Threat modeling is just the process of answering a few straightforward questions about any system you’re trying to build or extend.

To read more, go check it out on the Increment site!

(Oddly enough, this is my first paid professional long-form writing ever. It was extremely good to work with Sid Orlando and team at Increment—I had the best first-time author experience I could possibly have hoped for. If you have stuff to write about which is related to their upcoming topics, I can’t recommend pitching them enough.)

How to Interview Your Prospective Manager

I’m in the process of negotiating offers for my next role now. One of the things I’ve learned the hard way is how important good management is—especially for me, since I’m kind of a hard case, but in general.  It’s said that people leave managers, not companies, and I know that that’s been true of my experience. It turned out that I got very lucky in my early jobs, and up until recently my first managers were my high water mark.

Unfortunately, the traditional job interview doesn’t give over much time to learning about the person who would be managing you.  (Sometimes you don’t even meet with them.) While you as the candidate are always implicitly interviewing your interviewers, it’s nice to have time set aside for it.

Mudge had not yet signed on as the new head of security when I got the offer from Stripe, but the recruiting team had told me he was considering it, and I knew I didn’t want to sign on to a new team without talking with the person I’d be reporting to.

I knew Mudge only by reputation and vaguely at that, and I didn’t want to join a team only to have some new manager come in and clean house and install all their own people. I delayed accepting until Mudge was ready to talk, and then we had a long phone conversation where I effectively interviewed him as my new manager.  (He was great, it turned out. 🙂)

Going through the process again now, I’ve come back to these questions, and I’m going through the same process with my new potential managers.  It’s proving extremely fruitful.

Here’s what I’m asking:

  • What is your vision for the organization?
  • Where do you see the organization fitting in the overall picture at the company?
  • Where do you want the organization to grow?
  • What’s your plan for scaling the organization?
  • What do you like in a manager?
  • What do you dislike in a manager?
  • How do you view your relationship with the people who work for you?
  • What is your philosophy of management?
  • What makes you excited to come to work every day?
  • Can you tell me about a specific time that you were wrong, and how you handled it?
  • You have two employees who don’t get along. What’s your approach?
  • Have you handled harassment complaints before (sexual or otherwise)? What happened?
  • You have an employee who’s struggling. How do you handle that?
  • What do career paths forward look like for this position?
  • How much support is there for presenting at conferences and other professional development?
  • What are your preferences around hours/work from home?
  • How much contact do you need from the folks who work for you?
  • What problems do you see facing the company over the next three years?
  • What problems do you see facing the industry over the next three years?

Interviewing your prospective manager is absolutely something you can and should do, and these are questions I’ve found useful.

Is there something I’ve missed that you like to ask about?  Leave a comment!