Blog

What We’re Talking About, When We Talk About Data Destruction

When I wrote my post back in May of last year about Apple’s recycled hardware reuse policy, I found myself frustrated by how hard it was to talk about how well a storage device had been destroyed, or even about the threats that would lead one to want to physically destroy it in the first place.

Early in the work that I did at Akamai on data destruction, we built a very casual sort of threat model, but we never worked it up rigorously enough to let us talk consistently about the threats we were concerned about. We still managed to deliver a coherent solution, but I think it’s worth formalizing exactly what we were trying to achieve.

It’s very easy to get distracted by the spy-games aspect of data destruction. Everybody brings up thermite when I mention the topic. This DEF CON presentation by my friend Zoz a few years ago suggests its limits as a practical solution. In reality, your biggest worry is always somebody pulling your data off with a SATA cable because you forgot to wipe the drive before disposing of it.

Here is my attempt at a threat model for information disclosure attacks on storage devices at rest, on the Principals-Goals-Adversities-Invariants rubric I wrote about in Increment.  (“If you’re not talking about an adversary, you aren’t doing security.”)

Before going any further, a disclaimer, as always: although I talk about things that I’ve done for work here, I speak only for myself and not for any current or previous employers.

Threat Model

While my work has primarily focused on sanitizing solid-state drives (SSDs), this threat model should apply equally well to any read-write storage device, including platter drives, although not all the examples may apply. I’m also not going to address read-only media like CDs and DVDs, although it should be straightforward to extend this threat model to cover them.

I’m choosing to concern myself here only with physical storage devices which are “at rest” in the fullest sense, i.e. powered off and not connected to a computer.  Threat modeling of online storage devices and logical storage devices is a much bigger topic, and one I don’t have the understanding or experience to discuss.

This is also focused on laptop, desktop, and server storage devices. Mobile phones and similar are largely outside my experience.

Here’s the threat model:

Principals

(“Who and what is involved?”)

  • Storage device at rest
  • Storage device owner
  • Data destruction service
  • Data destruction service employees

System Diagram

(“How are the principals related to each other?”)

Edit: This is a control loop diagram.  Arrows are actuators (pointing down) and sensors (pointing up).

This is perhaps more detail than is strictly necessary, but I include it here for completeness. The key takeaway is that the storage device owner is ultimately in charge of establishing and enforcing the invariants that all other objects and parties exist under.

Goals

(“What are we trying to do?”)

  • Dispose of storage device (relinquish physical possession)
  • Prevent information disclosure from the storage device

Adversities

(“What bad things can other actors, including ‘Murphy’, do that might prevent us from achieving our goals?”)

Adversary powers:

  • Use a SATA cable or equivalent
  • Attach drive to another computer
  • Buy a drive of the same make & model
  • Find a vulnerability in the storage device’s firmware
  • Access unlocked vehicles or storage areas
  • Become an employee at a data destruction provider
  • Use soldering, rework, chip decapping, and other electronics bench techniques
  • Use an optical microscope and image processing software
  • Use a Scanning Electron Microscope (SEM) (~$175/hr circa August 2017)

Accidents (“adversary: Murphy”):

  • Storage device not wiped before handoff to data destruction provider employee
  • Storage device incompletely wiped before handoff
  • Storage device wiped using an inappropriate method for its type (e.g. zeroing an SSD)
  • Operator exposed to toxic chemicals or dust

Out of scope:

  • Find a vulnerability in software full-disk encryption (FDE)
  • Become an employee of the storage device owner

Invariants

(“What must be true about this system in order for it to be safe?”)

  1. It must be impossible to recover any data from the storage device once it leaves the control of the storage device owner. “Control” here is transitive: in this model, the storage device owner still controls the device while it is in the possession of a data destruction service under contract to the owner.  Once the device finally and fully leaves the owner’s direct or transitive control, all the data on it must be unrecoverable.
  2. The storage device owner must receive convincing evidence that it is impossible to recover any data from the storage device by the time it leaves the owner’s control.
  3. The storage device owner must receive convincing evidence that data on the device was never accessed by the data destruction company or its employees prior to the time it leaves the owner’s control.

Using the Threat Model to Evaluate Interventions

Now that we have a threat model, we can assess potential interventions for how well they uphold its invariants.  Mostly I’m interested in how well they uphold invariant #1. (Invariants #2 and #3 are equally if not more important, but they’re much more dependent on operational details than technical details, and too much so for me to go into them here.)

Filesystem-Level Formatting

The first intervention to consider is formatting the drive. (That is, telling the OS to recreate the filesystem, using e.g. `mkfs.ext4` on Linux.) This is what most people think of when they think of “erasing a drive”.  While this makes the data inaccessible to casual access, the underlying data is still present on the device, and any attacker who can use a SATA cable and attach the device to another computer can access the ‘raw’ device (via e.g. `/dev/sda`), read its full contents, and reconstruct the filesystem or recover individual files easily. This doesn’t uphold invariant #1 very well at all.
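
To make this concrete, here’s a minimal sketch of the failure, runnable against a loopback file so no real drive is harmed (the paths and sizes are illustrative; it needs root):

```
# Create a 64MB scratch "disk" backed by a plain file.
dd if=/dev/zero of=/tmp/disk.img bs=1M count=64
LOOPDEV=$(losetup -f --show /tmp/disk.img)

# Make a filesystem, write a secret, unmount.
mkfs.ext4 -q -E nodiscard "$LOOPDEV"
mount "$LOOPDEV" /mnt
echo "SECRET-DATA-12345" > /mnt/secret.txt
umount /mnt

# "Erase" the drive by reformatting it...
mkfs.ext4 -q -E nodiscard "$LOOPDEV"

# ...then recover the secret anyway by scanning the raw device.
strings "$LOOPDEV" | grep SECRET

losetup -d "$LOOPDEV"
```

(`-E nodiscard` keeps `mkfs.ext4` from issuing a TRIM to the loop device, which would spoil the demonstration.)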

Zeroing

The second intervention is zeroing the drive or overwriting it with some kind of pattern or random data.  This is what the Defense Security Service’s Clearing and Sanitization Matrix (DSS C&SM; commonly mis-cited as the Department of Defense’s 5220.22-M standard) used to require, as implemented in popular tools like Darik’s Boot and Nuke (DBAN). How well this process works has varied over the years as drive technology has changed, and its effectiveness was not well-tested to begin with.

For modern platter drives, a single application-level zero pass is sufficient to render data inaccessible even to fairly sophisticated attacks. However, a zero pass will not overwrite areas hidden by the device firmware’s Host Protected Area (HPA) feature, where sensitive data may reside. Worse, for SSDs, an application-level zero pass will also not overwrite NAND flash blocks which have been remapped due to errors but which may still be accessible through the device’s firmware. An attacker who can use a SATA cable and attach the device to another computer can, knowing something about the ATA standard and the particular drive’s firmware, easily recover such data. This intervention is better than simply formatting the device but still not very good.
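
Concretely, the intervention and its biggest blind spot look something like this (`/dev/sdX` is a placeholder; double-check the device name before pointing `dd` at anything):

```
# Check for a Host Protected Area: if the visible sector count is less
# than the native maximum, the firmware is hiding sectors that a zero
# pass will never touch.
hdparm -N /dev/sdX

# The zero pass itself, over everything the host can see.
dd if=/dev/zero of=/dev/sdX bs=1M status=progress
```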

Hardware Encryption or Deletion

Many modern SSDs claim to support some sort of hardware-level feature to delete data permanently (e.g. ATA Secure Erase or TRIM), or else hardware encryption, where rotating the key used to encrypt the drive is treated as equivalent to deleting all the data on it.  Unfortunately drive manufacturers are not security specialists, and many drives have vulnerabilities which allow the encryption to be circumvented. Even where the encryption is implemented correctly, there are many attacks at both the firmware and hardware level which could compromise the integrity of the encryption and deletion processes.  In these circumstances, an attacker who can use a SATA cable and attach the device to another computer can, knowing something about the firmware and hardware, easily recover data.  This is still not a very good guarantee of invariant #1.
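
For reference, here’s the classic recipe for invoking a drive’s built-in erase, with the caveat above that you’re trusting the vendor’s firmware to do what it claims (`/dev/sdX` and the throwaway password are placeholders):

```
# Set a temporary security password, which the erase command requires...
hdparm --user-master u --security-set-pass Eins /dev/sdX

# ...then ask the firmware to erase the drive. On self-encrypting
# drives this typically just rotates the internal encryption key.
hdparm --user-master u --security-erase Eins /dev/sdX
```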

Software Full-Disk Encryption

Essentially all major consumer desktop operating systems (Windows, Mac, Linux) now provide built-in support for software-level full-disk encryption, although none enable it by default (and some may use hardware encryption instead where available).  While vulnerabilities do turn up in their implementations from time to time, they use better cryptographic primitives, have been audited more thoroughly than the device manufacturers’ hardware encryption support, and have a much better security track record.  Deleting or rotating the encryption key then serves to delete the data on the device with a high degree of confidence. While critical vulnerabilities almost certainly still exist in their implementations, assuming a strong passphrase, software full-disk encryption is the safest fully-software way to uphold invariant #1.
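
On Linux, the key-destruction step can look like this with LUKS; a minimal sketch, with `/dev/sdX2` standing in for the encrypted partition:

```
# Confirm the partition really is LUKS before doing anything drastic.
cryptsetup luksDump /dev/sdX2

# Destroy every key slot (irreversible). The ciphertext remains, but
# without the key material it's only as recoverable as the cipher is weak.
cryptsetup erase /dev/sdX2
```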

High-Quality Physical Destruction

All of these software interventions require that a device be working correctly in order to delete its data. When a device has failed in some fashion at either the firmware or hardware level, we can no longer rely on software-level interventions.  The highest commercially-available standard is the NSA’s Media Destruction Guidance from NSA/CSS Policy Manual 9-12, which approves a variety of sanitization methods depending on device type, but, if disintegration is used, requires a 2mm grain size for both platter (magnetic) drives and solid-state drives.  This is the “gold standard” of data destruction and provides extremely strong guarantees that invariant #1 is upheld.

Standardizing Storage Device Destruction Levels

Based on this intervention analysis we can start to group potential data destruction interventions by how expensive they are to attack.  I’ve assigned these each a Destruction Level (DL) number, for ease of reference, e.g. “the Apple recycled hardware policy guarantees at minimum destruction to the DL3 standard.”

All attacker costs and defenses are current circa 2018 and to the best of my knowledge.  All attacker costs assume physical access to the device or some pieces of it and ability to do fairly invasive things to them.

However, I’m assuming that the attacker did not have physical access to the drive before it reached its “at rest” state, nor electronic access to any computer in which the storage device was installed and online.

The important thing here is what the attacker can do, and with what resources.  The defenses are provided as examples and largely for context. As both attacker and defender tools and techniques evolve, defenses which were previously DL3 may become DL2 or lower.  A company’s goal will always be to maintain its data destruction processes at a particular destruction level given current known and anticipated attack & defense capabilities.

Destruction Level | Attacker can… | …because Defender did (e.g.)…
DL0 | Just plug the drive in with standard equipment and extract data | No sanitization (e.g. forgot to wipe); unrotated FDE key with a weak passphrase (“hunter2”); filesystem-level format
DL1 | Plug the drive in with standard equipment and, knowing something about the firmware, extract data | Zeroed but left host protected areas; low-level access to remapped blocks; bad firmware; FDE with a broken or malicious key rotation process
DL2 | Extract data with specialized hardware: a soldering iron or rework station and <$100 in additional services, parts, equipment, or exploits | Destroyed the controller board but left the platters intact; left one or more NAND flash chips intact
DL3 | Extract data only with >$100 in specialized hardware, services, parts, equipment, or exploits | Well-implemented software FDE with a working key rotation process; unrotated FDE with a strong passphrase
DL4 | No known process to recover data | Destroyed to NSA 9-12
DL5 | No known or hypothesized process to recover data | Curie point ??? smelter ??? volcano ???

For reference, here’s where the interventions evaluated above fall:

  • Filesystem-Level Formatting: DL0
  • Zeroing: DL1
  • Hardware Encryption or Deletion: DL1
  • Software Full-Disk Encryption: DL3
  • High-Quality Physical Destruction: DL4

I think that DL3 is the sweet spot here for most companies and individuals. The description is slightly misleading: I drew the line between DL2 and DL3 at $100 of services, parts, and equipment beyond a basic rework setup because that caps DL2 nicely, but the good news is that all the DL3 examples, as currently understood, require much, much more than $100 to attack.

The distinction I’m drawing, and the reason that the examples in DL3 don’t fall under DL4, is that vulnerabilities in software FDE (or what appears from the outside to be software FDE) are not unheard of, that the hardware and software details of key storage and rotation in new and novel crypto chips like Apple’s Secure Enclave are not well understood or well audited, and that given a large enough data center it’s possible to brute-force passphrases.

While I’m not currently aware of any outstanding zero-days against mainstream software FDE implementations (and of course I wouldn’t tell you if I were), I think we can reasonably anticipate that the public will discover more exploits against FDE implementations within the lifetime of any computer bought today, which could reduce the destruction level of its FDE from DL3 to something lower, and that it would be possible to buy or develop such an exploit given a reasonable amount of money (6–8 figures). In contrast, it is just much, much harder to make the same claim about a bag full of dust, as produced by methods which meet the NSA 9-12 standard, which is why those are grouped under DL4.

Achieving Destruction in Practice

Using Commercial Data Destruction Vendors for DL3

Caveat emptor: this is not advice in my professional capacity. This is a blog post on the Internet. I make no representations about its applicability for your particular situation or purpose; no warranty is expressed or implied. If you want consulting on this for your particular situation in my professional capacity, my contact form is above and my rates are very reasonable.

That said:

Third-party offsite commercial data destruction varies wildly between DL0 and DL3 depending on how good the third-party vendor’s controls are around physical & transportation security and personnel management, and whether destruction is witnessed or not.  Chain of custody is key. Obviously based on the blog post I wrote, you know that I trust e.g. Apple’s destruction vendors more than I trust others’. Generally speaking, though, unless you’re willing to go to a lot of trouble to vet a vendor’s operations, I’d avoid third-party offsite data destruction as a category.

Third-party onsite commercial data destruction should be capable of achieving DL3 destruction fairly easily if you are wise about selecting the right destruction method and required grain size for your storage devices.

I’m fairly lax with platter drives as long as some kind of physical destruction of the platters is taking place (although NSA 9-12 is not so lax), but with SSDs I aim for a grain size of no more than half the longest dimension of the flash chips in question, to ensure that whole chips never survive the destruction process. (E.g. if the flash chips are in 2cm x 4cm packages I’ll require a grain size of no more than 2cm, to ensure each chip is cut at least in half.)  This is the distinction between DL2 and DL3 for SSDs.
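
The arithmetic is trivial, but here’s a sketch of the rule anyway, assuming you’ve measured the packages yourself (the dimensions are illustrative):

```
# Max grain size = half the longest dimension of the flash package, in mm.
chip_w=20; chip_l=40   # e.g. a 2cm x 4cm package
echo "max grain: $(( (chip_w > chip_l ? chip_w : chip_l) / 2 ))mm"
```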

The primary adversary powers at issue in onsite commercial data destruction are personnel-related—meaning, a malicious data destruction employee really can quite easily just pocket a drive.  Managing this goes to invariants #2 and #3, and is out of scope of this blog post.

Achieving DL3 Without Physical Methods

The good news is that Apple’s FileVault full-disk encryption on Macs and LUKS on Linux have decent track records, so using the factory reset or OS restore functionality to rotate the encryption key should be sufficient to achieve DL3, which should be enough for most people and companies, provided full-disk encryption has been turned on for the lifetime of the device (i.e., since before sensitive data was put on the device).
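
Before relying on key rotation, it’s worth confirming the device really is fully encrypted; two quick checks (the Linux partition path is a placeholder):

```
# macOS: is FileVault actually on?
fdesetup status

# Linux: is the partition actually a LUKS volume?
cryptsetup isLuks /dev/sdX2 && echo "LUKS volume"
```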

This last point is important.  If full-disk encryption has not been turned on for the lifetime of the device, then unencrypted data may persist in remapped blocks of flash storage, which means that when the encryption key is rotated, the device has actually only been destroyed to the DL1 standard.  (An attacker can attach the device with normal hardware and, using low-level firmware commands, potentially read unencrypted data from those remapped blocks.) This is why device lifecycle management is so important to security in the context of data destruction, although again managing that is out of scope for this post.
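
You can at least see how much remapping a drive admits to via its SMART data; a sketch, with the device path a placeholder. A nonzero count doesn’t prove plaintext survives, but remapped blocks are exactly what a zero pass or late-enabled FDE never touched:

```
# Reallocated/retired block counts from the drive's SMART attributes.
smartctl -A /dev/sdX | grep -i -e reallocated -e retired
```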

All of this is within the context of encryption at rest.  Windows 10’s BitLocker, as previously mentioned, has sadly made some questionable implementation decisions, bringing it down to DL1, and I’m leery of fully trusting it for either encryption at rest or online encryption.  Both BitLocker and Apple’s FileVault have been subject to cold boot attacks, which, while online attacks and out of scope of this at-rest discussion, merit some consideration in practice.  FileVault also had a reboot direct memory access (DMA) bug, again an online attack.  It’s quite likely LUKS is vulnerable to some variants of these as well, depending on the exact hardware and configuration.  An analogous data-protection attacker-effort chart for online attacks would place these implementations somewhere in the DL1–DL2 range—lower than we’d like. While well-implemented software full-disk encryption can provide good at-rest DL3, we can’t rely on it as our only method of data protection.

Parsing out exactly how to achieve DL3 is a grey and evolving area, and I don’t think there is necessarily one right answer, but given the checkered history of software FDE implementations, I prefer to see software FDE as a failsafe in the case of lost or stolen devices and, where time and resources are available, prefer physical methods to achieve DL3 at end-of-life.

Who Needs DL4?

The attack against solid-state DL3 which I’m most concerned about is essentially the microscopic version of the 2011 DARPA Shredder Challenge.  The easiest attack against full chips is likely to resolder them on to another device of the same make and model, but I suspect partial chips as produced by DL3 need to be decapped and imaged, and the images reconstructed if the partial chips are small enough due to the grain size used in shredding, before data can be extracted.

(While flash chips store data using static electric charges, I hypothesize that they may be susceptible to a “burn-in” effect, especially when memory cells fail, which would be visible using some form of microscopy.  This is how we can image fuse-based ROMs, for example, although obviously those are designed to have a very evident burn-in effect! For old chips made using >200nm processes, as a lot of old ROMs were, optical microscopy may even be sufficient.)

Obviously, based on the challenge results, the reconstruction is quite likely possible without state-level resources.  (As the story was related to me, the winning team relied on printer dots, which are obviously not present in chip images—but is there something equivalent?)  Decapping and imaging modern flash chips is the tricky part, but I can’t rule out that at least some of them are potentially within the reach of a sufficiently patient and well-resourced private attacker, especially with large enough chip pieces.

Additionally, the amount of data which needs to be recovered is quite small: for instance, only a bit more than a 1KB random sample of the bits of a 4KB RSA key is necessary to reconstruct the whole key.  The trickiest part is likely to be confirming that you are pulling bits from the right region of the device, rather than just pulling any bits at all.

Choosing between DL3 and DL4 is ultimately also a grey and evolving question.  If you have a device which has ever held sensitive data unencrypted, and you’re up against state-sponsored adversaries (and that’s more likely than you think), consider DL4.  (Are they going through your trash?  How would you tell if they were? You should force your adversaries to use methods that are more likely to get them caught.)

To Sum Up

Obviously there’s a lot of grey area here.  No one destruction process or destruction level is right for every device and everyone. We will all need to update our processes to reliably achieve a targeted level of data destruction at a reasonable cost in time, money, labor, and other resources as more attacks are published and existing attacks become more common and scalable.  We will even need to update our threat models as wholly new adversary capabilities are discovered. Nevertheless, what this provides, I hope, is a base threat model on which to build, and a common language with which to describe interventions against it.

Postscript: A Legal Note

Most posts on this site are licensed under a Creative Commons BY-NC-SA 3.0 US License.  This post, however, is Copyright ⓒ 2019 Kevin Riggle. All Rights Reserved.

If you are interested to use this post in, for example, internal or external policies at your company, I take this copyright to mean that you may link to or otherwise reference this post (e.g. “we choose destruction methods which destroy storage devices to the Destruction Level 4 (DL4) standard as defined under the section ‘Standardizing Storage Device Destruction Levels’ in Kevin Riggle’s ‘What We’re Talking About, When We Talk About Data Destruction’ post (link) (accessed 2019-01-01)”). You may also of course use portions of this post under the terms of fair use, as allowed of any copyrighted work.

However, if you intend to include this post outright, either in its entirety or in substantial portions, in e.g. internal or external policies for your company, or if you intend to make derivative works, then you need to reach out to me for permission.  (And, purely as a personal matter, I’m always interested in and excited to hear about people using my work!) Thank you.

Copyright ⓒ 2019 Kevin Riggle. All Rights Reserved.

Header image by iStock.com/baloon111.

“Approachable Threat Modeling” in Increment

I can’t believe I haven’t posted about this until now! Straight-up slipped my mind.

I have an article published in Increment, Stripe’s software engineering magazine. The latest issue is themed around Security, and in it I talk about threat modeling, particularly in a software-as-a-service context.  It’s based a lot on the work at Akamai that I talk about here from time to time.

From the article:

Threat modeling is one of the most important parts of the everyday practice of security, at companies large and small. It’s also one of the most commonly misunderstood. Whole books have been written about threat modeling, and there are many different methodologies for doing it, but I’ve seen few of them used in practice. They are usually slow, time-consuming, and require a lot of expertise.

This complexity obscures a simple truth: Threat modeling is just the process of answering a few straightforward questions about any system you’re trying to build or extend.

To read more, go check it out on the Increment site!

(Oddly enough, this is my first paid professional long-form writing ever. It was extremely good to work with Sid Orlando and team at Increment—I had the best first-time author experience I could possibly have hoped for. If you have stuff to write about which is related to their upcoming topics, I can’t recommend pitching them enough.)

How to Interview Your Prospective Manager

I’m in the process of negotiating offers for my next role now. One of the things I’ve learned the hard way is how important good management is—especially for me, since I’m kind of a hard case, but in general.  It’s said that people leave managers, not companies, and I know that that’s been true of my experience. It turned out that I got very lucky in my early jobs, and up until recently my first managers were my high water mark.

Unfortunately the traditional job interview doesn’t give over much time to learning about the person who would be managing you.  (Sometimes you don’t even meet with them.) While you as the candidate are always implicitly interviewing your interviewers, it’s nice to have time set aside for it.

Mudge had not yet signed on as the new head of security when I got the offer from Stripe, but the recruiting team had told me he was considering it, and I knew I didn’t want to sign on to a new team without talking with the person I’d be reporting to.

I knew Mudge only by reputation and vaguely at that, and I didn’t want to join a team only to have some new manager come in and clean house and install all their own people. I delayed accepting until Mudge was ready to talk, and then we had a long phone conversation where I effectively interviewed him as my new manager.  (He was great, it turned out. 🙂)

Going through the process again now, I’ve come back to these questions, and I’m going through the same process with my new potential managers.  It’s proving extremely fruitful.

Here’s what I’m asking:

  • What is your vision for the organization?
  • Where do you see the organization fitting in the overall picture at the company?
  • Where do you want the organization to grow?
  • What’s your plan for scaling the organization?
  • What do you like in a manager?
  • What do you dislike in a manager?
  • How do you view your relationship with the people who work for you?
  • What is your philosophy of management?
  • What makes you excited to come to work every day?
  • Can you tell me about a specific time that you were wrong, and how you handled it?
  • You have two employees who don’t get along. What’s your approach?
  • Have you handled harassment complaints before (sexual or otherwise)? What happened?
  • You have an employee who’s struggling. How do you handle that?
  • What do career paths forward look like for this position?
  • How much support is there for presenting at conferences and other professional development?
  • What are your preferences around hours/work from home?
  • How much contact do you need from the folks who work for you?
  • What problems do you see facing the company over the next three years?
  • What problems do you see facing the industry over the next three years?

Interviewing your prospective manager is absolutely something you can and should do, and these are questions I’ve found useful.

Is there something I’ve missed that you like to ask about?  Leave a comment!

Why Is It So Hard To Build Safe Software?

Asking aircraft designers about airplane safety: Hairbun: Nothing is ever foolproof, but modern airliners are incredibly resilient. Flying is the safest way to travel. Asking building engineers about elevator safety: Cueball: Elevators are protected by multiple tried-and-tested failsafe mechanisms. They're nearly incapable of falling. Asking software engineers about computerized voting: Megan: That's terrifying. Ponytail: Wait, really? Megan: Don't trust voting software and don't listen to anyone who tells you it's safe. Ponytail: Why? Megan: I don't quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die. Ponytail: They say they've fixed it with something called "blockchain." Megan: AAAAA!!! Cueball: Whatever they sold you, don't touch it. Megan: Bury it in the desert. Cueball: Wear gloves.
XKCD #2030: “Voting Software”; used under the terms of its Creative Commons Attribution-NonCommercial 2.5 License.

Or, “Robert Graham is dead wrong”.

This XKCD comic on voting software security has been going around my computer security Twitter feed today, and a lot of folks have Takes on it.

It gets at something fundamental. What is it that makes software safety so hard?

A couple years ago, at the March 2016 STAMP Workshop in Cambridge, Massachusetts I gave a talk titled “Safety Thinking in Cloud Software: Challenges and Opportunities” where I tried to answer that. (As always, I talk about work here but don’t speak on behalf of any former employer.) What follows is based on my notes for that talk.

I would say that responses to the comic have fallen into two big groups:

  1. Software safety is really hard because we have adversaries.
  2. The comic is needlessly nihilistic about software safety.

Robert Graham’s post “That XKCD on voting machine software is wrong” is the glass-case example of the first argument, that software safety is uniquely hard because we have adversaries.

This line of argument is fundamentally wrong, and betrays an ignorance of systems safety in general and its practice in aviation in particular.

First off, it’s just fundamentally incorrect to say that in software we have adversaries whereas in aviation we don’t. Remember, the statement the comic puts in the mouth of an aircraft designer isn’t a qualified statement—even in the presence of adversaries (9/11, MH17, even the infamous so-called “shoe bomber”) flying is still the safest way to travel.

Systems safety defines what we casually call an ‘accident’ formally as an unacceptable loss—any unacceptable loss. It doesn’t distinguish between adversarially-induced losses and non-adversarially-induced losses.

Considered from a systems safety perspective, the aviation system includes organizations like the TSA, the air marshal program, and air traffic control. It includes cockpit door locks, the fence around the airport, even the folks who go out to scare the geese away from the runways, all of which have important anti-adversarial functions.  (Man, geese, now there’s an advanced persistent threat.)

So Rob’s argument is facially, factually wrong. Now, why is it so hard to build safe software systems?

There are five big factors which make it harder to keep modern software systems safe than to keep best-in-class physical systems like airliners and elevators safe. Namely:

  1. Software is leaner.
  2. Software moves faster.
  3. Software is more complex.
  4. The geography & physics of networks are different.
  5. Consequences for adversaries are lower.

Let’s address each of these in turn.

Software is Leaner

At Akamai, we had a team of 4 people who, part-time around other responsibilities, reviewed about 50 incidents a year and managed the incident process for a company of 6,000 people.

This isn’t unusual—at Stripe we had I think 2 people part-time reviewing incidents and managing the incident process for a company of around 1,000 people.

By contrast, I had the pleasure of sitting down with a member of the Dutch Safety Board at one of the STAMP workshops, a year or two before I spoke. He told me that they work in teams of 5 or 6, for a total of 30 people, and investigate between 5 and 10 accidents a year.

In software we need to make do with many fewer people on the problem. Partly, to be sure, this is an area which software companies could resource more heavily, but there’s a bedrock belief that we expect to be able to make do with fewer people.

Software Moves Faster

At Akamai, the Infosec design safety review team of four to eight people reviewed about 50 new systems a year. We were generally given two business days to read about 60 pages of design documentation and provide feedback, which a security representative would take to a broader architectural review board session. And Akamai was notably hard-nosed about safety compared to some of our more agile competitors.

By contrast, the aircraft which became the Boeing 787 Dreamliner was announced in 2003, based on technology which had been in development since the late 1990s. The first production aircraft was delivered eight years later, in 2011. And the planes have an expected operational lifetime of something like 40 years. In software, if code I write is still in production six months from now, I’ll consider it to have real longevity.

Software is More Complex

The core understanding of systems safety is that most accidents, especially the really bad ones, happen not because of component failures (a popped tire or a snapped timing belt) but because of interaction failures—two systems, operating according to spec, interacting in ways the designers didn’t foresee. And software systems provide exponentially more opportunities for interaction failures than physical systems do.

There are lots of ways to measure complexity, but the proper way in a systems safety context is to count the number of feedback loops, and software systems have a truly enormous number. Any state in a program, including its stack, provides the opportunity for a feedback loop. Any connection over a network is inherently a feedback loop. And any modern web server can support thousands of them a second.

The Geography & Physics of Networks are Different

The geography and physics of networks are very weird compared to the geography and physics of the physical world.  Time and distance limit the interactions which can occur between planes. Three-dimensional space limits the interactions which can occur in an elevator shaft.

On a network, on the other hand, things which are separated by miles or continents can be more or less adjacent. In fact, it’s very hard to make things not be effectively adjacent on the Internet. We go to a lot of effort to erect barriers.

It’s so easy to connect everything to everything else in software that often we do so completely by accident, and it’s frankly a wonder that things short out as rarely as they do.

Consequences for Adversaries are Lower

In order to successfully hijack a plane, an attacker needs to run a very real risk of death or being arrested. In a suicide attack like the 9/11 hijackings, the attacker dying is even part of the plan!  This weeds out all but the most committed, ideological adversaries.  Even an attack like the MH17 shoot-down, where the adversaries weren’t themselves directly risking death, could have resulted in sanctions against their entire country.

By contrast—because of the weird geography of networks—there are very few cyberattacks where the attacker is directly risking death. While countries have been sanctioned over cyberattacks, that’s a relatively new phenomenon, and it’s harder work for law enforcement to track down the perpetrators in the first place, since cyberattacks are so much more likely to cross jurisdictional boundaries.

 

What, then, are we to make of all this? Are the nihilists right? Is software doomed to always be unsafe?

I don’t think so, and of course through my work I hope to make software safer.

Software Needs a Safety Culture

The Wright brothers were more or less two guys in their garage, who flew the first airplane at Kitty Hawk in 1903. Less than 25 years later, in 1926, the Air Commerce Act assigned the Commerce Department responsibility for investigating accidents, at a time when air mail was still a pretty neat idea. The first web browser was released in 1991, and over the last 27 years we’ve built something extraordinarily more complicated on the Internet, with far greater access to and effect on people’s daily lives, without the same kind of investment in safety and accident investigation.

Even within large software companies, we’re only beginning to recognize that safety is a discipline and that we need to invest in it, and we’re struggling to identify and pull in knowledge and expertise from older fields like aviation to help us ensure it.

I believe that we can build safer software systems, even in the presence of asymmetric adversaries, in lean, fast-moving organizations, with massive complexity and the weird geography of networks. We have a lot of work ahead of us, but the same principles apply in software as anywhere else, and we can take a lot of inspiration from how fields like aviation have learned to keep us safe.

 

(P.S. Do I think that the comic is right about the current state of voting software? Absolutely.)

Why I Won’t Work For Facebook

I just sent an unintentionally blistering response to a Facebook recruiter. Having invested the time in writing it, I remembered that I have a very disused blog, and perhaps people reading here would find it useful, either as fodder for your own such messages, or as a snapshot of my concerns regarding Facebook and fascism in America in 2018. If either of these apply to you, enjoy.

Continue reading “Why I Won’t Work For Facebook”