When information systems fail
I’m reading Megan McArdle’s new book in galleys right now; its title is “The Up Side of Down: Why Failing Well Is the Key to Success”. Given the subject matter, McArdle spends just as much time discussing bad failures as she does discussing good ones — not the things which turned out in the end to be “the best thing that ever happened to me”, but rather the truly catastrophic things which result in wholesale destruction of wealth, health, or people’s lives.
Given the way in which the world is becoming increasingly dominated by complex technological systems, a lot of these failures are going to be technological in nature. Recent publicity, of course, has focused on healthcare.gov — a highly complex website devoted to solving the enormous problem of how millions of Americans will be able to access affordable medical care. And the general tone of the criticism, which is coming from all sides, is that if only the government had ______, then all of these problems, or at least most of them, could have been avoided.
That kind of criticism is always easy in hindsight, which doesn’t make it wrong. Virtually all problems are foreseen by someone, or should have been. But in no organization are all foreseen problems addressed promptly and directly. If they were, then nothing would ever happen. Which means that the real problem is often understood to be a managerial one: the lines of communication weren’t clear enough, the managers didn’t have their priorities right, how on earth could they have been so stupid as to _______.
David Wilson has found a wonderful example in the SEC’s censure of Knight Capital. Knight blew up as a result of badly designed computer systems, and the cascade of mistakes in this case was particularly egregious. It kept important deprecated code on its active servers. It didn’t double-check to ensure that new code was installed correctly. It had no real procedures to ensure that mistakes like this couldn’t happen. It had no ability to work out why something called the 33 Account was filling up with billions of dollars in random stocks, despite the fact that the account in question had a $2 million gross position limit. It seemingly had no controls in place to stop its computers from doing massive naked shorting in the market. And so on and so forth.
In the end, over the course of a 45-minute period, Knight bought $3.5 billion of 80 stocks, sold $3.15 billion of another 74 stocks, and ended up losing a total of $460 million. Markets were definitely disrupted:
As to 37 of those stocks, the price moved by greater than ten percent, and Knight’s executions constituted more than 50 percent of the trading volume. These share price movements affected other market participants, with some participants receiving less favorable prices than they would have in the absence of these executions and others receiving more favorable prices.
Given the size of Knight’s losses, the only silver lining here is that Knight itself was the main actor receiving less favorable prices, while the rest of the market, in aggregate, ended up making more money that day than it otherwise would have done. But the SEC is right to fine Knight all the same, just as it was right to fine JP Morgan for its London Whale losses: legitimate trading losses are fine, but major risk-management failures are not allowed, and need to be punished by more than just trading losses.
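To make the 33 Account point concrete: a position limit only helps if something checks it before each trade is booked. What follows is a minimal sketch of such a check, not a description of Knight's actual systems. The `Account` class, the `apply_fill` method, and the ticker symbols are all hypothetical names invented for illustration; only the $2 million gross limit comes from the account above.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    name: str
    gross_limit: float  # maximum allowed gross position, in dollars
    positions: dict = field(default_factory=dict)  # symbol -> signed dollars

    def gross_position(self) -> float:
        # Gross exposure counts longs and shorts alike.
        return sum(abs(v) for v in self.positions.values())

    def apply_fill(self, symbol: str, signed_dollars: float) -> bool:
        """Book a fill only if the account stays within its gross limit."""
        current = self.positions.get(symbol, 0.0)
        new_gross = (
            self.gross_position() - abs(current) + abs(current + signed_dollars)
        )
        if new_gross > self.gross_limit:
            return False  # reject; a real system would also alert a human
        self.positions[symbol] = current + signed_dollars
        return True


# Hypothetical fills against a $2 million gross limit.
account = Account("33 Account", gross_limit=2_000_000)
print(account.apply_fill("AAA", 1_500_000))   # True: gross is $1.5 million
print(account.apply_fill("BBB", -1_000_000))  # False: gross would be $2.5 million
```

The design choice worth noticing is that the check sits in the same code path as the booking, so a fill cannot be recorded without passing it. A limit that lives only in a policy document, or in a separate report, is one more slice of cheese with a very large hole.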
Or, for a smaller-scale example, look at Dan Tynan’s misadventures with Box. Again, there was a series of management and permissioning failures, which ultimately resulted in Tynan’s entire account being vaporized, along with all its content. As he explains:
- Box handed control over my account to someone who was a complete stranger to me;
- They did it because of a one-time association with someone else, who happened to have access to some of my folders;
- They failed to notify me or any of my other collaborators that they were giving control of my account to someone else;
- They failed to confirm deletion of the account with the person who created it (i.e., me); and
- Box.com support was helpless to do anything about it or give me any information. Had I not pulled the journalist card, I’d still be scratching my head over what had happened.
That’s a lot of mistakes; nearly as many as can be seen in the Knight Capital case. But when you see a list this long, the first thing you should think about is Swiss cheese. Specifically, you should think about the Swiss cheese model of failure prevention, as posited by James Reason, of the University of Manchester:
In the Swiss Cheese model, an organization’s defenses against failure are modeled as a series of barriers, represented as slices of cheese. The holes in the slices represent weaknesses in individual parts of the system and are continually varying in size and position across the slices. The system produces failures when a hole in each slice momentarily aligns, permitting (in Reason’s words) “a trajectory of accident opportunity”, so that a hazard passes through holes in all of the slices, leading to a failure.
In other words, we should maybe be a little bit reassured that so many things needed to go wrong in order to produce a fail. The Swiss cheese model isn’t foolproof: sometimes those holes will indeed align. But a long list of failures like this is evidence of a reasonably thick stack of cheese slices. And in general, the thicker the stack, the less likely failure is going to be.
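The arithmetic behind "the thicker the stack, the less likely failure" can be sketched in a few lines. This assumes, purely for illustration, that each layer fails independently of the others and that we know the chance of a given hazard slipping through each one. Neither assumption comes from Reason's model as quoted above, and real defenses often share causes of failure. The function name and the 10% figure are hypothetical.

```python
from math import prod


def failure_probability(hole_probabilities):
    """Chance that a hazard passes every layer, assuming independent layers."""
    for p in hole_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    # With independence, the chance of passing all layers is the product
    # of the chances of passing each one.
    return prod(hole_probabilities)


# Hypothetical numbers: every layer misses a given hazard 10% of the time.
for layers in (1, 3, 5):
    print(layers, failure_probability([0.1] * layers))
```

Under these assumptions each extra slice cuts the failure rate by a factor of ten, which is why a long list of simultaneous mistakes implies a rare event. The independence assumption is the weak point: if one root cause punches holes in several slices at once, the stack is thinner than it looks.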
That said, there’s an important countervailing force, which militates in favor of more frequent failure, and which is getting steadily larger and scarier: the sheer complexity of all kinds of information systems. I mentioned this when Knight blew up, quoting Dave Cliff and Linda Northrop:
The concerns expressed here about modern computer-based trading in the global financial markets are really just a detailed instance of a more general story: it seems likely, or at least plausible, that major advanced economies are becoming increasingly reliant on large-scale complex IT systems (LSCITS): the complexity of these LSCITS is increasing rapidly; their socio-economic criticality is also increasing rapidly; our ability to manage them, and to predict their failures before it is too late, may not be keeping up. That is, we may be becoming critically dependent on LSCITS that we simply do not understand and hence are simply not capable of managing.
Under this view, it’s important to try to prevent failures by adding extra layers of Swiss cheese, and by assiduously trying to minimize the size of the holes in any given layer. But as IT systems grow in size and complexity, they will fail in increasingly unpredictable and catastrophic ways. No amount of post-mortem analysis, from Congress or the SEC or anybody else, will have any real ability to stop those catastrophic failures from happening. What’s more, it’s futile to expect that we can somehow design these systems to “fail well” and thereby lessen the chances of even worse failures in the future.