When information systems fail

By Felix Salmon
October 24, 2013

I’m reading Megan McArdle’s new book in galleys right now; its title is “The Up Side of Down: Why Failing Well Is the Key to Success”. Given the subject matter, McArdle spends just as much time discussing bad failures as she does discussing good ones — not the things which turned out in the end to be “the best thing that ever happened to me”, but rather the truly catastrophic things which result in wholesale destruction of wealth, health, or people’s lives.

Given the way in which the world is becoming increasingly dominated by complex technological systems, a lot of these failures are going to be technological in nature. Recent publicity, of course, has focused on healthcare.gov — a highly complex website devoted to solving the enormous problem of how millions of Americans will be able to access affordable medical care. And the general tone of the criticism, which is coming from all sides, is that if only the government had ______, then all of these problems, or at least most of them, could have been avoided.

That kind of criticism is always easy in hindsight, which doesn’t make it wrong. Virtually all problems are foreseen by someone, or should have been. But in no organization are all foreseen problems addressed promptly and directly. If they were, then nothing would ever happen. Which means that the real problem is often understood to be a managerial one: the lines of communication weren’t clear enough, the managers didn’t have their priorities right, how on earth could they have been so stupid as to _______.

David Wilson has found a wonderful example in the SEC’s censure of Knight Capital. Knight blew up as a result of badly-designed computer systems, and the cascade of mistakes in this case was particularly egregious: it kept important deprecated code on its active servers, it didn’t double-check to ensure that new code was installed correctly, it had no real procedures to ensure that mistakes like this couldn’t happen, it had no ability to work out why something called the 33 Account was filling up with billions of dollars in random stocks, despite the fact that the account in question had a $2 million gross position limit, it seemingly had no controls in place to stop its computers from doing massive naked shorting in the market, and so on and so forth.
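To make the missing control concrete, here is a minimal Python sketch of a hard gross position limit of the kind the 33 Account nominally had. It is purely illustrative: the class and method names, the ticker symbols, and the decision to reject a fill outright are all assumptions, and none of it describes how Knight's actual systems were built.

```python
from dataclasses import dataclass, field


class PositionLimitExceeded(Exception):
    """Raised when a fill would push an account past its gross limit."""


@dataclass
class Account:
    # Hypothetical account model: signed dollar positions keyed by symbol.
    name: str
    gross_limit_usd: float
    positions_usd: dict = field(default_factory=dict)

    def gross_position(self) -> float:
        # Gross position counts longs and shorts alike, so use absolute values.
        return sum(abs(v) for v in self.positions_usd.values())

    def apply_fill(self, symbol: str, signed_notional_usd: float) -> None:
        """Book a fill only if the resulting gross position stays within the limit."""
        current = self.positions_usd.get(symbol, 0.0)
        new_position = current + signed_notional_usd
        projected = self.gross_position() - abs(current) + abs(new_position)
        if projected > self.gross_limit_usd:
            raise PositionLimitExceeded(
                f"{self.name}: projected gross ${projected:,.0f} "
                f"exceeds limit ${self.gross_limit_usd:,.0f}"
            )
        self.positions_usd[symbol] = new_position


# Illustrative usage with made-up symbols and a $2 million limit.
account = Account(name="33 Account", gross_limit_usd=2_000_000)
account.apply_fill("AAA", 1_500_000)   # long $1.5 million
account.apply_fill("BBB", -400_000)    # short $0.4 million, gross now $1.9 million
try:
    account.apply_fill("CCC", 200_000)  # would take gross to $2.1 million
except PositionLimitExceeded as err:
    print("rejected:", err)
```

In a real trading system a check like this would more plausibly sit in front of order submission than after a fill, but the principle is the same: the limit is enforced by code that can say no.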

In the end, over the course of a 45-minute period, Knight bought $3.5 billion of 80 stocks, sold $3.15 billion of another 74 stocks, and ended up losing a total of $460 million. Markets were definitely disrupted:

As to 37 of those stocks, the price moved by greater than ten percent, and Knight’s executions constituted more than 50 percent of the trading volume. These share price movements affected other market participants, with some participants receiving less favorable prices than they would have in the absence of these executions and others receiving more favorable prices.

Given the size of Knight’s losses, the only silver lining here is that Knight itself was the main actor receiving less favorable prices, while the rest of the market, in aggregate, ended up making more money that day than it otherwise would have done. But the SEC is right to fine Knight all the same, just as it was right to fine JP Morgan for its London Whale losses: legitimate trading losses are fine, but major risk-management failures are not allowed, and need to be punished by more than just trading losses.

Or, for a smaller-scale example, look at Dan Tynan’s misadventures with Box. Again, there was a series of management and permissioning failures, which ultimately resulted in Tynan’s entire account being vaporized, along with all its content. As he explains:

- Box handed control over my account to someone who was a complete stranger to me;

- They did it because of a one-time association with someone else, who happened to have access to some of my folders;

- They failed to notify me or any of my other collaborators that they were giving control of my account to someone else;

- They failed to confirm deletion of the account with the person who created it (i.e., me); and

- Box.com support was helpless to do anything about it or give me any information. Had I not pulled the journalist card, I’d still be scratching my head over what had happened.

That’s a lot of mistakes; nearly as many as can be seen in the Knight Capital case. But when you see a list this long, the first thing you should think about is Swiss cheese. Specifically, you should think about the Swiss cheese model of failure prevention, as posited by James Reason, of the University of Manchester:

In the Swiss Cheese model, an organization’s defenses against failure are modeled as a series of barriers, represented as slices of cheese. The holes in the slices represent weaknesses in individual parts of the system and are continually varying in size and position across the slices. The system produces failures when a hole in each slice momentarily aligns, permitting (in Reason’s words) “a trajectory of accident opportunity”, so that a hazard passes through holes in all of the slices, leading to a failure.

In other words, we should maybe be a little bit reassured that so many things needed to go wrong in order to produce a fail. The Swiss cheese model isn’t foolproof: sometimes those holes will indeed align. But a long list of failures like this is evidence of a reasonably thick stack of cheese slices. And in general, the thicker the stack, the less likely failure is going to be.
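The arithmetic behind that intuition can be sketched in a few lines of Python. This assumes, unrealistically, that each slice fails independently of the others with a fixed probability; the function name and the 10% figure are illustrative assumptions, not anything taken from Reason's work.

```python
import math


def failure_probability(hole_probs):
    """Chance that a hazard passes through every slice.

    Assumes each slice has a hole in the hazard's path independently,
    with the given probability. An empty stack always fails.
    """
    return math.prod(hole_probs)


# Hypothetical stacks in which every slice has a 10% chance of a hole.
thin_stack = [0.1] * 2
thick_stack = [0.1] * 5

print("2 slices:", failure_probability(thin_stack))
print("5 slices:", failure_probability(thick_stack))
```

Under those assumptions, each extra slice multiplies the chance of failure by the probability of a hole in that slice. When holes are correlated, for instance because one underlying problem opens holes in several slices at once, the simple product understates the risk.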

That said, there’s an important countervailing force, which militates in favor of more frequent failure, and which is getting steadily larger and scarier: the sheer complexity of all kinds of information systems. I mentioned this when Knight blew up, quoting Dave Cliff and Linda Northrop:

The concerns expressed here about modern computer-based trading in the global financial markets are really just a detailed instance of a more general story: it seems likely, or at least plausible, that major advanced economies are becoming increasingly reliant on large-scale complex IT systems (LSCITS): the complexity of these LSCITS is increasing rapidly; their socio-economic criticality is also increasing rapidly; our ability to manage them, and to predict their failures before it is too late, may not be keeping up. That is, we may be becoming critically dependent on LSCITS that we simply do not understand and hence are simply not capable of managing.

Under this view, it’s important to try to prevent failures by adding extra layers of Swiss cheese, and by assiduously trying to minimize the size of the holes in any given layer. But as IT systems grow in size and complexity, they will fail in increasingly unpredictable and catastrophic ways. No amount of post-mortem analysis, from Congress or the SEC or anybody else, will have any real ability to stop those catastrophic failures from happening. What’s more, it’s futile to expect that we can somehow design these systems to “fail well” and thereby lessen the chances of even worse failures in the future.

Comments
13 comments so far

ummm, cheese.

thanx for the shoutout, felix.

dt

Posted by tynanwrites

“the SEC is right to fine Knight all the same, just as it was right to fine JP Morgan for its London Whale losses: legitimate trading losses are fine, but major risk-management failures are not allowed,”

Could someone point me to the language in SEC or FINRA regs where it says major risk management failures are not allowed? Knight’s net loss was the market’s net gain. JPM’s net loss was the market’s net gain. Investors in Knight got mauled by the mega-glitch and are now hurt again by the fine. JPM investors took a meaningful but manageable hit on their trading loss and another meaningful but manageable hit from the fine.

We now have unwritten rules whereby your regulator will shoot you in the foot for shooting yourself in the foot.

Let me use another analogy. I drive my SUV into a telephone pole or worse yet a pedestrian: What kind of fine or punishment am I looking at for my terrible mistake? Nothing. I made a mistake. I did not intend to cause harm. I was not texting, I was not drunk, I was not speeding. I am 100% at fault but the government does not impose a negative outcome on me.

JPM made some terrible decisions in the Whale trade fiasco but absolutely broke no laws… ditto Knight Capital Group. These fines are examples of an overzealous overgrown federal bureaucracy. We fine because we can.

Posted by y2kurtus

And this is why ordinary investors like me are scared to put our money back into the market.

These days somewhere around 60-70% of all trade volume is done by pre-programmed computerized algorithms…with no humans anywhere in sight.

How long before one of these “information systems” accidentally crashes the entire market because of some oversight or coding mistake?

Posted by mfw13

Felix,

Read Proust.

Not Megan McArdle

ACK!

Posted by crocodilechuck

@y2kurtus: you wrote “JPM made some terrible decisions in the Whale trade fiasco but absolutely broke no laws”

That is incorrect. “Laws” include mandatory regulations imposed by statute, which JPM violated without question.

“Through those fines, the bank acknowledged that it violated banking rules by not properly overseeing its trading operations. In legal language, regulators said that the bank engaged in “unsafe and unsound practices.””

Source: http://money.cnn.com/2013/10/16/news/companies/jpmorgan-whale-settlement/

“The Justice Department, even after filing criminal charges against two former JPMorgan traders who allegedly helped conceal the losses, is still investigating whether to bring any action against the bank.”

“The CFTC, in its case, charged the bank with violating a prohibition on manipulative conduct when it traded in the credit default swaps that the bank had built an outsized exposure to by early 2012 and then needed to quickly exit to try to minimize the losses. By selling a huge volume of swaps in a concentrated period, the bank’s traders “recklessly disregarded” the principle that legitimate market forces should set prices, the CFTC said.”

“JPMorgan traders acted recklessly with respect to this fundamental precept by employing an aggressive trading strategy,” the CFTC said in a portion of the settlement to which the bank admitted to.”

Source: http://www.reuters.com/article/2013/10/16/us-jpmorgan-cftc-idUSBRE99F0JW20131016

Posted by SteveHamlin

Have to agree with crocodilechuck.

Looking for insights from Megan “Math is Hard” McArdle is like looking for a Polar Bear in Egypt.

It will only be there if someone else caught it, and delivered it to the zoo.

Posted by Matthew_Saroff

Have to agree with Matthew_Saroff on this one.

Shorter start for a review of the book: “Because at heart they are hedonists, many journalists need a way to finance their lifestyles at a level higher than their current salary or family support affords. This means they either have to grind away writing numerous freelance pieces over many years in addition to working their day jobs, or they can write a best selling non-fiction book and then spend their later years raking in royalties and basking in notoriety.

Having Ivy League credentials, but otherwise an obscure and talent-challenged blogger, Megan McArdle, has apparently opted for the latter career approach, and has a new book out. Robert Wright and Malcolm Gladwell are definitely not looking over their shoulders…”

Posted by Strych09

IT systems fail all the time. they just usually don’t make headlines. and that’s because they are basically an art form, with parts of planning, experience and skill. take any part out, and you are likely heading to a failure. most only affect the users of the system for a time. other failures this year would include an airline, an IT/MEDIA company plus 100s of others. most never make the news

Posted by willid3

@y2kurtus, you say “I drive my SUV into a telephone pole or worse yet a pedestrian … I am 100% at fault but the government does not impose a negative outcome on me.”

Hmm. On my own property, maybe no charges for hitting a pole. But otherwise, I don’t have such superpowers. If I drove my car into a telephone pole on a public roadway, stone cold sober and no weather issues or darting animals or whatever, I’d expect to get charged with reckless driving or something like that and take some points on my record.

And if I hit a pedestrian I can say with some confidence that I should expect several different and related charges even if the lucky victim didn’t sustain a scratch. (Unless I lived in Ottawa, where I understand there’s no penalty for that kind of innocent fun.)

So I’m afraid I simply don’t understand how the analogy could work, unless vehicular violations charges and points don’t count as government-imposed negative outcomes.

Posted by Altoid

The usual model for studying failure in the engineering world involves a web of causality and when something fails, there is talk of the chain of failure, the path through the web which led to disaster.

If you read NTSB reports after an airline crash or close call, you’ll see that the chain of failure often starts early, perhaps with a bit of fog, an overly optimistic assessment, failure to double check some piece of equipment. Then it builds intensity as failure leads to failure leads to failure, usually while well above ground level. The safety engineer’s goal is to break the chain, ideally as soon as possible which is why aviation and other high risk fields are full of checklists, test and safety gear, reams of documentation and big labels.

In the computer security world, they talk about defense in depth, designing so that if one security element fails, others will be in place to maintain security. (Diebold, when it was manufacturing voting machines was notorious for weakness in depth. One blogger discovered that the sample keys shown in their online catalog were the actual master keys for most of their systems.)

In fact, our economy has a certain defense in debt. Kicking out the Glass-Steagall firewall was expensive – try pricing an $800B emergency credit line on the free market – but there were mechanisms in place. Unfortunately, it is much more profitable to operate unsafely and then go on welfare than to operate within safe parameters. We missed a major opportunity to restructure our increasingly inefficient, ineffective and unstable financial system.

(The recent $100 fine imposed on Morgan Stanley was ridiculous. If we treated small scale bank robbers like that, you’d be an idiot not to put a demand for $10,000 in small bills on the back of every deposit slip. After all, you could just pay the $100 fine and keep the rest of the money.)

Posted by Kaleberg

Maybe (hopefully?) this means the end of suffering through Ms McArdle’s ‘work’ in otherwise reputable publications…?

Posted by CDN_Rebel

You are kidding me, yes? Megan McArdle wrote a book about failing upwards? Although, I guess it would come down to her or George W. Bush in terms of qualifications.

Posted by ckbryant

Please don’t refer to Reason’s swiss cheese model, it’s inadequate for explanation and prevention of these sorts of events, and has unfortunately been through a confusing history in the domains in which Reason originally spoke about (aviation, patient safety, power plants, etc.)

As to how indignant we can feel about Knight Capital’s shortcomings, we cannot take the SEC report as a postmortem document or accident investigation. I’ve written a good amount about that here:

http://www.kitchensoap.com/2013/10/29/counterfactuals-knight-capital/

Posted by JohnAllspaw