Should we believe more in Big Data or in magic?

By Allison Schrager
November 6, 2013

One year I spent a lot of time with professional magicians. A few showed me the secrets to their tricks. Whenever they did, the skill and dexterity required for sleight-of-hand struck me as far more impressive than the idea that magic had been performed. It reminded me of my own experience with statistics.

Data analysis is very similar to performing magic. With great skill you can pull things together and create the perception of surprising relationships. Often the magic lies in getting people to look at one thing when they should be seeing another. Similarly with statistics, it’s often not the correlation that’s interesting but what you did to find it.

This is important to keep in mind as the world embarks on the big data revolution. Big data refers to the very large data sets, collected by governments, corporations, and other institutions, that are becoming more widely available. Using these data, firms and policymakers can figure out which programs work (which health treatments succeed, which people respond to government incentives) and what consumers want. The deluge of information is expected to increase efficiency and lower prices. In a recent report, the McKinsey Global Institute encourages making big data more available, estimating that greater access to it has the potential to create $3 trillion a year in value.

It is generally true that more information is better, though big data comes at a cost, both to privacy and in the effort of collecting it. What concerns me more is the proper interpretation of big data. An earlier McKinsey report addresses this issue: it notes a dearth of trained statisticians, estimating that America is short 140,000 to 190,000 workers with the skills to handle data. But the lack of talent is not just an impediment; it’s a potential source of danger. People, even those who know better, often take correlations literally and make decisions based on them, without appreciating the magic behind the numbers.

Interpreting data is more of an art than a science. But unlike magicians, most researchers do not intentionally mislead people. Two big concerns when you run statistics are bias (over- or understating a relationship) and mistaking correlation for causation (believing X causes Y when the two merely tend to occur together). You can get biased results by using the wrong data or an inappropriate estimation technique. Minimizing bias requires making subjective judgments: if you ran numbers on a large data set without inspecting it, removing outliers, and choosing the best model, you would end up with much more bias than if you used some discretion.
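To see how much damage skipping that discretion can do, here is a minimal sketch in Python. The numbers are randomly generated purely for illustration, not drawn from any real data set; the point is that a handful of unexamined outliers can conjure a strong correlation out of two unrelated variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two genuinely unrelated variables (randomly generated for illustration).
x = rng.normal(size=200)
y = rng.normal(size=200)

# The same sample contaminated with three extreme, unexamined outliers.
x_bad = np.append(x, [8.0, 9.0, 10.0])
y_bad = np.append(y, [8.0, 9.0, 10.0])

print(np.corrcoef(x, y)[0, 1])          # near 0: no real relationship
print(np.corrcoef(x_bad, y_bad)[0, 1])  # sizable: an artifact of three points
```

A careless analysis reports the second number; a careful one plots the data first and sees exactly where the “relationship” comes from.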

The process is complicated by human nature. It is easy to be seduced by your own results when they validate your prior expectation of what you’ll find. Take the financial crisis, in which bad statistics played a large role. Many quants priced exotic housing securities using models fed with data from periods and areas where house prices never fell. This made the price of risk look very attractive, but the products couldn’t remain viable once house prices did fall. In most cases the oversight was not intentional; it reflected the data available and the industry standard at the time. With no significant drop in housing prices in recent memory, it was an easy mistake to make.
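The underlying mistake can be sketched in a few lines. The figures below are invented stand-ins for boom-era price changes, not any actual pricing model, but they show why a model fed only rising prices treats a decline as nearly impossible:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Hypothetical annual house-price changes drawn only from a boom period
# (invented numbers): every observation the model ever sees is a gain.
boom_changes = rng.uniform(0.04, 0.10, size=1000)

# In this sample, a price decline has literally never happened...
print((boom_changes < 0).mean())  # 0.0

# ...so a normal model fit to the same data prices the chance of a down
# year at essentially nothing.
mu, sigma = boom_changes.mean(), boom_changes.std()
p_decline = 0.5 * (1.0 + erf((0.0 - mu) / (sigma * sqrt(2.0))))
print(p_decline)  # on the order of 1e-5
```

Feed such a model a market where prices can actually fall, and the “attractive” price of risk evaporates.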

Often what’s most interesting isn’t the statistical relationship itself, but the data that was required to find it. Take the oft-cited statistic that American life expectancy is lower than that of many other OECD countries. That would suggest that American healthcare is not as successful as other systems. But when you look more deeply at the data, a different story emerges. Once you account for people who died from injury (like violence or car accidents) or obesity-related disease, American life expectancy is similar to Canada’s. America’s lower life expectancy is alarming and should get the attention of policymakers. But to remedy it, we need to understand what’s causing more car fatalities and obesity, and what factors — like poverty or arcane drug laws — lead to so much violence. American healthcare is certainly inefficient, but depending on how you parse the data, it’s not clear that it’s delivering worse results in terms of mortality compared to other OECD countries.
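The arithmetic behind that kind of adjustment is simple enough to sketch. The ages below are invented, chosen only to show how a composition effect works, and mean age at death is used as a crude stand-in for life expectancy:

```python
import numpy as np

# Hypothetical ages at death for two countries (invented numbers).
# Both have identical deaths from disease; country A also has a few
# early deaths from injury (violence, car accidents).
disease_deaths = np.full(95, 78.0)
injury_deaths_a = np.array([22.0, 28.0, 31.0, 35.0, 39.0])

country_a = np.concatenate([disease_deaths, injury_deaths_a])
country_b = disease_deaths

print(country_a.mean())  # ~75.7: the headline gap makes A look worse
print(country_b.mean())  # 78.0

# Exclude the injury deaths, which the healthcare system does little to
# prevent, and the two systems deliver identical results.
is_injury = np.concatenate([np.zeros(95, dtype=bool), np.ones(5, dtype=bool)])
print(country_a[~is_injury].mean())  # 78.0
```

The same totals, parsed differently, support opposite conclusions about the healthcare system; that is the magician’s misdirection at work.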

Such examples may seem straightforward, but in practice they are hard to spot, even for the most experienced and well-intentioned professionals. That’s why, in academia, statistical work undergoes a rigorous peer-review process. In the same way it takes a magician to discern an impressive trick from a dirty one, it takes a community with the same expertise to spot sources of bias. But expert peer review won’t be realistic as data becomes more widely available and commercially used. It should be a serious concern that people without adequate experience might unknowingly produce biased results and make important decisions based on them.

But the use of big data is worth the risk. Statistical analysis is an imperfect process, yet it’s all we have to make sense of big data. Any new, transformative innovation can be taken too far or used incorrectly; the same can be said of cars, airplanes, or new financial products. The benefits of more innovation and information usually outweigh the costs, and we can minimize the risks with greater awareness of a new technology’s limitations. McKinsey advocates more training and apprenticeships so we have more people who can run and manage data. That is certainly necessary, but not sufficient. We must also view any statistical result with the same humility and skepticism we experience when we see a magic trick.

PHOTO: A magician performs with a model who presents a creation by Indian designer Manish Arora as part of his Fall-Winter 2011/2012 women’s ready-to-wear fashion collection during Paris Fashion Week March 3, 2011.  REUTERS/Benoit Tessier 

3 comments


Excellent point on common errors by those seeking to unfavorably compare America’s health care system to that of other economies with “socialized health care”.

Posted by OneOfTheSheep

Excellent – read and acknowledged.

(One point you mention, which might be emphasized just a little more: Léo Apotheker did an opinion piece on Reuters.com about four years ago, when he worked for SAP, extolling the benefits of databases and talking up the huge market growth potential of this technology, e.g. for emerging applications such as the accounting of carbon-credits. I pointed out at the time that it’s no use having $50M worth of database hardware + software crunching numbers for carbon credits, if the data being entered into the system are fraudulently manipulated, or if we only include the carbon sources we have already imagined might exist! My comment was motivated by personal experience of course: I’ve been fired from several jobs for refusing to lie to creditors or for refusing to manipulate raw data to give the desired result in “statistical” reports feeding the Key Performance Indicators! I’m tired of the naivety expressed when people trust a database just because it’s big, expensive and complicated and runs on an ostensibly infallible “computer”; or when colourful pie-charts are displayed. While I’m sure database technology still has huge growth potential, in the case of carbon credits, it pays to “think bigger”: we might instead measure CO₂ concentrations/weather patterns/emissions with orbiting satellites: this method is somewhat less precise but more accurate overall when you consider human factors!)

Posted by matthewslyman

……how many times have you seen a co-worker accept a brand new, exciting report as god’s truth because it came out of a computer.

…..and the blank look you get when you ask them if they verified any of the input data or tested the conclusions.

it has to be true…….it came out of a computer, see!!

Posted by Robertla