One year I spent a lot of time with professional magicians. A few showed me the secrets to their tricks. Whenever they did, the skill and dexterity required for sleight-of-hand struck me as far more impressive than the idea that magic had been performed. It reminded me of my own experience with statistics.
Data analysis is a lot like performing magic. With great skill you can pull things together and create the perception of surprising relationships. Often the magic lies in getting people to look at one thing when they should be looking at another. It is similar with statistics: often the interesting part is not the correlation but what you did to find it.
This is important to keep in mind as the world embarks on the big data revolution. Big data refers to the very large data sets, collected by governments, corporations, and institutions, that are becoming more widely available. Using this data, firms and policymakers can figure out which programs work (such as which health treatments are effective and which people respond to government incentives) and what consumers want. The deluge of information is expected to increase efficiency and lower prices. In a recent report, the McKinsey Global Institute encourages the increased availability of big data. It estimates that greater access to big data has the potential to create $3 trillion a year in value.
It is generally true that more information is better, though big data comes at a cost in terms of privacy and data collection. Yet what concerns me is the proper interpretation of big data. An earlier McKinsey report addresses this issue. It notes a dearth of trained statisticians, estimating that America is short 140,000 to 190,000 workers with the skills to handle data. But lack of talent is not just an impediment; it’s a potential source of danger. People, even those who know better, often take correlations literally and make decisions based on them, without appreciating the magic behind the numbers.
Interpreting data is more of an art than a science. But unlike magicians, most researchers do not intentionally mislead people. Two big concerns when you run statistics are bias (over- or understating a relationship) and mistaking correlation for causation (concluding that X causes Y when the two merely tend to occur at the same time). You might get biased results by using either the wrong data or an inappropriate estimation technique. Minimizing bias requires making subjective judgments. If you ran numbers on a large data set without inspecting it, removing outliers, and choosing the best model, you’d have much more bias than if you used some discretion.
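
To make the point about outliers concrete, here is a minimal Python sketch. The numbers are invented purely for illustration, and the helper name `pearson_r` is hypothetical rather than drawn from any report mentioned here. Eight points with almost no relationship are joined by a single extreme point, and the measured correlation moves from roughly zero to nearly one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: y bounces around with no real trend in x.
x_clean = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [4, 2, 5, 3, 5, 2, 4, 3]

# The same data plus one extreme observation (say, a data-entry error).
x_outlier = x_clean + [50]
y_outlier = y_clean + [50]

r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_outlier, y_outlier)

print(f"correlation without the outlier: {r_clean:.2f}")
print(f"correlation with the outlier:    {r_outlier:.2f}")
```

Nothing in the arithmetic is wrong in either case. The difference comes entirely from a judgment call about whether that ninth point belongs in the data, which is the kind of discretion that never shows up in the final number.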