Dangerous Correlations

In one of its latest pieces on digital strategy, McKinsey reminds readers of the benefits of big data: ‘Mining data greatly enhances the power of analytics, which leads directly to dramatically higher levels of automation—both of processes and, ultimately, of decisions.’ Amen! Some data fundamentalists go as far as proclaiming no less than the end of science. According to them, algorithms will make the data speak for itself and provide robust rules for prediction and action. Knowing ‘why’ will become superfluous.


Beyond the technical challenge of managing the 3 Vs of volume, velocity and variety that come with big data, one of the most significant pitfalls facing data analytics is the treacherous relationship between correlation (understood here as co-incidence) and causation. A website on ‘spurious correlations’ provides some entertaining illustrations: it shows, for example, that the total revenue generated by arcades correlates with the number of computer science doctorates awarded in the US, and that the number of people who die by becoming tangled in their bedsheets almost perfectly correlates with per capita cheese consumption.


More seriously, a simple mathematical analysis shows that regressing a small set of randomly selected data can yield a seemingly decent explanatory model. In fact, in a paper entitled ‘The deluge of spurious correlations in big data’, the authors demonstrate that the probability of finding misleading (i.e. random) correlations increases with the size of the available database. Worse, when dealing with large sets of numbers, the majority of correlations are spurious. As is often heard and read in the world of big data, ‘Raw data should be cooked with care’.
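The effect is easy to reproduce. The short sketch below (an illustration, not the paper's method; all names and parameters are invented for the example) generates sets of purely random, independent series and counts how many pairs nonetheless show a strong correlation. Every such pair is spurious by construction, and their number grows as the collection of series gets larger:

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

def count_spurious(n_series, n_points, threshold=0.8, seed=42):
    """Count pairs of independent random series whose |correlation|
    exceeds the threshold. All hits are spurious by construction."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_series)]
    hits = 0
    for i in range(n_series):
        for j in range(i + 1, n_series):
            if abs(pearson(series[i], series[j])) > threshold:
                hits += 1
    return hits

# With short series, 'impressive' correlations appear purely by chance,
# and their count rises with the number of variables in the database.
for n in (10, 50, 200):
    print(n, count_spurious(n, n_points=6))
```

With only six observations per series, a non-trivial fraction of random pairs clears the 0.8 bar, so the tally of spurious ‘findings’ scales with the number of variable pairs examined.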


Edward O. Wilson, a renowned American biologist and two-time Pulitzer Prize winner, provides a powerful counterargument in a WSJ article to those suggesting that data is everything: ‘Many of the most successful scientists in the world today are mathematically no more than semiliterate. […] Pioneers in science only rarely make discoveries by extracting ideas from pure mathematics.’ No mathematical prowess will lead to productivity gains without good, common sense. Unfortunately, as Voltaire noted, ‘Common sense is not so common’. Descartes would add: ‘[…] To be possessed of a vigorous mind is not enough; the prime requisite is rightly to apply it. The greatest minds, as they are capable of the highest excellencies, are open likewise to the greatest aberrations.’


The relationship between the quantity of data and the quality of decisions is as flawed as the one that exists between correlation and causation. More data without more common sense is unlikely to bring much good to the world.

