Big Data’s Big Blind-spots


Yesterday, we discussed how theoretical models built on faulty assumptions can be used to draw biased conclusions. If those models are then picked up without an understanding of their assumptions, expensive mistakes follow. But are empirical models free from such bias, especially when the data set is big enough? Absolutely not.

In an article titled “Big data: are we making a big mistake?” in the FT, author Tim Harford points out that by merely finding statistical patterns in the data, data scientists are focusing too much on correlation and giving short shrift to causation.

But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.
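To make that fragility concrete, here is a minimal simulation (all numbers invented, not from the article): two variables correlate only because a hidden factor drives them both, and once that hidden factor stops influencing one of them, the correlation vanishes and any model built on it breaks down.

import numpy as np

rng = np.random.default_rng(0)

# Regime 1: a hidden driver z moves both x and y, so they correlate strongly.
z = rng.normal(size=10_000)
x = z + rng.normal(scale=0.5, size=10_000)
y = z + rng.normal(scale=0.5, size=10_000)
print("correlation while the hidden driver is active:", round(np.corrcoef(x, y)[0, 1], 2))

# Regime 2: the world changes and x no longer depends on z; the correlation collapses.
x2 = rng.normal(size=10_000)
y2 = z + rng.normal(scale=0.5, size=10_000)
print("correlation after the regime change:          ", round(np.corrcoef(x2, y2)[0, 1], 2))

A pattern miner who only saw regime 1 would confidently predict y from x; without a theory of why the two moved together, there is no warning that the relationship is about to stop working.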

All the problems you had with “small” data exist in “big” data too; they are just tougher to spot. When it comes to data, size isn’t everything: you still need to deal with sample error and sample bias.

For example, it is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. But while we can look at all the tweets, Twitter users are not representative of the population as a whole. According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.
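A toy sketch of the same point, with made-up numbers: if one group is over-represented in the sample, the estimate stays biased no matter how many observations you collect, while a much smaller representative sample lands close to the truth.

import numpy as np

rng = np.random.default_rng(1)

def estimate_support(sample_size, young_share_in_sample):
    # Each respondent is "young" with the given probability; support rates differ by group.
    young = rng.random(sample_size) < young_share_in_sample
    supports = np.where(young,
                        rng.random(sample_size) < 0.30,   # young: 30% support
                        rng.random(sample_size) < 0.60)   # older: 60% support
    return supports.mean()

# In the (hypothetical) full population, 20% are young, so true support is:
true_support = 0.20 * 0.30 + 0.80 * 0.60
print("true support:                     ", true_support)
print("representative sample of 1,000:   ", round(estimate_support(1_000, 0.20), 3))
print("skewed sample of 10,000,000:      ", round(estimate_support(10_000_000, 0.60), 3))

The ten-million-strong but skewed sample settles confidently on the wrong answer, which is exactly why “we have all the tweets” is not the same as “we have the public mood”.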

Worse still, as the data set grows, ever more patterns turn up, and it becomes harder to work out whether any given pattern is statistically significant, i.e., whether it could have emerged purely by chance.
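The sketch below (a hypothetical setup, not from the article) shows the multiple-comparisons trap behind this: screen enough unrelated variables against pure noise and roughly 5 per cent of them will clear the usual p < 0.05 bar purely by chance.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n_obs, n_vars = 1_000, 200
outcome = rng.normal(size=n_obs)                # pure noise, no real signal anywhere
candidates = rng.normal(size=(n_obs, n_vars))   # 200 unrelated "features"

false_hits = 0
for j in range(n_vars):
    r, p = stats.pearsonr(candidates[:, j], outcome)
    if p < 0.05:
        false_hits += 1

print(f"{false_hits} of {n_vars} unrelated variables pass p < 0.05")  # expect about 10

The more variables a big data set lets you test, the more of these meaningless “discoveries” you will make unless you correct for the number of comparisons.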

The whole article is worth a read; plan to spend some time on it: Big data: are we making a big mistake?