Very good material.
Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.

Interesting. I wasn't aware of the term “multiple-comparisons problem”. Reading up on it now, I understand it as partly addressing an issue that, in my ignorance, I have treated rather cavalierly. The issue arises more when you go prospecting for patterns than when you are testing a predetermined hypothesis.
[snip]
But while big data promise much to scientists, entrepreneurs and governments, they are doomed to disappoint us if we ignore some very familiar statistical lessons.
“There are a lot of small data problems that occur in big data,” says Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff. They get worse.”
[snip]
Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not.
[snip]
In 2005, John Ioannidis, an epidemiologist, published a research paper with the self-explanatory title, “Why Most Published Research Findings Are False”. The paper became famous as a provocative diagnosis of a serious issue. One of the key ideas behind Ioannidis’s work is what statisticians call the “multiple-comparisons problem”.
It is routine, when examining a pattern in data, to ask whether such a pattern might have emerged by chance. If it is unlikely that the observed pattern could have emerged at random, we call that pattern “statistically significant”.
The multiple-comparisons problem arises when a researcher looks at many possible patterns. Consider a randomised trial in which vitamins are given to some primary schoolchildren and placebos are given to others. Do the vitamins work? That all depends on what we mean by “work”. The researchers could look at the children’s height, weight, prevalence of tooth decay, classroom behaviour, test scores, even (after waiting) prison record or earnings at the age of 25. Then there are combinations to check: do the vitamins have an effect on the poorer kids, the richer kids, the boys, the girls? Test enough different correlations and fluke results will drown out the real discoveries.
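To make this concrete for myself, I sketched a tiny simulation (mine, not the article's): the vitamins do absolutely nothing, yet testing twenty unrelated outcomes at the conventional p < 0.05 threshold will, on average, still turn up a fluke “significant” effect. The group sizes and the number of outcomes are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

group_size = 100   # children per arm (vitamins vs. placebo)
n_outcomes = 20    # height, weight, tooth decay, test scores, ...
alpha = 0.05

false_positives = 0
for _ in range(n_outcomes):
    # The vitamins do nothing: both groups are drawn from the same distribution.
    vitamin_group = rng.normal(loc=0.0, scale=1.0, size=group_size)
    placebo_group = rng.normal(loc=0.0, scale=1.0, size=group_size)
    _, p_value = stats.ttest_ind(vitamin_group, placebo_group)
    if p_value < alpha:
        false_positives += 1

print(f"'Significant' effects found by pure chance: {false_positives} of {n_outcomes}")
```

Run it with different seeds and you will usually find one or two “discoveries”, every one of them noise.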
There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.
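A back-of-envelope calculation (my arithmetic, not the article's) shows how quickly the noise piles up. If each of m independent null comparisons has a 5 per cent chance of producing a fluke “significant” result, the chance of at least one fluke is 1 - 0.95^m, and the expected number of flukes is 0.05m:

```python
alpha = 0.05

for m in (1, 10, 100, 1_000, 10_000):
    # Probability of at least one spurious "significant" finding among
    # m independent comparisons where nothing real is going on.
    p_any_fluke = 1 - (1 - alpha) ** m
    # Expected number of spurious findings.
    expected_flukes = alpha * m
    print(f"{m:>6} comparisons: P(at least one fluke) = {p_any_fluke:.3f}, "
          f"expected flukes ~ {expected_flukes:g}")
```

By a hundred comparisons a fluke is all but guaranteed; by ten thousand you expect five hundred of them. Corrections such as Bonferroni (dividing the significance threshold by the number of comparisons) are among the “various ways to deal with this”, but they presuppose you know how many comparisons you are making, which is exactly what open-ended pattern-prospecting obscures.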
[snip]
“We have a new resource here,” says Professor David Hand of Imperial College London. “But nobody wants ‘data’. What they want are the answers.”
[snip]
Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives, as with Target’s pregnancy predictor. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries.
Without working through the maths in full, I have taken a relatively simple approach. Say you want to see whether there is any correlation between school inputs and child life outcomes (say, grade average). Roughly 350 is the number commonly used for an adequate sample size in a population with multiple attributes. Say you are interested in whether kids from higher-income families do better than kids from poorer families. You run the numbers and discover that of the 350, 75 count as poor and 75 count as high income. You then check against the actuarial tables and find that this is pretty close to the distribution in the population at large. Great. But now you are actually interested in two compounded factors: income AND grades. So you have to increase your sample size to roughly 1,633 (= 350 × 350/75) to make sure that the sub-population (the 75 who are rich or poor, together with their grades) is large enough to be representative in its own right. You only have to add two or three more variables with relatively small effect sizes for the required sample to balloon into very large numbers, as the sketch below works through.
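Here is the same arithmetic written out as a sketch. The figures are my illustrative ones (a base sample of 350 and a subgroup of 75), not standards from any textbook, and the second split of the same size is hypothetical, added only to show how the requirement compounds.

```python
base_sample = 350          # illustrative "adequate" sample for one question
subgroup_sizes = [75, 75]  # e.g. the poor kids, then a second similar split

required = base_sample
for sub in subgroup_sizes:
    # Each subgroup must itself reach the base sample size, so the total
    # scales up by the inverse of that subgroup's share of the sample.
    required *= base_sample / sub
    print(f"After splitting on a subgroup of {sub}: about {required:,.0f} needed")
```

One split takes you from 350 to roughly 1,633; a second similar split takes you past 7,600. A handful of modest variables is enough to push the required sample into the tens of thousands.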
In human systems, where outcomes are typically the result of innumerable inputs and contingent events, the sampling problem quickly overwhelms the nuance of whatever we are seeking to determine.