Monday, January 4, 2021

Scientists who don’t come to their data with a proper plan can end up analysing themselves into an unreplicable corner

From Science Fictions by Stuart Ritchie.  Page 106. 

It gets worse. So far, I’ve made it sound as though all p-hacking is done explicitly – running lots of analyses and publishing only those that give p-values lower than 0.05. This undoubtedly happens a lot, but the true problem is much trickier. It’s this: even if you just run one analysis, you still need to consider all the analyses you could have run. The statisticians Andrew Gelman and Eric Loken compare the process of doing an unplanned statistical analysis to a ‘garden of forking paths’, from the Jorge Luis Borges short story of that name: at each point where an analytic decision is required, you might choose any one of the many options that present themselves. Each of those choices, as we’ve seen, would lead to slightly different results.72 Unless you’ve set out very specific criteria for what a result favouring your hypothesis would look like, unless you say that you want ‘a p < 0.05 with the variables treated this way under these precise conditions and with these controls’, then you might end up accepting any one of the many possible results as evidence that you’re right. But how do you know the one you ended up with, having followed your unique combination of forking paths, wasn’t a statistical fluke? Even without the trial-and-error of classic p-hacking, then, scientists who don’t come to their data with a proper plan can end up analysing themselves into an unreplicable corner.
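To make Ritchie’s point concrete, here is a small simulation of my own (not from the book). It generates datasets in which there is no real effect at all, tries a handful of reasonable-looking analytic choices on each one, and keeps whichever gives the smallest p-value. The particular forks (trimming outliers, looking within subgroups) are just hypothetical examples of the kinds of decisions an analyst might face.

```python
# A minimal sketch (mine, not Ritchie's) of the garden of forking paths:
# with NO true effect, trying several defensible analytic choices and keeping
# the best one pushes the false-positive rate well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n = 2000, 30
hits = 0

for _ in range(n_studies):
    # Two groups drawn from the same distribution: any 'effect' is pure noise.
    a, b = rng.normal(size=n), rng.normal(size=n)
    grp_a, grp_b = rng.integers(0, 2, size=n), rng.integers(0, 2, size=n)

    p_values = [stats.ttest_ind(a, b).pvalue]              # fork 1: plain t-test
    a_t = a[np.abs(a - a.mean()) < 2 * a.std()]            # fork 2: trim 'outliers'
    b_t = b[np.abs(b - b.mean()) < 2 * b.std()]
    p_values.append(stats.ttest_ind(a_t, b_t).pvalue)
    for s in (0, 1):                                       # forks 3-4: subgroup analyses
        p_values.append(stats.ttest_ind(a[grp_a == s], b[grp_b == s]).pvalue)

    if min(p_values) < 0.05:                               # keep the 'best' result
        hits += 1

print(f"At least one p < 0.05 in {hits / n_studies:.0%} of null studies")
```

Each individual fork is a perfectly ordinary analysis; the inflation comes entirely from the freedom to choose among them after seeing the data.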
 
Why unreplicable? Because when they reach each fork in the path, the scientist is being strung along by the data: making a choice that looks like it might lead to p < 0.05 in that dataset, but that won’t necessarily do the same in others. This is the trouble with all kinds of p-hacking, whether explicit or otherwise: they cause the analysis to – using the technical term – overfit the data.  In other words, the analysis might describe the patterns in that specific dataset well, but those patterns could just be noisy quirks and idiosyncrasies that won’t generalise to other data, or to the real world. This is useless. Most of the time we’re not interested in the workings of one particular dataset (we don’t want to know ‘what is the link between taking antipsychotic drugs and schizophrenia symptoms measured in this specific sample of 203 people between April and May 2019 in Denver, Colorado?’) – we’re looking for generalisable facts about the world (‘what is the link between taking antipsychotic drugs and schizophrenia symptoms in humans in general?’).
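Overfitting is easy to demonstrate directly. The sketch below (again mine, not from the book) fits both a straight line and a very flexible polynomial to one small noisy sample; the flexible model describes that particular sample almost perfectly, yet typically does worse than the simple line when the outcomes are re-measured, i.e. on ‘other data’ from the same process.

```python
# A minimal sketch (mine, not from the book) of overfitting: a flexible model
# that chases the noise in one sample fails to generalise to a fresh sample
# drawn from the same underlying process.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)

def sample_y(x):
    # The true relationship is a simple straight line plus noise.
    return 2 * x + rng.normal(scale=0.5, size=x.size)

y_train = sample_y(x)   # the dataset we happen to have
y_new = sample_y(x)     # 'other data' from the same world

for degree in (1, 10):
    coeffs = np.polyfit(x, y_train, degree)     # fit a polynomial of this degree
    fit = np.polyval(coeffs, x)
    print(f"degree {degree:2d}: "
          f"in-sample MSE {np.mean((fit - y_train) ** 2):.2f}, "
          f"new-sample MSE {np.mean((fit - y_new) ** 2):.2f}")
# The degree-10 fit hugs the training noise (small in-sample error) yet
# typically loses to the plain straight line on the fresh sample.
```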

 
