Thursday, December 24, 2020

They’re predictably unpredictable

From Science Fictions by Stuart Ritchie.  Page 62.

“Fortunately, just as it’s a monumentally difficult task to forge a compelling Rembrandt or Vermeer (or a compelling western blot), it’s not at all easy to fake a dataset convincingly. Data pulled out of thin air don’t have the properties we’d expect of data collected in the real world.65 Fundamentally, this is because no science is really an exact science: numbers are noisy. Every time you try to measure anything, you’ll be slightly off from the true value, be it the economic performance of a country, the number of rare orangutans left in the world, the speed of a subatomic particle, or even something as simple as how tall someone is. With height, for instance, the person might be a bit slouched, your tape measure might slip by a fraction of an inch, or you might accidentally write down the wrong number. This is called measurement error, and it’s hard to get around completely, even if there are ways to reduce it. 
 
Measurement error’s equally annoying cousin is sampling error. As scientists we can rarely, if ever, examine every single instance of a phenomenon – no matter whether we’re trying to study a set of cells, or exoplanets, or surgical operations, or financial transactions. Instead, we take samples, and try to generalise from them to the set as a whole (statisticians call the whole set the ‘population’, even if it’s not a set of people). The trouble is, the characteristics of any given sample you take (say, the average height of all the people in your study) are never a precise match to what you really want to know (say, the average height of all the people in the country). Just through the random chance of who was included, every sample will have a marginally different average. And some samples, again just by chance, might be wildly different from the true average in the overall set.
 
Both measurement error and sampling error are unpredictable, but they’re predictably unpredictable. You can always expect data from different samples, measures or groups to have somewhat different characteristics – in terms of the averages, the highest and lowest scores, and practically everything else. So even though they’re normally a nuisance, measurement error and sampling error can be useful as a means of spotting fraudulent data. If a dataset looks too neat, too tidily similar across different groups, something strange might be afoot. As the geneticist J. B. S. Haldane put it, ‘man is an orderly animal’ who ‘finds it very hard to imitate the disorder of nature’, and that goes for fraudsters as much as for the rest of us.”
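A rough way to see the "predictably unpredictable" point for yourself is to simulate it. The sketch below is my own illustration, not anything from the book: it assumes a made-up population of heights (mean 170 cm, standard deviation 10 cm), draws many random samples, and compares how much the sample averages wobble with what statistics says they should (the population standard deviation divided by the square root of the sample size). The looks_too_neat helper and its 0.25 cutoff are hypothetical and arbitrary, a crude flag for group averages that agree more closely than chance would allow rather than a real fraud test.

```python
import random
import statistics

def sample_means(population, n_samples, sample_size, rng):
    """Draw n_samples random samples and return the average of each one."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

def looks_too_neat(group_means, expected_se, ratio=0.25):
    """Crude flag: the group averages vary far less than sampling error predicts.

    The ratio cutoff is an arbitrary illustrative choice, not an established rule.
    """
    return statistics.stdev(group_means) < ratio * expected_se

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 heights in cm (assumed mean 170, sd 10).
population = [rng.gauss(170, 10) for _ in range(100_000)]

sample_size = 25
means = sample_means(population, n_samples=200, sample_size=sample_size, rng=rng)

# How much the sample averages actually vary from sample to sample...
observed_spread = statistics.stdev(means)
# ...versus the spread that sampling error predicts (the standard error).
expected_spread = statistics.pstdev(population) / sample_size ** 0.5

print(f"observed spread of sample means: {observed_spread:.2f} cm")
print(f"expected spread (standard error): {expected_spread:.2f} cm")

# Honest samples should not trip the flag; suspiciously similar averages should.
print("real samples too neat?", looks_too_neat(means, expected_spread))
print("tidy averages too neat?",
      looks_too_neat([170.0, 170.1, 169.9, 170.05], expected_spread))
```

No single sample average can be predicted in advance, but the size of the wobble can, and that is what makes group averages that barely differ from one another look suspicious.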

 
