Saturday, December 26, 2020

Data Talks

House in Tuscany, 1903 by Hans Emmenegger



Friday, December 25, 2020

The range of the data (the difference between the highest and lowest scores) was nearly identical, although the groups were otherwise quite different.

From Science Fictions by Stuart Ritchie.  Page 63.

This kind of reasoning is what caught out social psychologists Lawrence Sanna and Dirk Smeesters in 2011. Sanna published a study in which he claimed to find that people are more prosocial when standing at higher elevations; Smeesters claimed to show that seeing the colours red and blue affects how people think about celebrities.  The results in both papers looked impressive at first glance, easily confirming their proposed theories about human behaviour. But a closer look revealed something distinctly odd. The psychologist Uri Simonsohn showed that in the various groups in Sanna’s experiment, the range of the data (the difference between the highest and lowest scores) was nearly identical, although the groups were otherwise quite different. Simonsohn calculated that the chances of this happening in real data were minuscule. It was the same for Smeesters, except it was the averages of his groups that were too similar; again, these similarities just weren’t consistent with what would happen in real data, where error would have nudged the numbers further apart.  Once these problems, among others, were exposed, the offending papers were retracted, and both researchers resigned in disgrace. 
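Simonsohn's actual analysis isn't reproduced in the excerpt, but the logic is easy to simulate. The sketch below, with all parameters invented for illustration, draws several groups from genuinely different distributions and counts how often their ranges come out nearly identical purely by chance:

```python
# A toy version of the logic described above, not Simonsohn's actual code.
# Group sizes, spreads, and the "nearly identical" tolerance are all
# invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def ranges_nearly_identical(n_groups=4, n_per_group=15, tol=0.1):
    """Draw groups with different means; report whether every group's
    range (max minus min) lands within `tol` of all the others."""
    means = rng.uniform(0, 10, size=n_groups)   # the groups really differ
    spans = [np.ptp(rng.normal(m, 2.0, size=n_per_group)) for m in means]
    return max(spans) - min(spans) < tol

trials = 100_000
hits = sum(ranges_nearly_identical() for _ in range(trials))
print(f"near-identical ranges in {hits / trials:.4%} of simulated datasets")
```

Under honest noise the ranges almost never line up this closely; a dataset where they do, group after group, is exactly the kind of red flag Simonsohn spotted.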

These kinds of statistical red flags are analogous to what makes your bank freeze your credit card after it’s suddenly used to spend large sums on a tropical cruise: unusual activity that’s out of line with normal expectations, and which might be due to fraud. And there are a host of other features of fraudulent data that might cause readers to become suspicious when they dig into the details. The dataset might look a little too immaculate, for example, with too few missing datapoints, which come about for all sorts of reasons in real datasets: participants dropping out of the study or instruments failing, for example. Perhaps the distribution of numbers might not follow certain expected mathematical rules. Or the effects might be vastly larger than seems plausible in the real world, and thus too good to be true.
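Ritchie doesn't name the "expected mathematical rules" here, but a classic one used in fraud detection is Benford's law: in many naturally occurring datasets the leading digit d appears with frequency log10(1 + 1/d), so 1 leads about 30% of the time and 9 under 5%. A minimal first-digit check might look like this (the datasets below are synthetic, purely to show the shape of the check):

```python
# A sketch of a Benford's-law first-digit check; the "real-ish" and
# "fake-ish" datasets are fabricated for the demo.
import math
import random
from collections import Counter

def first_digit_frequencies(values):
    """Observed frequency of each leading digit 1-9 (zeros skipped)."""
    digits = [int(str(abs(v)).lstrip("0.").replace(".", "")[0])
              for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

random.seed(0)
# Quantities that grow multiplicatively tend to follow Benford's law;
# numbers typed in uniformly at random do not.
growth_like = [100 * 1.07 ** random.uniform(0, 100) for _ in range(5000)]
uniform_ish = [random.uniform(100, 999) for _ in range(5000)]

for name, data in (("growth-like", growth_like), ("uniform", uniform_ish)):
    obs = first_digit_frequencies(data)
    deviation = sum(abs(obs[d] - benford[d]) for d in range(1, 10))
    print(f"{name}: total deviation from Benford = {deviation:.3f}")
```

Real forensic checks use formal statistical tests rather than a raw deviation score, but the suspicious pattern shows up either way.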

 

History

 

An Insight

 

I see wonderful things

 

Offbeat Humor

 

Data Talks

 

Christmas Holidays by Trevor Mitchell



Thursday, December 24, 2020

They’re predictably unpredictable

From Science Fictions by Stuart Ritchie.  Page 62.

Fortunately, just as it’s a monumentally difficult task to forge a compelling Rembrandt or Vermeer (or a compelling western blot), it’s not at all easy to fake a dataset convincingly. Data pulled out of thin air don’t have the properties we’d expect of data collected in the real world. Fundamentally, this is because no science is really an exact science: numbers are noisy. Every time you try to measure anything, you’ll be slightly off from the true value, be it the economic performance of a country, the number of rare orangutans left in the world, the speed of a subatomic particle, or even something as simple as how tall someone is. With height, for instance, the person might be a bit slouched, your tape measure might slip by a fraction of an inch, or you might accidentally write down the wrong number. This is called measurement error, and it’s hard to get around completely, even if there are ways to reduce it.
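To see measurement error, and one standard way of reducing it (averaging repeated measurements), in miniature, here's a toy simulation; the "true" height and the noise level are invented:

```python
# A toy illustration of measurement error: a "true" height of 170.0 cm
# measured with invented noise of sd 0.5 cm, averaged over k readings.
import numpy as np

rng = np.random.default_rng(1)
true_height = 170.0
noise_sd = 0.5

for k in (1, 4, 16, 64):
    # Each trial: measure k times, record the error of the averaged reading.
    errors = [abs(rng.normal(true_height, noise_sd, size=k).mean() - true_height)
              for _ in range(10_000)]
    print(f"{k:3d} measurements: typical error ≈ {np.mean(errors):.3f} cm")
```

Averaging k noisy readings shrinks the typical error by a factor of roughly the square root of k, which is why "measure twice" is good advice.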
 
Measurement error’s equally annoying cousin is sampling error. As scientists we can rarely, if ever, examine every single instance of a phenomenon – no matter whether we’re trying to study a set of cells, or exoplanets, or surgical operations, or financial transactions. Instead, we take samples, and try to generalise from them to the set as a whole (statisticians call the whole set the ‘population’, even if it’s not a set of people). The trouble is, the characteristics of any given sample you take (say, the average height of all the people in your study) are never a precise match to what you really want to know (say, the average height of all the people in the country). Just through the random chance of who was included, every sample will have a marginally different average. And some samples, again just by chance, might be wildly different from the true average in the overall set.
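The same idea is easy to demonstrate for sampling error. The sketch below builds an artificial "population" (the parameters are invented), repeatedly draws small samples from it, and shows how far the sample averages wander from the true one:

```python
# A toy demonstration of sampling error. The population parameters
# (mean 170 cm, sd 10 cm) and the sample size are invented.
import numpy as np

rng = np.random.default_rng(2)
population = rng.normal(170, 10, size=1_000_000)   # the whole 'population'

sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print(f"population mean: {population.mean():.2f} cm")
print(f"spread of sample means (sd): {np.std(sample_means):.2f} cm")
print(f"most extreme samples: {min(sample_means):.2f} and "
      f"{max(sample_means):.2f} cm")
```

Most sample averages land within a centimetre or two of the truth, but a few, just by chance, sit several centimetres away: exactly the occasional wild sample the excerpt warns about.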
 
Both measurement error and sampling error are unpredictable, but they’re predictably unpredictable. You can always expect data from different samples, measures or groups to have somewhat different characteristics – in terms of the averages, the highest and lowest scores, and practically everything else. So even though they’re normally a nuisance, measurement error and sampling error can be useful as a means of spotting fraudulent data. If a dataset looks too neat, too tidily similar across different groups, something strange might be afoot. As the geneticist J. B. S. Haldane put it, ‘man is an orderly animal’ who ‘finds it very hard to imitate the disorder of nature’, and that goes for fraudsters as much as for the rest of us.
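Putting the two errors to forensic use, as in the Smeesters case above, amounts to asking: given honest noise, how often would several group averages land as close together as the reported ones? Here is a sketch, loosely modelled on the simulation logic Simonsohn described, with all numbers invented:

```python
# A sketch of the "too similar to be true" check: simulate honest datasets
# and see how often group means cluster as tightly as a (made-up) reported
# spread. Group count, sizes, sd, and the reported spread are all invented.
import numpy as np

rng = np.random.default_rng(3)

def spread_of_means(n_groups=6, n=20, sd=1.0):
    """Standard deviation across group means for one simulated dataset."""
    return np.std([rng.normal(0.0, sd, size=n).mean() for _ in range(n_groups)])

reported_spread = 0.02   # suspiciously tight spread (invented for the demo)
sims = np.array([spread_of_means() for _ in range(20_000)])
p = (sims <= reported_spread).mean()
print(f"probability of means this similar by chance: {p:.5f}")
```

With honest sampling error, the spread across group means almost never collapses this far; a published dataset where it does is the statistical equivalent of the frozen credit card.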