Tuesday, April 6, 2021

We found that there was actually a big difference between these groups!

An excellent piece, Two Unexpected Multiple Hypothesis Testing Problems, from Astral Codex Ten (the old Slate Star Codex, before the deplorable de-platforming behavior of the New York Times).

This is one of my major irritants when seeing mainstream media report on studies - they either don't understand statistics or choose to ignore them.

The typical example goes something like this.  They want to know the answer to a particular question.  They survey some 1,200 people across the nation.  The respondents aren't necessarily all that random.  The surveying company asks a few additional questions while they have them on the phone.  Then they close out with some basic demographic profile information such as age, race, political registration, religion, state, level of educational attainment, and any of a number of other factors.

If all they did was report the answer to that one central question, the result would be marginally adequate, undermined primarily by the poor randomization.

But most companies, once they have the pool of data, will run analyses on additional factors.  Not just whether (for example) the respondent set supports additional international policing obligations.  Then they want to see how that differs by region.  Then by race.  Then by party registration.  All good questions.  And all inappropriate for a poorly randomized sample of 1,200.

If you want to know the opinion of African-American, Democratic-registered, Southeastern-living respondents at that degree of detail, then you need something like 1,200 respondents with all those attributes.  But they don't do that.  They may have only 30 respondents who are African-American, Democratic-registered, living in the Southeast, and willing to express an opinion.  That is in no way a representative sample size.
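To put rough numbers on that, here is a minimal back-of-the-envelope sketch (my own illustration, assuming an ideal simple random sample and a worst-case 50/50 split) of how the 95% margin of error on a single yes/no question balloons as the subgroup shrinks:

    # Approximate 95% margin of error for a sample proportion.
    # Assumes a simple random sample and worst-case p = 0.5.
    import math

    def margin_of_error(n, p=0.5, z=1.96):
        return z * math.sqrt(p * (1 - p) / n)

    for n in (1200, 300, 100, 30):
        print(f"n = {n:4d}: +/- {margin_of_error(n) * 100:.1f} percentage points")

    # n = 1200: +/- 2.8 points;  n = 30: +/- 17.9 points

A plus-or-minus 3 point answer from the full 1,200 is at least usable; a plus-or-minus 18 point answer from the 30-person subgroup tells you almost nothing.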

This happens all the time with the mainstream media.  As long as the answer is supportive of the narrative or is so interesting as to attract eyeballs, then they will run with it, statistical validity be damned.  

And that is just the easy stuff.

It is very hard to design rigorous statistical studies, ensure true randomization, and obtain a large enough sample size to provide high-confidence answers.  Oh, and expensive.

Hence the plethora, perhaps the supermajority, of quick-and-dirty studies with low randomization, low sample size, non-registration of methodology, etc.  These studies give suggestive answers but not high-confidence answers.
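For a sense of just how expensive, here is a rough sketch (illustrative numbers of my own, using the standard normal-approximation formula for comparing two proportions at a two-sided alpha of 0.05 and 80% power) of the sample sizes a weak but real effect demands:

    # Approximate per-group sample size to detect a difference between two
    # proportions (normal approximation, two-sided alpha = 0.05, 80% power).
    import math

    def n_per_group(p1, p2):
        z_alpha, z_beta = 1.96, 0.84   # standard z-scores for alpha = 0.05, power = 0.80
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

    print(n_per_group(0.10, 0.08))   # ~3,200 per arm to see a 10% -> 8% drop
    print(n_per_group(0.10, 0.05))   # ~430 per arm for a much larger effect

Thousands of properly randomized subjects per arm just to pin down a modest effect is exactly the kind of budget a quick-and-dirty study never has.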

Astral Codex Ten in this article is discussing something rather deeper: how do you anticipate in advance the different ways in which your results might be shown to be non-meaningful owing to poor statistical controls?  This is not a "have they no shame" issue.  This is a "robust statistical analysis is really hard" issue.

The story so far: some people in Cordoba did a randomized controlled trial of Vitamin D for coronavirus. The people who got the Vitamin D seemed to do much better than those who didn’t. But there was some controversy over the randomization.

[snip]

Remember, we want to randomly create two groups of similar people, then give Vitamin D to one group and see what happens. If the groups are different to start with, then we won't be able to tell if the Vitamin D did anything or if it was just the pre-existing difference. In this case, they checked for fifteen important ways that the groups could be different, and found they were only significantly different on one - blood pressure.

Jungreis and Kellis, two scientists who support this study, say that shouldn't bother us too much. They point out that because of multiple testing (we checked fifteen hypotheses), we need a higher significance threshold before we care about significance in any of them, and once we apply this correction, the blood pressure result stops being significant. Pachter challenges their math - but even aside from that, come on! We found that there was actually a big difference between these groups! You can play around with statistics and show that ignoring this difference meets certain formal criteria for statistical good practice. But the difference is still there and it's real. For all we know it could be driving the Vitamin D results.

Or to put it another way - perhaps correcting for multiple comparisons proves that nobody screwed up the randomization of this study; there wasn't malfeasance involved. But that's only of interest to the Cordoba Hospital HR department when deciding whether to fire the investigators. If you care about whether Vitamin D treats COVID-19, it matters a lot that the competently randomized, non-screwed up study still coincidentally happened to end up with a big difference between the two groups. It could have caused the difference in outcome.

(by analogy, suppose you were studying whether exercise prevented lung cancer. You tried very hard to randomize your two groups, but it turned out by freak coincidence the "exercise" group was 100% nonsmokers, and the "no exercise" group was 100% smokers. Then you found that the exercise group got less lung cancer. When people complain, you do a lot of statistical tests and prove that you randomized everyone correctly and weird imbalances in the group happen only at a chance level - we did say the difference was a freak coincidence, after all. But your study still can't really tell us whether exercise prevents lung cancer).

But this raises a bigger issue - every randomized trial will have this problem. Or, at least, it will if the investigators are careful and check many confounders. Check along enough axes, and you'll eventually always find a "significant" difference between any two groups; if your threshold for "significant" is p < 0.05, it'll be after investigating around 20 possible confounders (pretty close to the 15 these people actually investigated). So if you're not going to adjust these away and ignore them, don't you have to throw out every study?
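To make the arithmetic in that last quoted paragraph concrete, here is a minimal sketch (my own, assuming the baseline checks are roughly independent) of how often at least one of k covariates will come out "significant" at p < 0.05 purely by chance:

    # Chance of at least one spurious "significant" baseline imbalance when
    # checking k roughly independent covariates at p < 0.05.
    def p_any_false_positive(k, alpha=0.05):
        return 1 - (1 - alpha) ** k

    for k in (1, 15, 20):
        print(f"{k:2d} covariates checked -> {p_any_false_positive(k):.0%}")

    # 15 covariates (as in the Cordoba trial) -> ~54%
    # 20 covariates -> ~64%, with an expected 20 * 0.05 = 1 false positive

So a careful trial that checks fifteen baseline characteristics is more likely than not to turn up at least one "significant" imbalance even when the randomization worked perfectly.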

This is hard not just in the sense that rigorous, large-scale, truly randomized studies are expensive and take a long time.  It is hard conceptually.  No one is acting in bad faith, but many of the possible critiques only show up in hindsight.

We are essentially at the frontiers of large-scale rigorous studies, often looking for weak but real signals in noisy data.  We are societally (globally) climbing a very steep learning curve about how to do these studies well and consistently, and there are no pat answers, theoretical or practical.

