Wednesday, October 23, 2019

Checking the math - the downfall of the blind auditions research

I have long been very skeptical of claims of systemic discrimination against various groups. Discrimination by individuals certainly exists, and is sometimes deliberate, but it seems to me to be relatively rare and, more importantly, noisy. In other words, whatever the category of identity, some individuals discriminate positively and some negatively.

Other forms of disparate impact exist. People of one identity or another are less prevalent in an occupation, for example, but almost always, once you begin to take into account reasonable confounding factors such as age, height, IQ, educational attainment, criminal record, and duration in the field, the disparity disappears.

There is person-to-person discrimination, but it is noisy and rarely systemic.

There is frequent disparate representation, which creates the appearance of discrimination, but it is a product of confounds.

I am sure there are occasions where there is systemic discrimination at an enterprise level because someone in a particular position is able to hijack the selection process.

But that becomes hard to identify because of all the over-claiming.

One of the few studies I have seen which I took to be reasonable evidence of possible systemic bias was the orchestra blind auditions study.

From "Blind Spots in the ‘Blind Audition’ Study" by Christina Hoff Sommers:
It is one of the most famous social-science papers of all time. Carried out in the 1990s, the “blind audition” study attempted to document sexist bias in orchestra hiring. Lionized by Malcolm Gladwell, extolled by Harvard thought leaders, and even cited in a dissent by Justice Ruth Bader Ginsburg, the study showed that when orchestras auditioned musicians “blindly,” behind a screen, women’s success rates soared. Or did they?

Nobody questions the basic facts that led to the study’s publication. During the 1970s and ’80s, America’s orchestras became more open and democratic. To ensure impartiality, several introduced blind auditions. Two economists, Claudia Goldin of Harvard and Cecilia Rouse of Princeton, noticed that women’s success rates in auditions increased along with the adoption of screens. Was it a coincidence or the result of the screens? That is the question the two economists tried to answer in “Orchestrating Impartiality: The Impact of ‘Blind’ Auditions on Female Musicians,” published in 2000 in the American Economic Review.

They collected four decades of data from eight leading American orchestras. But the data were inconclusive: The paper includes multiple warnings about small sample sizes, contradictory results and failures to pass standard tests of statistical significance. But few readers seem to have noticed. What caught everyone’s attention was a big claim in the final paragraph: “We find that the screen increases—by 50 percent—the probability that a woman will be advanced from certain preliminary rounds and increases by severalfold the likelihood that a woman will be selected in the final round.”

According to Google, the study has received more than 1,500 citations in academic articles and thousands of media mentions. It has been featured in TED Talks, celebrated at the Davos conference, and showcased in so many diversity workshops that one attendee begged never to hear about it again. Inspired by the “academically verified Orchestra study,” GapJumpers, a Silicon Valley startup, offers companies software to conduct blind interviews in other contexts.

The study’s appeal is clear: Two prominent economists, in a top journal, wielding state-of-the-art econometrics, captured and quantified bias against women and documented a solution. Or so it seemed.

The research went uncriticized for nearly two decades. That changed recently, when a few scholars and data scientists went back and read the whole study. The first thing they noticed is that the raw tabulations showed women doing worse behind the screens. But perhaps, Ms. Goldin and Ms. Rouse explained, blind auditions “lowered the average quality of female auditionees.” To control for ability, they analyzed a small subset of candidates who took part in both blind and nonblind auditions in three of the eight orchestras.

The result was a tangle of ambiguous, contradictory trends. The screens seemed to help women in preliminary audition rounds but men in semifinal rounds. None of the findings were strong enough to draw broad conclusions one way or the other.
A more technical review is available in "Orchestrating false beliefs about gender discrimination" by Jonatan Pallesen:
The paper also looks at the issue from another angle: How do blind auditions impact the likelihood of a woman ending up being hired? The results in this section (even after a statistically questionable data split) are not significant, as they state in the paper:
Women are about 5 percentage points more likely to be hired than are men in a completely blind audition, although the effect is not statistically significant. The effect is nil, however, when there is a semifinal round, perhaps as a result of the unusual effects of the semifinal round. The impact for all rounds [columns (5) and (6)] is about 1 percentage point, although the standard errors are large and thus the effect is not statistically significant.
So, in conclusion, this study presents no statistically significant evidence that blind auditions increase the chances of female applicants. In my reading, the unadjusted results seem to weakly indicate the opposite, that male applicants have a slightly increased chance in blind auditions; but this advantage disappears with controls.
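To make "not statistically significant" concrete, here is a minimal sketch of the arithmetic behind that kind of statement. The 5 percentage point estimate comes from the passage quoted above. The standard error is not given in the quoted text, so the value used below is purely hypothetical, chosen only for illustration, and the simple normal (z) test is my assumption rather than the method the paper used.

```python
import math


def two_sided_p_value(estimate, std_error):
    """Return (z, p) for a two-sided normal test of `estimate` against zero."""
    z = estimate / std_error
    # Standard normal CDF evaluated at |z|, written with the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return z, 2.0 * (1.0 - cdf)


def confidence_interval(estimate, std_error, z_crit=1.96):
    """Approximate 95% confidence interval under a normal approximation."""
    return estimate - z_crit * std_error, estimate + z_crit * std_error


# Point estimate from the quoted passage: about 5 percentage points.
estimate = 0.05
# HYPOTHETICAL standard error, for illustration only (not from the paper).
hypothetical_se = 0.04

z, p = two_sided_p_value(estimate, hypothetical_se)
low, high = confidence_interval(estimate, hypothetical_se)

print(f"z = {z:.2f}, two-sided p = {p:.3f}")
print(f"approx. 95% CI: ({low:.3f}, {high:.3f})")
```

With a standard error of that size, the interval comfortably includes zero, which is what "the standard errors are large and thus the effect is not statistically significant" means in practice. A different standard error would of course give a different answer, which is why the real figures in the paper are what matter.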
I should have known better, as this is social science, where hardly any research holds up. I have long respected Claudia Goldin (one of the authors of the original research) and the quality of her work. It sounded plausible, I trusted her, and I did not bother to do the kind of detailed review of the paper that Pallesen has done. Shame on me.

But shame on us all as well. This is a seminal research paper, widely cited in the academic literature and even more widely referenced in casual conversation. Nearly twenty years, more than 1,500 citations, thousands of presentations, tens of thousands of listeners. And no one checked the math.

And one of the few seemingly valid pieces of evidence to support systemic gender discrimination goes up in smoke.
