Disproving unsupported hypotheses in the absence of robust data

I have been involved in a discussion regarding how one might ascertain whether a given disparate outcome is the result of deliberate discrimination, systemic bias, unconscious bias, or other, legitimate causes. The decision-making and argument assessment aspect of the issue is covered in Thomas Gilovich’s rather good book, How We Know What Isn’t So, but it is always interesting to test strongly held beliefs with data, fragmented, incomplete and questionable as some of that data might be. It forces an exercise of analytic imagination.

Useful contextual background to the exercise: The discussion has centered on the Caldecott and the Newbery Awards, annual Medals and Honors administered by the American Library Association. The Caldecott is awarded to the best children’s illustrated book of the year and the Newbery to the best Middle Grade/Young Adult fiction. Best is a subjective judgment arrived at by the judges. The awards are run separately, each with a fifteen person panel of judges reading and reviewing up to 700 candidate books per judge in the case of the Caldecott. Membership of the panel changes every year but the rules are broadly stable over time (the Newbery was established in 1922 and the Caldecott in 1938). Any book published in the USA by an American citizen or American resident is eligible for consideration. The Caldecott always awards a Medal and usually 1-5 Honors. The Newbery likewise. Both the Newbery and the Caldecott are very influential in the long term commercial success of a title. The Library profession is 75-85% female depending on how a professional librarian is defined and consequently both the award panels, made up of ALA members, are majority female.

The issue that arises each year is that there is a gender skew in the awards, a pattern of long standing. Year in, year out, roughly 65% of the Caldecott awards on average are received by male illustrators. Likewise, roughly 65% of the Newbery awards are received by female authors. The Newbery skew usually does not excite much comment because it is assumed that the significant majority of MG/YA writers are female.

It is the Caldecott skew that draws attention, principally because there is not a corresponding assumption that the majority of children’s book illustrators are male. The argument is made each year that there must be some form of bias or discrimination occurring simply because of the resultant male skew. No corresponding argument is made for the Newberys. Members of past Caldecott panels of judges have routinely affirmed that they are aware of the controversy but that there is nothing in the process that would explain the outcome. Any book published in America by an American or an American resident can be submitted for consideration; there is no filtering barrier other than the judgment of the fifteen judges themselves. Past participants have indicated that they are gender blind through the process and are focused solely on merit and quality. Those who argue that there is a bias in play usually concede that it is neither deliberate discrimination nor conscious bias. Rather they argue that there is some unidentified systemic bias or some unconscious bias that must be generating the results. That is the argument that is being tested with the weak data sets available.

So the rub of the issue is how to explain two strong gender skews in awards but in opposite directions from award processes that are stable over time, are administered by the same organization which is majority female, are ungated in terms of books that can be considered and which have multiple different participants each year.

The intellectual challenge is exacerbated by four fundamental contextual issues. 1) Voluminous heterogeneity. The publishing industry produces some 25-35,000 new titles each year from seven large publishing houses but also from hundreds or thousands of mid and small sized regional and specialized publishers. There are tens of thousands of authors and illustrators or aspirants. 2) Data dearth. There is a paucity of robust, reliable and publicly available data and statistics. The consequence is that answering any question entails fairly laborious data collection and analysis with the sampling and calculation errors inherent to such a cumbersome process. 3) Statistical complexity. It is a statistical truism that elite performance and samples from the margin are unstable and never representative of the macro population no matter the field of endeavor. The consequence is that it is challenging to ensure that there is an apples-to-apples comparison being made. Finally, 4) Directional Causation. The data is so scarce that even when an outcome can be demonstrated to be factually true, it is difficult to ascribe root causes.

A final complication is whether this is even an issue of any consequence. If the awards are skewed and for valid reasons, is there any greater consequence? Some people are concerned that there should be demographic equivalence of outcome because they believe that is an important goal in its own right. Others are more concerned about equity of process; if everyone is subject to the same rules and process and there is a non-biased result, that is a good outcome whether or not it reflects the demographic population at large.

Argument to date: So far in the conversation we have established that there is in fact a longstanding skew in both the Newbery and Caldecott awards, the Newbery towards female authors and the Caldecott towards male illustrators with the skew being about 65% in either direction (65% female in Newbery and 65% male in Caldecott). The Effect is Real.

No one has identified any meaningful or measurable consequences of the skew other than individual concern that skews might reflect bias. Right now this is purely an academic argument because there appear to be no positive or negative consequences to the skews. There is no known or measured consequence.

No one has created a sample of all MG/YA authors and all book illustrators to test whether in fact there is a skew in the writing and illustrating population that would explain the awards skew.

However, a sample of starred reviews by the major reviewing organs in the children’s book industry for a particular publishing house indicates that male illustrators account for 70% of the critically received books and female illustrators for 30%. A similar sample for MG/YA books from the same source indicates that critically received MG/YA authors are 62% female and 38% male. It appears from this small but random sampling that the Caldecotts and Newberys are indeed being awarded in proportion to the gender split of the participants in the respective fields of authorship and illustration. The direct cause is likely the gender skew in the candidate populations, not the judging process.
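One way to make the proportionality claim concrete is an exact binomial test: if awards were drawn from a candidate pool that is 70% male, how surprising would a 65% male award share be? The sketch below is illustrative only. The count of 20 awards is a hypothetical figure chosen for the example (the sample sizes behind the percentages above are not given), and the function name is my own.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, pool_share: float) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one, assuming each award is an independent draw from a pool
    with the given share.
    """
    def pmf(k: int) -> float:
        return comb(trials, k) * pool_share**k * (1 - pool_share) ** (trials - k)

    observed = pmf(successes)
    # Small relative tolerance so that ties are not lost to float rounding.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


# HYPOTHETICAL illustration: 13 of 20 awards (65%) go to male illustrators,
# tested against the 70% male share seen in the starred-review sample.
p_value = binomial_two_sided_p(successes=13, trials=20, pool_share=0.70)
print(f"two-sided p-value: {p_value:.2f}")
```

A large p-value here would mean only that the award share is compatible with the pool share at this sample size, not that the two are proven equal.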

But if there is no systemic or unconscious individual bias among judges, what might be the cause of the skews in the candidate population? What might be the ultimate cause?

My working hypothesis is that, as in other industries, the elite performers (award winners) in publishing are characterized by prolonged, voluminous effort (work intensity) in their specified field. Any demographic skewing (gender or ethnicity, or orientation, or religion, or class, etc.) that occurs is usually a byproduct of the independent variable which is intensity and extensiveness of competitive effort in the field. In other words, demographic skewing usually arises because of barriers to voluminous and sustained effort, not because of bias and exclusion. But what data can be found to advance that hypothesis?

One of the participants in the discussion created a list (referred to as Overlooked Books list) of several dozen books illustrated by women in 2013 that she considered of superior quality and as evidence that good work might be being overlooked or not considered. That list doesn’t in itself advance the argument regarding gender skew because it is non-random, purposely selected data, i.e. an equivalent list could be constructed for overlooked males. With 35,000 new titles a year, it is the fate of most books to be overlooked regardless of illustrator gender. However, the Overlooked Books list does provide data to test the work intensity hypothesis.

I took a random sample of twenty titles on the Overlooked List as well as the twenty most recent Caldecott Medal and Honor winners (back to and including 2009) and looked for patterns in the data. The first observation is that only 40% of the Overlooked Books list are author/illustrator whereas, for whatever reason, Caldecott Medals go 71% to author/illustrators, i.e. books where the author and illustrator are the same person. In addition, four of the Overlooked Books list are from publishers that have never won a Caldecott.

What I then did was to try and find some proxies for voluminous and sustained effort. Of course the data is not nearly as granular or precise as we might wish. Allow also that I did this very quickly so there are going to be some errors because the data is scattered all over the place.

For each illustrator, I tried to determine how many titles they have illustrated (proxy for volume of effort) and when their first book was published (proxy for duration of effort). As a check, since many illustrators appear to omit early work which might be characterized as contract illustration of books, I also looked at the number of times the illustrator was cited in WorldCat. Finally, I double checked against Amazon information, Wikipedia, and the illustrator’s own website where available. I can’t vouch for the strict accuracy of all this public data, and I had to make some judgment calls here and there where there were inconsistencies, but I think the accuracy is sufficient for our purposes. This analysis also fails to capture occasions where an illustrator might have had a multi-year hiatus in publishing.

I ended up omitting three of the illustrators from the Overlooked Books list of 20 (Marla Frazee, Melissa Sweet, and Erin Stead). All three won either a Medal or Honors in the 2009-2013 time frame, just not for the titles on the Overlooked Books list. It seemed illogical to designate them as overlooked if they have actually won recognition from Caldecott so they are included in the Caldecott winner list but not in the Overlooked List. In fact Frazee won two Honors.

If the work intensity hypothesis is correct, then we would predict that Caldecott winners should have evidence of more effort over longer periods of time than others whose work seems commendable but who have not won. In other words, the data should have a roughly normal distribution by years publishing and number of books illustrated with a peak presumably somewhere between 10-40 years. But that’s not all we can glean from the data.

Each year when the skewed Caldecott results come out, there is a lot of speculation as to what other causes there might be to explain the skew. We have already confirmed that the skew is unlikely to be a source of direct concern in that awards appear to be going in proportion to the gender mix of published and critically received books. However, the Overlooked Books data also allows us to test the other hypotheses that are usually advanced to explain the Caldecott skew. While there is a lot of speculation and a variety of theories, the most common hypotheses might be characterized as: 1) there is some unconscious negative bias against female illustrators, 2) there is some unconscious positive bias based on male illustrators being charming Lotharios, 3) that the awards are in essence simply a luck of the draw, and 4) the awards are the result of raw talent some of which is not being recognized. These are testable hypotheses using the Overlooked Books data.

If the Negative Female Bias hypothesis is correct, then the Overlooked Books age and volume data should look the same as that of the female Caldecott winners, i.e. equal qualifications but disproportionately omitted. If the Lothario hypothesis is correct, then, based on the assumptions underlying this hypothesis, you would expect the male illustrator winners to have less time in the industry. If the Luck-of-the-Draw hypothesis is correct, then there should be no pattern in the volume and intensity data at all. If the Raw Talent hypothesis is correct, there should also be a skew to the left, i.e. younger. Raw talent can enter a field at any time but is usually most prevalent at the earliest ages. So, if any of these hypotheses is correct, we would expect the corresponding one of four patterns. Negative Female Bias: identical intensity patterns between Overlooked Books and female Caldecott winners. Lothario: left skewed for male Caldecott winners (younger and less experienced). Luck-of-the-Draw: no pattern to the data at all, simply random. Raw Talent: left skewed for all participants by age, regardless of gender and, more importantly, no correlation with number of books ever published.

The results of the analysis are consistent with the intensity hypothesis. Caldecott winners have illustrated nearly twice as many books on average as the overlooked illustrators (23.2 books versus 12.9 books) and have spent more than twice as many years being published (18.9 years versus 7.8 years). The Caldecott winners have about twice as many WorldCat citations (on a normalized year basis) as the overlooked illustrators.
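The "nearly twice" and "more than twice" comparisons can be checked directly from the averages quoted above; they work out to roughly 1.8 and 2.4. A minimal sketch, with variable and function names of my own choosing:

```python
# Averages quoted in the text above.
caldecott = {"books": 23.2, "years": 18.9}
overlooked = {"books": 12.9, "years": 7.8}


def ratio(winner_value: float, overlooked_value: float) -> float:
    """How many times larger the winners' average is than the overlooked average."""
    return winner_value / overlooked_value


books_ratio = ratio(caldecott["books"], overlooked["books"])
years_ratio = ratio(caldecott["years"], overlooked["years"])

print(f"books illustrated: {books_ratio:.2f}x")
print(f"years published:   {years_ratio:.2f}x")
```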

There were only four illustrators on the Overlooked Books list that had an intensity profile close to the Caldecott average (23 books and 19 years) for number of books and years being published. Two had already won Caldecotts (Frazee and Sweet) and two had not, Anne Wilsdorf and Mary GrandPre. Anne Wilsdorf may produce excellent work but could not be considered for the Caldecott as she is Swiss and lives in Switzerland. Consequently, there is only one candidate on the list of twenty who is eligible for the Caldecott, has the work duration and intensity of other winners but has not yet won an award. Presumably, based on her numbers, we can forecast that there is a reasonable chance that Mary GrandPre will win something in the next five years.

So this rough sampling of data is consistent with what we would expect if we believe that the publishing industry is like other competitive fields where awards are going in proportion to those that have produced the most work over longer time frames, i.e. voluminous and continuous effort in a competitive field.

The data confirms the hypothesis that the probability of winning a Caldecott Honor or Medal is related to both years in publishing and number of books published. But the data does more than that: it refutes the more popular alternate explanations.

Female Negative Bias is not evident as the Overlooked Books and the female Caldecott winners have entirely different work intensity profiles. With regard to the Lothario Hypothesis, there is no male left skew by age. In fact, male winners average 21 years versus the overall average of 19 years for both genders (17 years for female winners) and 8 years for the Overlooked Books list. The Luck-of-the-Draw Hypothesis is also refuted. Instead of no pattern to the data, there is a strong normal distribution. Finally, the Raw Talent Hypothesis is also not supported: there is no left skew by age for either gender and there is a strong correlation with winning and number of books ever published.
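Two of these comparisons can be sketched from the rounded averages quoted in this paragraph alone. The 20% similarity threshold is an arbitrary assumption for illustration, as is reading the Lothario prediction as "male winners have fewer years than female winners". The Luck-of-the-Draw and Raw Talent checks depend on the shape of the full distribution, which averages cannot show, so they are left out.

```python
# Average years since first publication, as quoted in the paragraph above.
YEARS = {
    "male_winners": 21,
    "female_winners": 17,
    "all_winners": 19,
    "overlooked": 8,
}

# ASSUMPTION: two averages count as "similar" if they differ by less than 20%
# of the larger one. The threshold is arbitrary and only for illustration.
SIMILARITY_TOLERANCE = 0.20


def similar(a: float, b: float, tolerance: float = SIMILARITY_TOLERANCE) -> bool:
    """True if the relative gap between two averages is under the tolerance."""
    return abs(a - b) / max(a, b) < tolerance


def check_predictions(years: dict) -> dict:
    """Return, per hypothesis, whether its predicted pattern shows up in the averages."""
    return {
        # Predicts that overlooked illustrators look like female winners.
        "negative_female_bias": similar(years["overlooked"], years["female_winners"]),
        # Predicts that male winners have less time in the industry than female winners.
        "lothario": years["male_winners"] < years["female_winners"],
    }


for name, pattern_present in check_predictions(YEARS).items():
    print(f"{name}: predicted pattern {'present' if pattern_present else 'absent'}")
```

With these inputs neither predicted pattern appears, which matches the reading given in the text.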

It would appear that the Caldecott judges who claim that they are not focusing on gender and that the results reflect real judgment based on quality are more likely to be correct than not. There has never been any evidence of deliberate or conscious discrimination. The Caldecott data and the critically recognized (starred) books data indicate that Caldecotts are being won in proportion to the likely candidate pool. The Overlooked Books data confirms that there is a material difference between overlooked illustrators and Caldecott winners, both male and female, in terms of volume and duration of work. The Overlooked Books data and the Caldecott data also provide no support for the Female Negative Bias Hypothesis, the Lothario Hypothesis, the Luck-of-the-Draw Hypothesis, or the Raw Talent Hypothesis.

Perceived problems cannot be resolved without knowing real/ultimate root causes. Over this conversation, we have produced little evidence of any overt or unconscious gender bias, or of any negative consequence of the awards’ gender skew. However, as a result of trying to validate suppositions, we have surfaced seven separate patterns that are either notable (the first two) or intriguing and which I think are worth mulling on.

1) Only a handful of publishing houses ever win a Caldecott. This is not too surprising as it is a dynamic common in other industries, but it is worth having documented. The published Caldecott nomination process is inclusive and encompassing. It is not clear that there is any sort of requirement or step that is having the unintended consequence of overly constricting the population for review. I rather doubt it, but it is a logical possibility and ought to be considered.

2) The publishing industry appears to conform with other competitive fields of endeavor where elite performance is causatively and logarithmically associated with voluminous and sustained effort. This also is not particularly surprising as it is a common phenomenon.

3) The Caldecott Medal goes strongly (71%) to books that are author/illustrator whereas the population of Honors and general population of reviewed books are more 40-50% author/illustrator. I cannot identify a compelling (or really, any) reason for that variance but I find it notable and intriguing.

4) Female dominance of the Newberys. We know that there is a preponderance of female authors in MG/YA, which is something of an anomaly. Worked hours (compensated) in the national economy are, averaged over all sectors, about 65% male, 35% female. In adult fiction, in terms of reviews and prizes such as the Pulitzer Prize for Fiction, there is also a rough 70% male, 30% female split. So why is youth fiction skewed one way and adult literary fiction the other? I don’t think it is a material issue but it is a puzzling one.

5) There is a surprisingly long gap between initial publication and eventual award recognition (average is 18.9 years). For most of the very long established Caldecott winners (e.g. David Small, Pamela Zagarenski, John Rocco, Chris Raschka, Lane Smith, Bryan Collier, Jerry Pinkney, Marla Frazee, etc.), it seems to me that their distinctive styles were clear and well established within five years of their first publication. In a handful of cases there seems to be continuing growth and experimentation but it appears to me that styles are quickly established. Is there really that much refinement that occurs between 7.8 years (the Overlooked Books average) and 18.9 years (the Caldecott winners’ average)? Possibly, but this also strikes me as intriguing. Perhaps it just takes that long for new distinctive styles to become absorbed into the critical consciousness.

6) There is a very high standard deviation for the 2009-2013 population of Caldecott winners. While the average was 18.9 years, with most clustered at 15-25 years of publishing experience, the actual range was from 3 years since first published (Erin E. Stead) to 50 years (Uri Shulevitz). There clearly is little that precludes winner recognition at either the lower or the upper bound. Stead is by far the outlier: only 5 illustrator winners in the 2009-2013 population have fewer than ten years of experience in published illustration, and Stead is the only female in that group.

7) Finally, there is not a material difference between male and female winners in terms of publishing experience. Males have an average of 20.8 years of published experience versus the average of 18.9 and females have an average of 16.3 years (even with the Stead outlier bringing down the average – without Stead’s number, the female average would be 18.1 versus the overall average of 18.9).
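The effect of removing a single outlier from an average, as done for Stead in item 7, follows from simple arithmetic: multiply the average by the group size, subtract the outlier, and divide by one fewer. The number of female winners is not stated in the text, so the sketch below tries a few hypothetical group sizes. With the rounded figures quoted, a group of eight or nine lands close to the 18.1 given above.

```python
def mean_without_outlier(mean_all: float, n: int, outlier: float) -> float:
    """Mean of the remaining n - 1 values after removing one value from a group of n."""
    if n < 2:
        raise ValueError("need at least two values to remove one")
    return (mean_all * n - outlier) / (n - 1)


FEMALE_MEAN_YEARS = 16.3  # female winners' average, from the text
STEAD_YEARS = 3           # Stead's years since first publication, from the text

# ASSUMPTION: the number of female winners is not stated, so try a few group sizes.
for n in range(6, 11):
    adjusted = mean_without_outlier(FEMALE_MEAN_YEARS, n, STEAD_YEARS)
    print(f"n = {n}: female average without Stead = {adjusted:.1f} years")
```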

This has been a great conversation for forcing critical examination of assumptions and logical inference, as well as for examining how both a positive argument and a negative argument can be made and supported with data even when the quality and availability of the data is less than might be desired. Nothing is proven, but the odds are significant that there is no bias in the Caldecott selection process. The gender skew is most likely a product of the gender proportions among experienced illustrators, and the proportion of experienced illustrators is in turn a function of the capacity to invest long hours over an extensive period of time in the craft of illustration. The ultimate cause is individual career choices and investments (hours and duration).

