Wednesday, January 20, 2016

The globally consequential - what does the data tell us?

From "Is fame fair?" by Amy Yu and César A. Hidalgo, reporting on their original research, "Pantheon 1.0, a manually verified dataset of globally famous biographies" by Amy Zhao Yu, Shahar Ronen, Kevin Hu, Tiffany Lu & César A. Hidalgo.

I posted about the fame aspect a few days ago. The more I thought about the database, the more intrigued I became as to what other questions it might be able to answer. The basic description of the database is from the article.
In our paper published today in Scientific Data, we introduce the Pantheon 1.0 dataset, a dataset that measures the historical fame of all of the individuals in human history that are recorded in more than 25 languages[2] in Wikipedia. The Pantheon 1.0 dataset annotates each individual with their occupation, demographics (year and country of birth), and several metrics of popularity (derived from the number of language editions in Wikipedia and the pageviews received across different languages). But what makes the Pantheon dataset special is that it focuses on a multilingual corpus (more than 200 language editions of Wikipedia), and it introduces a detailed taxonomy of occupations that classifies biographies into 88 distinct categories. The multilingual nature of the Pantheon dataset allows us to focus on globally famous individuals, while discarding those who are only locally famous (for instance, most American Football Players, who are popular in the United States, but unknown for the rest of the world, do not make the cut). Our taxonomy of occupations, on the other hand, allows us to identify individuals that have made similar contributions, allowing us to test the alignment between fame and accomplishment for narrowly defined groups of individuals.
Wikipedia has 290 different language editions, and Yu et al. have exploited that fact. The beauty of this approach comes from how Wikipedia is written:
Wikipedia is written collaboratively by largely anonymous volunteers who write without pay. Anyone with Internet access can write and make changes to Wikipedia articles, except in limited cases where editing is restricted to prevent disruption or vandalism. Users can contribute anonymously, under a pseudonym, or, if they choose to, with their real identity.

The fundamental principles by which Wikipedia operates are the five pillars. The Wikipedia community has developed many policies and guidelines to improve the encyclopedia; however, it is not a formal requirement to be familiar with them before contributing.

Since its creation in 2001, Wikipedia has grown rapidly into one of the largest reference websites, attracting 374 million unique visitors monthly as of September 2015.[1] There are about 70,000 active contributors
There is no organizing party who can impose decisions about content. What is reflected in Wikipedia is a grass roots effort across multiple cultures and languages with volunteers from multiple countries. Anyone, anywhere can contribute (or edit) anything.

Yu et al. focus on individuals who have entries in at least 25 languages (of the 290 language versions of Wikipedia). If a person warrants being written about in 25 languages, they are defined as famous. This 25-language requirement selects for globally recognized individuals such as Charles Darwin, Che Guevara, and Nefertiti and excludes people who are only locally famous such as “Heather Fargo, who is the former Mayor of Sacramento, California.”

There are 11,341 such biographies. Their database records all 11,341 names along with occupation, birth date, birth location, and page views over a six-year window from 2008 to 2013. Yu notes:
Also, 95% of individuals passing this threshold have an article in at least 6 of the top 10 spoken languages worldwide (Top 10 spoken languages by number of speakers worldwide: Chinese, English, Hindi, Spanish, Russian, Arabic, Portuguese, Bengali, French, Bahasa—see: http://meta.wikimedia.org/wiki/Top_Ten_Wikipedias), demonstrating that the Pantheon dataset has good coverage of non-Western languages.
They also collect measures of popularity.
We also introduce the Historical Popularity Index (HPI), a more nuanced metric for global historical impact that takes into account the following: the individual’s age in the dataset (A), or the time elapsed since his/her birth, calculated as 2013 minus birthyear; an L* measure that adjusts L by accounting for the concentration of pageviews among different languages (to discount characters with pageviews mostly in a few languages, see equation (1)); the coefficient of variation (CV) in pageviews across time (to discount characters that have short periods of popularity); and the number of non-English Wikipedia pageviews (vNE) to further reduce any English bias. In addition, to dampen the recency bias of the data, HPI is adjusted for individuals known for less than 70 years. Equation (4) provides the full formula for HPI. There we use log based 4 for the age variable in the aggregation to avoid age becoming the dominant factor in HPI (as it would if we would have used natural log).
In short, HPI discounts English-language bias, recency bias, and brief spikes of fame. Their paper has a good discussion of the remaining weaknesses and potential biases in the data set, but this is good material, and the results are fascinating.
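To make the ingredients of HPI concrete, here is a rough Python sketch. The quoted passage names the components but does not give the formula, so the way they are combined below (a sum of logs, minus the log of CV, with a penalty of (70 - A)/7 for people under 70) is my reading of the paper's equation (4) and should be treated as an assumption. The same goes for computing L* as the exponential of the entropy of pageview shares. All function and parameter names are made up for illustration.

```python
import math
from statistics import mean, pstdev


def effective_languages(views_by_language):
    """L*: language count discounted for concentrated pageviews.

    Assumption: exp(Shannon entropy) of the pageview shares, which equals
    the plain count when views are spread evenly across languages.
    """
    total = sum(views_by_language)
    shares = [v / total for v in views_by_language if v > 0]
    entropy = -sum(s * math.log(s) for s in shares)
    return math.exp(entropy)


def coefficient_of_variation(views_over_time):
    """CV: standard deviation of pageviews over time divided by their mean."""
    return pstdev(views_over_time) / mean(views_over_time)


def hpi(birth_year, views_by_language, non_english_views, views_over_time):
    """Assumed combination of the components named in the quoted passage."""
    age = 2013 - birth_year                   # A, as defined in the quote
    score = (
        math.log(len(views_by_language))      # L: number of language editions
        + math.log(effective_languages(views_by_language))
        + math.log(age, 4)                    # log base 4 keeps age from dominating
        + math.log(non_english_views)         # vNE: non-English pageviews
        - math.log(coefficient_of_variation(views_over_time))
    )
    if age < 70:                              # assumed form of the recency adjustment
        score -= (70 - age) / 7
    return score
```

With views spread evenly over 25 languages, `effective_languages` returns 25 (up to floating point); pile the views into one language and it falls toward 1. A perfectly flat pageview series would give CV = 0 and break the log, so real data would need a guard.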

How to characterize the 11,341 individuals? I'll go with "globally consequential." The argument would be that if you are well enough known to be captured in at least twenty-five languages, i.e. people from 25 different languages/cultures are interested in you, then you can be considered globally consequential.

What are some of the broad outlines of what this database says about who the people of the world are interested in?

I'll take two different views: one for the whole database and one for people born within the past hundred years, i.e. people of the modern era.

The top twenty-five people of interest by HPI are:
Aristotle
Plato
Jesus Christ
Socrates
Alexander the Great
Leonardo da Vinci
Confucius
Julius Caesar
Homer
Pythagoras
Archimedes
Moses
Muhammad
Abraham
Adolf Hitler
Wolfgang Amadeus Mozart
Charlemagne
William Shakespeare
Michelangelo
Augustus
Napoleon Bonaparte
Isaac Newton
Albert Einstein
Christopher Columbus
Johann Sebastian Bach
If you look at contemporary (the past hundred years) entries, there are 5,855 names, 52% of all the names.

What is the distribution of people by continent? The first percentage is for the entirety of history and the second is for contemporary history.
Africa - 419 names out of 10,903 (people whose birth places can be confirmed). 3.8% of the globally consequential. Contemporary: 303 names (out of 5,859) or 5.2%

Asia - 1,188 names, 10.9% of the globally consequential. Contemporary: 543 names, 9.3%

Europe - 6,368 names, 58.4%. Contemporary: 2,645 names, 45.1%

North America - 2,439 names, 22.4%. Contemporary: 1,945 names, 33.2%

Oceania - 123 names, 1.1%. Contemporary: 109 names, 1.9%

South America - 489 names, 4.5%. Contemporary: 310 names, 5.3%
The shift from all history to contemporary represents a diminution of Europe's influence from 58.4% to 45.1%. However, that decline is somewhat misleading from a cultural perspective when you take into account that virtually all of North America, South America, and Oceania are essentially extensions of European culture. From this perspective, European-derived culture goes from 86.4% to 85.5%, a negligible decline, particularly when you take into account some of the names that are classified as Asian or African based solely on their birthplace and not their nationality: Rudyard Kipling, George Orwell, Doris Lessing, Liv Ullmann, Cliff Richard, and the like.
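For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the combined share from the counts quoted above. The denominators (10,903 and 5,859) are the ones I used in the list; the variable and function names are just for illustration.

```python
# Counts as quoted in the list above: (all history, contemporary).
counts = {
    "Africa": (419, 303),
    "Asia": (1188, 543),
    "Europe": (6368, 2645),
    "North America": (2439, 1945),
    "Oceania": (123, 109),
    "South America": (489, 310),
}
ALL_TOTAL, CONTEMPORARY_TOTAL = 10903, 5859  # denominators used in the post


def share(n, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


def combined_share(regions, period_index, total):
    """Share of the summed counts for several regions in one period."""
    return share(sum(counts[r][period_index] for r in regions), total)


european_derived = ("Europe", "North America", "Oceania", "South America")
all_history = combined_share(european_derived, 0, ALL_TOTAL)
contemporary = combined_share(european_derived, 1, CONTEMPORARY_TOTAL)
print(all_history, contemporary)
```

Europe alone drops from 58.4% to 45.1%, while the combined figure barely moves.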

Who are the top 25 among contemporaries? Quite a mixed bag. This is where the multinationalism of Wikipedia becomes much more apparent. Many of these names would not be here on a strictly American list.
Che Guevara
Martin Luther King, Jr.
Elvis Presley
Marilyn Monroe
Jimi Hendrix
Andy Warhol
Bob Marley
Bruce Lee
Bob Dylan
John F. Kennedy
Fidel Castro
Saddam Hussein
Stanley Kubrick
Gabriel García Márquez
John Lennon
Marlon Brando
Pelé
Pope John Paul II
Mikhail Gorbachev
Édith Piaf
Elizabeth II of the United Kingdom
Johnny Cash
Ingmar Bergman
Michel Foucault
Stephen King
Clearly this list is dominated by entertainment celebrities, politicians, and sportsmen. Not one scientist. Not one businessman. Only one philosopher. Only one religious figure. This might be a fun group at a party. Not so much for restarting civilization.

Che Guevara is the number one on that list in terms of HPI and nearly so for pageviews. Really, a Marxist mass murderer? Data like this brings you face to face with the realization that others see things differently.

What about gender breakdown?

In the contemporary period, of the 5,859 globally consequential people, 18.9% are women. This matches the ranges you see in the US. For any field of competitive endeavor, the number that reach the top (partners in law firms, judges, award winners in literature, etc.) in the US falls in the range of 15-30%.

Globally, it varies dramatically by region.
Africa - 23/303. 7.6% of contemporary globally consequential individuals in Africa are female.

Asia - 102/543. 18.8% are female.

Europe - 405/2,645. 15.3% are female.

North America - 518/1,945. 26.6% are female.

Oceania - 34/109. 31.2% are female.

South America - 28/310. 9% are female.
The female share of globally consequential people ranges from 7.6% to 31.2% depending on the region.

For the USA, the numbers are 464/1,727, or 26.9%, which would rank second among the regions above, behind only Oceania. A different way of looking at it is that the US produces 464 of the 1,110 globally consequential females, i.e. 41.8%. Given that we are ~5% of the global population, that is pretty stellar.
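The regional and US percentages can be recomputed from the counts quoted above with a few lines of Python. This is only a sketch for checking the arithmetic; the names are illustrative.

```python
# (female, total) among contemporary entries, as quoted above.
by_region = {
    "Africa": (23, 303),
    "Asia": (102, 543),
    "Europe": (405, 2645),
    "North America": (518, 1945),
    "Oceania": (34, 109),
    "South America": (28, 310),
}


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


female_share = {region: pct(f, n) for region, (f, n) in by_region.items()}
total_female = sum(f for f, _ in by_region.values())

us_female, us_total = 464, 1727
us_share_within = pct(us_female, us_total)            # women among US entries
us_share_of_all_women = pct(us_female, total_female)  # US women among all women

print(female_share)
print(total_female, us_share_within, us_share_of_all_women)
```

The regional counts add up to 1,110 women, and the US share of that total comes out at 41.8% when rounded to one decimal.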

But a lot of these people are entertainers and celebrities. Let's look at the numbers for people in the STEM fields as well as business.

There are 477 globally consequential people from the STEM fields. 16 of them are women, 3.4%.

The USA has 225 globally consequential scientists, 47.2% of the global total. Of American scientists, 7 are female, 3.1%.

Europe has 174 globally consequential scientists, 36.5% of the global total. Of European scientists, 5 are female, 2.9%.

What about Scandinavia, which has among the most accommodating laws regarding women in the workplace? It has 13 globally consequential scientists and none of them are women.

Let's include business as another field.
Global - 4/67. 6.0% are female.

USA - 1/34. 2.9% are female.

Europe - 1/21. 4.8% are female.

Scandinavia - 0/1. 0% are female
These numbers from business and STEM confirm what I have observed in the past. Countries with the most pro-natalist policies, theoretically making it easiest for women to remain in the workforce, also have the lowest levels of female participation and achievement. At first this appears to be either a paradox or an example of unintended consequences.

I think this actually ties back to the underlying attributes of achievement. Malcolm Gladwell popularized the notion that to be distinguished in any field, you have to invest 10,000 hours of practice.

The research I have read indicates to me that while the notion might be directionally right, there are important details. Among the caveats:
The number of hours varies by field of endeavor with more hours required for those fields that are more competitive and fewer hours in less contested fields.

It's not just anyone. You have to have at least some native talent and, even more importantly, some degree of motivation.

It's not just any hours. They have to be focused, purposeful, concentrated and continuous. Putting in 10,000 hours over 20 years doesn't get you close to the performance of someone who has put in 10,000 hours over four years. The hobbyist will not perform at the level of the expert.

Interruptions in practice are highly detrimental.
This is all consistent with the research of Claudia Goldin and others who have found that career advancement and compensation are entirely dependent on hours of purposeful, intense, and continuous work and that there are no measurable consequences attributable to discrimination.

While the popular chatter is about finding ways to reduce the speculated impact of discrimination, all the research indicates that discrimination is not a measurable issue. The research points to different strategies for achieving equal outcomes between the sexes in careers. The first is to change societal norms so that it is equally probable that either the male or the female in a married pair will be the designated secondary career/primary caregiver of the children. The caregiver, male or female, takes a significant career/achievement hit simply because of the hours spent on caregiving. The primary career, male or female, remains free to invest the purposeful, intense, and continuous hours necessary to achieve the exceptional outcomes that are most rewarded.

The second strategy is to change work habits in such a way that exceptional performance can be achieved more quickly than the putative 10,000 hours.

Neither of these strategies is easy, or even probable, which is why the default conversation is about putative discrimination. We can make discrimination illegal. The problem is that gender-based discrimination is illegal and has been for fifty years, and yet the gender gaps remain. Discrimination is by and large a red herring which distracts from the real, but really difficult, actions that could be taken to change the outcomes.

One last cut of the data. Let's count as producers, the people who create things, those in Business & Law, Exploration, Humanities, and STEM. Let's exclude the entertainers (Artistic performers and Sports) and those in governance (Institutions and Public Figures). How many producers are there compared to the number of entertainers and governors?

893/5,859 are producers, 15.2%. That seems dreadfully low. 15% to produce and 85% to govern and entertain? Yikes. What if we were able to invert those numbers, so that 85% of our globally consequential people produced new ideas, knowledge, products, and services and only 15% were entertainers or governors? But that's not our world. Not today.
