Friday, April 3, 2020

It is hard to let go of bad analysis on which you have spent time and money

Yesterday, the NYT had one of their cloistered hot-takes on the ineffable moral deficiencies and insufferable ignorance of all the rest of the nation. From Where America Didn’t Stay Home Even as the Virus Spread by James Glanz, Benedict Carey, Josh Holder, Derek Watkins, Jennifer Valentino-DeVries, Rick Rojas and Lauren Leatherby. Count them - seven journalists, cartographers, data analysts.
Stay-at-home orders have nearly halted travel for most Americans, but people in Florida, the Southeast and other places that waited to enact such orders have continued to travel widely, potentially exposing more people as the coronavirus outbreak accelerates, according to an analysis of cellphone location data by The New York Times.

The divide in travel patterns, based on anonymous cellphone data from 15 million people, suggests that Americans in wide swaths of the West, Northeast and Midwest have complied with orders from state and local officials to stay home. Disease experts who reviewed the results say those reductions in travel — to less than a mile a day, on average, from about five miles — may be enough to sharply curb the spread of the coronavirus in those regions, at least for now.

“That’s huge,” said Aaron A. King, a University of Michigan professor who studies the ecology of infectious disease. “By any measure this is a massive change in behavior, and if we can make a similar reduction in the number of contacts we make, every indication is that we can defeat this epidemic.”

But not everybody has been staying home.
Which is fine as far as it goes.

But boy, it doesn't go very far. This has the feel of a research fishing expedition, fed on petty prejudices which got out of hand, and which they then had to salvage as best they could since they had already spent so much time and money on it. They needed to show they had at least something for their efforts. Even if it is a GIGO something.

Cool! We have data, let's see what it can tell us.
The location data, from Cuebiq, a data intelligence firm, measures the range that people travel each day. It cannot predict where outbreaks will spread, and it does not track how many interactions people had while they were traveling. Not all travel is problematic: A person driving for a few miles to pick up groceries would not be violating stay-at-home orders. And people in cities can infect others without traveling far.
OK. We know from movement of cellphones that some people in some areas are traveling more than others. We, the NYT team, want to show that quarantines should have happened sooner, that people ought to comply better, and that with good compliance with quarantines, the results will be better.

What they end up conclusively demonstrating is that people in rural areas with less density do more driving.

For the same organ which got confused about the difference between the popular vote and the electoral college in 2016, this might pass muster as insight, but for most of us outside the Mandarin compound, it is an appalling reflection on the journalists and the NYT.

Instead of getting caught up in the wizardry of data sets and pretty maps, the NYT team should have thought in terms of arguments, measurements and information. They are making an argument. Something like: "Social distancing is an important tool in the arsenal of controlling Covid-19. To increase social distancing, people should be commuting less. If you reduce commuting, you will reduce the spread of Covid-19."

It is not a good argument, but it is clearly better than the random generation of maps and speculative interpretation that they chose to do instead.

The first element of the argument is speculative at this point, but I am pretty confident it is going to end up being proven true. The whole pandemic might have been smothered in its natal bed had everyone immediately 1) social distanced/reduced interactions, 2) used masks in all social environments, 3) adopted the habit of routinely washing hands throughout the day, and 4) conducted massive on-demand testing of suspected cases as well as population-level testing.

There are plenty of reasons why leaders chose not to do this but that is what we will probably end up deciding we should have done.

So let's accept that social-distancing will turn out to have been beneficial. Is there anything in the data presented which highlights that social-distancing is beneficial to epidemic containment? If the maps show that early adopters of social-distancing and constrained travel are also the beneficiaries of reduced exposure, that would be compelling, though not definitive, evidence of the efficaciousness of social-distancing.

They don't do so. Pulling in evidence not presented, we in fact know that the locations with markedly shorter commutes also have a lot of Covid-19 cases. It is hard not to reach the incorrect conclusion that there is a tight correlation between places with early-onset social distancing and rapid onset of new Covid-19 case loads.

To strengthen their argument, they need to correlate social-distancing with lower incidence rates. They make no effort to establish such an association, much less a causal association.

They also fail to establish a baseline. They are showing maps of average commute distances in conjunction with date of lock-downs at a national level. Fine. We need granularity. We need before and after by locations. We need to know what the average commute distance was before the lock-down and afterwards. In one part of the article they offer that the average travel distance fell in Seattle from 3.8 miles in February to 61 feet in March. Seems improbable but let's take it as read. They compare this to Daytona, Florida where, in the same period, the travel distance fell from 4.4 miles to 1.9 miles. Nice examples but that is cherry-picked data. We need to know the average travel distance pre and post and they do not demonstrate that in a systematic and comparative fashion.
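To make the baseline point concrete, here is a minimal sketch of the before-and-after comparison applied to the only two pairs of figures the article quotes. The function name and the data layout are my own illustration, not anything the NYT published; a systematic version would run the same calculation for every location.

```python
FEET_PER_MILE = 5280

def percent_reduction(before_miles, after_miles):
    """Percentage drop in travel distance relative to the earlier baseline."""
    return 100.0 * (before_miles - after_miles) / before_miles

# The two February-versus-March pairs quoted in the article, in miles.
# Seattle's March figure is given as 61 feet, so it is converted here.
examples = {
    "Seattle": (3.8, 61 / FEET_PER_MILE),
    "Daytona": (4.4, 1.9),
}

reductions = {
    place: percent_reduction(before, after)
    for place, (before, after) in examples.items()
}

for place, pct in reductions.items():
    print(f"{place}: {pct:.1f}% reduction")
# Seattle: 99.7% reduction
# Daytona: 56.8% reduction
```

Two data points show a gap, but they say nothing about whether the gap is typical. That is the reason to want the full table rather than two hand-picked rows.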

Let's turn the focus towards probably the worst error in their argument. When measuring a phenomenon, definitions and measurement are critical capabilities. We are seeing the failure of good definitions now worldwide. Do we define Covid-19 deaths as only those with no other contributing cause? If we include contributing causes, do we count as a Covid-19 death all deaths associated with an underlying disease or illness? Or only those where Covid-19 was the precipitating event to death? Do we count deaths only in hospitals or those which occur elsewhere? We are seeing dramatically different descriptions of the course of Covid-19 simply due to the variance in definitions such as these.

In this NYT argument, they are using transportation distances as a proxy for social interaction and then social interaction as a proxy for risk of death. This is a weak proxy at best. What we really need is an index which captures the degree of risk/exposure. A person who goes from their suburban home in their car to their stand-alone office five miles away has far less exposure to Covid-19 than does the New Yorker going from her apartment on a half-mile subway ride to her office building. If you accept the NYT premise, the suburban car-driver poses a far greater risk of transmitting the disease than does the New York subway rider. Of course that is nonsense.

The seven journalists acknowledge this fatal flaw and then dance on by it.
Not all travel is problematic: A person driving for a few miles to pick up groceries would not be violating stay-at-home orders. And people in cities can infect others without traveling far.
They need an exposure index which takes into account distance, travel time, density of exposure during travel (car versus train), volume of exposure (a half mile trip on a subway and then a bus is worse than a single half mile subway ride), probability of exposure to an infected person (number of Covid-19 cases in the geographic vicinity being traveled), etc.
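As a rough illustration of what such an index might look like, here is a toy sketch. Everything in it is an assumption of mine: the `Leg` fields, the formula (person-minutes of close contact scaled by local prevalence), and all the input numbers are made up for illustration. It also leaves distance out as a separate term, on the assumption that distance matters mainly through the time spent near other people.

```python
from dataclasses import dataclass

@dataclass
class Leg:
    minutes: float        # time spent on this leg of the trip
    people_nearby: float  # assumed average number of other people in close range

def exposure_index(legs, local_prevalence):
    """Toy index: person-minutes of close contact, scaled by local prevalence.

    local_prevalence is an assumed fraction of people nearby who are infectious.
    """
    contact_minutes = sum(leg.minutes * leg.people_nearby for leg in legs)
    return local_prevalence * contact_minutes

# Hypothetical inputs, for illustration only.
car_commute = [Leg(minutes=15, people_nearby=0)]       # five miles alone in a car
subway_commute = [Leg(minutes=10, people_nearby=30)]   # half a mile on a train
subway_then_bus = subway_commute + [Leg(minutes=10, people_nearby=20)]

car = exposure_index(car_commute, local_prevalence=0.001)
subway = exposure_index(subway_commute, local_prevalence=0.01)
two_legs = exposure_index(subway_then_bus, local_prevalence=0.01)

print(car, subway, two_legs)
```

Under these made-up inputs the five-mile car commute scores zero and the half-mile subway ride does not, which is the opposite of the ranking that raw distance gives.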

By using distance traveled as a proxy for risk of exposure, they are making a mockery of the cartographic data analysis.

This leads to the third weakness. If you limit commuting, will you limit the spread of Covid-19? Yes, probably, but it is going to differ substantially by context. It is not the commuting that is going to kill you, it is the nature of the exposure you have during the commute. If I travel ten miles once a week to my local 20,000 square foot grocery store in my county where there have been no recorded Covid-19 cases, I am at far less risk than the person in the city who, because of limited storage space, makes three to five trips to the corner 1,000 square foot store during the week in a city with a high incidence rate of Covid-19.

Their definitions and measures are just not up to making the argument they want to make. But the assumptions they make to justify using commuting distance as a proxy for exposure sure do make the argument that 1) urban-based journalists don't understand the rest of America, 2) one policy size does not fit all, and 3) in a large, diverse, federal system, local risks, costs, benefits and viable policies are going to vary enormously.

The NYT journalists look at bad measures and appear to believe they see differences that reflect their own regional prejudices.

Which is too bad. I am a big fan of good data analysis and good data visualization. But that is not what we have here.

But that is not quite why this caught my eye. In fact, glancing at it, I saw the nature of the problem of their analysis and decided it did not warrant reading. Which turned out to be true. But others did read it. It quickly became an example of the possibilities of social thinking/social communication.

This is actually an interesting epistemic challenge. If you aren't going to use commuting distance as the proxy, what can you use? What data set is adequate?

Ann Althouse read the article and the comments, and points out one reader's good-faith effort to address the challenge:
I'm a Fed, but live in rural Oklahoma. I normally travel 33 miles each way to work in Oklahoma City M-F, but we're on telework order, so that part is down to zero. However, most of you can't imagine the distances we have out here. The nearest decent grocery store is 16 miles away, and a Walmart is about 22. So, if we go out food shopping once every two weeks, that could be 44 miles for that alone. The maps therefore are missing once critical element (which is admittedly VERY hard to compute), and that would be "Average Essential Travel Distance". That should be the denominator in a ratio, with the numerator being "Average Miles Traveled". To be fair, I do statistics for a living, and I wouldn't even begin to know how to estimate that denominator, other than to ask a sample of individuals to take a guess at it.
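The commenter's ratio is easy to write down even though the denominator is hard to estimate. A minimal sketch follows, in which the function name is mine. The rural essential figure is derived from the comment (one 44-mile food trip every two weeks), and every other number is hypothetical.

```python
def travel_ratio(avg_miles_traveled, avg_essential_miles):
    """Average miles traveled divided by average essential travel distance."""
    if avg_essential_miles <= 0:
        raise ValueError("essential travel distance must be positive")
    return avg_miles_traveled / avg_essential_miles

# From the comment: a 44-mile round trip for food once every 14 days.
rural_essential_per_day = 44 / 14

# Hypothetical figures, not taken from the article or the comment.
rural_observed_per_day = 4.0
city_essential_per_day = 0.5
city_observed_per_day = 1.0

rural_ratio = travel_ratio(rural_observed_per_day, rural_essential_per_day)
city_ratio = travel_ratio(city_observed_per_day, city_essential_per_day)

print(f"rural: {rural_ratio:.2f}, city: {city_ratio:.2f}")
```

With these invented inputs the rural resident covers four times the daily mileage of the city resident yet has the lower ratio, which is the distinction a map of raw distance cannot show.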
Then there is this effort which implicitly suggests the correlation between travel distances and store distances.

I agree that social distancing is a good policy. I agree that we ought to use good data, good analysis, and good data visualization to make good decisions.

And I applaud the NYT for their willingness to invest the journalist time and presumably money for the purchase of the data set and/or related services. But once they realized they were conducting bad analysis, they should have pulled the plug.

Instead they went to press with a flawed analysis which serves to reinforce the impression of the mainstream media as smug, ignorant, and arrogant with ill-founded prejudices against anyone who is not like themselves.

And that is not useful.
