Tuesday, June 5, 2012

How to assess a piece of writing, especially outside one's expertise

This will be a long post and the genesis needs some explaining. I belong to a listserv of individuals, probably some three or four hundred, who are interested in the field of children’s literature – what is good reading, what is good to read, etc. The members are mostly authors, academics, literary critics, librarians, and publishers, though with a smattering from other areas of specialty. Basically these are reasonably knowledgeable and certainly educated individuals.

Recently, one of the members forwarded a link to an article, commenting to the effect that she thought it was a good article worth reading. I read it and was rather appalled: poor logic, faulty assumptions, self-serving claims, bad data. There was so much wrong with it as a piece of writing and as an argument. I dismissed the recommendation as an oversight on the part of the person who forwarded it; perhaps she only read the first paragraph and sent it on. Then two or three others responded on the list thanking the forwarder for bringing this good article to their attention. It produced something of a cognitive dissonance. Could these people be reading the same article? What might explain the chasm between the praise and the article's poor quality?

Which set me to thinking: why did I think it was so poor an article? Would it be possible to put some parameters around that opinion and make it somewhat objective? In doing so, would it be possible to create a tool that would allow a reasonably easy but sensible assessment of the quality of a piece of writing, particularly one outside one's own realm of expertise?

In the past ten years, to our benefit, we have been deluged with data and opinions, in greater volume and with greater accessibility than ever before. For the older generation, accustomed to expending great time and effort to unearth useful information, and therefore already possessing the habits of skepticism, of prioritizing effort, and of weighing the value of data against the effort required to acquire it, this is a time of blessings. We can now so much more easily find and use that which used to be so difficult to obtain.

But what about those coming along behind, the ones who have not yet acquired those habits of skepticism, prioritization, and trade-off decisions about effort and return? For them, there is more than enough data at their fingertips, all equally credentialed by its presence on a computer screen. Coming without an established domain expertise or body of broad knowledge and experience, how are they to know which sources are to be trusted, how to identify inherent contradictions, how to recognize a well-formed argument as distinct from an emotional assertion? The floodgates are open and we are now deluged not just with useful information but also with a tidal wave of cognitive pollution.

The obvious bulwarks against a flood of cognitive pollution are the traditional, widely praised skills of critical thinking and logic (with statistics as a less traditional skill but one of greater pertinence in an increasingly complex data environment). Widely praised, but rarely practiced and even more rarely taught. Who has had a course in statistics, logic, and critical thinking? I regard myself as well educated, but I had no exposure to logic or statistics until university, and while I have heard much praise of critical thinking, I am not sure I have ever seen it offered as a course. These ought to be fundamental skills acquired in high school.

My effort in this essay is to lay out a means to usefully assess the value of a piece of writing: is it worth spending time reading, using, or disseminating? I will secondarily use the original article to explore the usefulness of the model. My hope is that the assessment tool can be useful not only for screening out cognitive pollution but also for redirecting our attention to real problems and real arguments that can be fruitfully disputed. As rich as our knowledge environment has become, there are still very real limits, and frequently the most critical need is an understanding of exactly where the frontier of knowledge lies. Too often "facts" are accepted as true that are not; equally often, data is regarded with unwarranted skepticism (usually because the data does not comport with some aspect of our worldview rather than because the data is wrong).

The dynamics of world integration, connectivity, trade (in commerce and ideas), competition, and freedom have created such great wealth and opportunity in the past five hundred years, and especially the past fifty, that it is easy to become unmoored. Opportunities, even in the current turbulence, are so manifold that we can lose sight of what has been accomplished. With increasing prosperity have come huge increases in funding for education and for research. Pick even the most obscure specialization and there is some group of tireless experts exploring the field, advancing the frontiers of knowledge. The progress has been so great for such a long period of time that it is easy to assume that this is just the nature of reality and not recognize how unusual this experience is in the 100,000-year existence of modern man.

We take for granted that if there is an agreed upon problem, then there must be a solution to that problem. A solution that we can know. We have succumbed to hubris and fail to acknowledge that, for all our good intentions and for all our effort, there are very real limits to our knowledge. (See How Reliable are the Social Sciences? by Gary Gutting for a discussion of the limits of knowledge. See also A Sharp Rise in Retractions Prompts Calls for Reform by Carl Zimmer for a discussion of faulty research. See also Why Most Published Research Findings Are False by John P. A. Ioannidis. See also Lies, Damned Lies, and Medical Science by David H. Freedman).

In thinking this through, one of the first issues I addressed was the recognition that there is a continuum of purposes in writing. I grouped these roughly as: Entertainment, Essay (connecting ideas), Opinion (a statement of beliefs, usually with some logical consistency), Advocacy (opinion plus some one-sided, data-based support), and Argument (a thorough exploration of all sides of an opinion). If I write a column in which I am preaching to the choir, simply affirming pre-existing views but doing so in a fashion that entertains those who are reading, should I be held to the same standard as someone legitimately exploring an issue and trying to find a consensus? I think the answer is clearly no. Writing intended to entertain is something of a different beast. It may inform indirectly and subtly, but it is essentially an act of affirmation: in order to be entertaining, it requires that the reader share some basic values and assumptions. All the other forms are related to truth telling or, more accurately, truth discovery. What is out there? Is it real? Is it important? Is it worthwhile?

The upshot is that the mechanism I will advance for assessing a piece of writing is really only pertinent to writing that is intended to advance an idea or to make an argument. We can narrow it down a little further still. We aren't really looking at reference documents or didactic transmissions of information. We are looking at writing that seeks to establish an idea that might be useful, reliable, and non-obvious.

What I have done is create five overall categories by which an article might be assessed and, within each category, identify five elements on which to base the assessment. Each element is scored on a scale of 0-2: 0 indicates that the element is missing or demonstrably flawed; 1 indicates that there is some evidence for the element; 2 indicates that the element is clearly present. The result is a summary judgment of an article on an aggregate scale of 50 points. There is useful comparative information in the absolute number as well as in the root issues: where points are lost indicates specific weaknesses in the structure of the article.

This scoring system is not intended to be demonstrably objective; all it seeks to do is provide a guideline for assessment. There is much that is subjective about it. Nonetheless, I suspect it will be useful.

The five categories are 1) Integrity of the argument, 2) Quality of original research, 3) Quality of data, 4) Refutability, and 5) Eminence of authority.
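For readers who prefer to see the mechanics concretely, here is a minimal sketch in Python of how the tally might be encoded. The category and element names are taken from the sections that follow; the data structure and the score_article function are simply one illustrative way to represent them, not part of the rubric itself.

    # A minimal sketch of the 50-point rubric: five categories, five elements each,
    # every element scored 0 (missing or flawed), 1 (some evidence), or 2 (clearly present).
    RUBRIC = {
        "Integrity of the argument": [
            "Clarity and structure", "Logical consistency", "Relevant support",
            "Accuracy and completeness of data", "Assumptions and definitions"],
        "Quality of research": [
            "Sample size", "Representativeness of the sample", "Replication",
            "Testable hypothesis", "Counterfactuals"],
        "Quality of data": [
            "Completeness", "Frequency of citation", "Transparency of collection",
            "Independence of source", "Reputation of source"],
        "Refutability": [
            "Measured phenomenon", "Critical path identification",
            "Inferences made explicit", "Concrete forecast",
            "Useful, reliable and non-obvious"],
        "Eminence of authority": [
            "Affiliated institutions", "Duration in domain", "Domain relevancy",
            "Autonomy", "Awards and citations"],
    }

    def score_article(ratings):
        """Sum the 0-2 ratings over all 25 elements; the maximum possible score is 50."""
        total = 0
        for category, elements in RUBRIC.items():
            for element in elements:
                rating = ratings.get(category, {}).get(element, 0)  # unrated counts as 0
                if rating not in (0, 1, 2):
                    raise ValueError("each element must be rated 0, 1, or 2")
                total += rating
        return total

An article rated 2 on every element scores the full 50; one rated 0 throughout scores 0.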

Integrity of the Argument

Integrity of the argument attempts to put some parameters on the whole proposition. If you are time constrained, you can stop after completing just this one assessment. There are five elements used to judge the integrity of the argument.
• Clarity and structure of the argument – Is it clear what the author intends to argue? Are there well marked linkages from one idea to another in the construction of the argument? Ideally you have a clear and explicit statement of the proposition or argument as well as an articulation of how the argument will be supported. It is not uncommon though to have to infer what the author intends the argument to be as well as to make assumptions about missing steps in the construction of the argument.
• Logical consistency – Is there an internal consistency to the argument? Are there instances where the acceptance of one piece of information contradicts another? Usually this manifests itself as the acceptance of a number of predicates, each of which is desirable but some of which contradict one another. For example, we might all accept that in raising a child it is important that they be kept safe and healthy but also that they be allowed the freedom to make their own decisions. On the face of it, it is not logically consistent that a child making their own decisions will also always be kept safe and healthy.
• Relevant support – Is there evidence to support the assertions made? This has two elements: first, whether any evidence is made available at all, and second, whether that evidence is relevant to the argument being made (support that does not bear on the conclusion is broadly known as a non sequitur).
• Accuracy and completeness of data – Is the data that is advanced accurate? Again there are two elements. For issues at the frontier of research, there may be no existing body of data, and new research is being conducted to create the supporting data for the first time; this is addressed in the Quality of Research category. Alternatively, there may already be much data, but whether it is being interpreted correctly, or whether it is even pertinent, may be problematic. Quite frequently the data used is that which is available rather than that which is pertinent. Data inferred from a collection of anecdotes is less accurate and complete than data collected with a rigorous methodology, reliably, over long periods of time, by practiced individuals.
• Assumptions and definitions – Are the critical assumptions and terms defined clearly? Both are common sources of weakness in an argument: critical assumptions go unidentified, or terms are used in non-standard fashion. For example, “Seatbelt laws have led to a decline in vehicular deaths” assumes that we are measuring deaths inside the vehicle and not vehicular deaths at large. Seatbelt laws have in some countries led to declines in driver and passenger deaths alongside an increase in pedestrian deaths (see the Peltzman Effect, Moral Hazard, Risk Homeostasis, and Risk Compensation for related discussions).
An article which has a clearly stated argument and a clear structure for supporting it, which is logically consistent, which marshals relevant, complete, and accurate data, and which is careful to identify key assumptions and define obscure terms would be rated a 10. An article in which many inferences have to be made as to what the author is arguing, which advances no data or uses irrelevant or inaccurate data, which makes critical but unidentified assumptions, and which fails to define terms used in non-traditional fashion would be rated a 0.

It is a truth dictated by constraints that a short article covering a complex issue may not be able to meet many, or even any, of these criteria. It is not uncommon for newspapers or magazines to commission an article on a topic but provide too little scope for it to be adequately addressed. That may mitigate the opprobrium directed towards the author, but it remains true that such an article is of little relevance or use to anyone: if you already accept the author's premises and conclusions, it adds nothing; if you are dubious, the absence of these attributes means your doubts will not be addressed.


Quality of Research

Some articles are presenting original research or depend on original research to make their case. In these instances, where knowledge is new, there is a critical need for being able to assess the likely reliability of the research. This assessment is especially critical when the topic is outside of one’s domain of knowledge. I have a deep knowledge of economics. My knowledge of biochemistry is comparatively shallow. If someone is making an argument based on some original biochemical research, how much credence ought I to lend to that research?

There are five elements that allow an individual to informally assess the credibility of research conducted outside one's own domain of knowledge.
• Sample size – How many participants are there in the study? There are narrow circumstances where one can get away with small numbers, but in general, the more analysis you are going to do and the broader the inferences you are going to draw, the larger the number of participants or data points needs to be. The social sciences are notorious for sweeping conclusions drawn from sample sizes of 40 or fewer; such studies are close to meaningless and at best indicative. A variant on sample size is duration: a one-off study is of much lesser value than a study which examines large numbers of people over long periods of time.
• Representativeness of the sample – How reflective are the participants of the necessary attributes of the population about which one wishes to draw conclusions? Again, it is notorious that much social research trumpeted in the headlines rests on some small number of undergraduate university students. In practical terms, a sample of healthy, upper-middle-class 20-year-olds is representative only of healthy, upper-middle-class 20-year-olds.
• Research has been replicated – Have the original results been duplicated by the original team in a second test, or replicated by independent third parties? As the citations above suggest, a disturbing share of published results are later withdrawn or found to be in error. Until research has been widely replicated, it is of dubious value.
• Does the research lend itself to a testable hypothesis? – Research that does not lend itself to a testable hypothesis, or more specifically does not make a specific prediction that can be tested, is only useful at an abstract level. An article that makes no prediction at all is close to useless. One that makes only a general or binary prediction (e.g., it will rain tomorrow) is of some, but limited, value. An article that makes a measured and specific prediction (e.g., it will rain ten inches tomorrow afternoon) reflects much greater testability and hence potential reliability.
• Counterfactuals – Does the research recognize, identify and address counterfactuals? Everyone is subject to confirmation bias. The trick to supporting a robust argument is to be able to play devil’s advocate and address the counterfactuals and alternative hypotheses. If the original research makes no attempt to identify and address counterfactuals, there is a reasonable probability that the argument is inherently weaker than it might otherwise be.
Research which is based on large, representative samples, which is longitudinal in nature, whose key results have been replicated, which addresses known counterfactuals, and which lends itself to specific and precise forecasts would be rated a 10. In contrast, studies of small groups of unrepresentative individuals done on a snapshot basis, which have not been replicated, which do not lend themselves to any sort of useful prediction, and which do not address known counterfactuals should be rated close to 0.


Quality of Data

Many arguments or essays rely not so much on original research as on already existing data, though frequently with a different interpretation. Unless the data sources are familiar, it is challenging to assess whether the data is being over-manipulated or presented in a misleading fashion. Just as in personal interactions, often the greater sin is that of omission rather than commission, particularly regarding context. It is one thing to be told that shark attacks resulting in death have doubled from five years ago; it is another to understand that there was one death five years ago and two this year. There are two issues regarding quality of data, then: how good the data itself is (reliably collected and presented) and how well the data is used to support conclusions (completeness and context).

There are five elements for assessing the quality of data sources and uses. In order to be able to do this quickly, it is necessary to introduce an even greater degree of subjectivity than usual as will be evident from the discussions below.
• Completeness – Are all the key points supported by reputable sources of data? For example, if one is making the argument that poverty is a significant cause of crime, then one cannot simply show that poverty has increased; one has to show that with the increase in poverty there was also an increase in crime, and that there were no other plausible sources of a crime increase. In the US, crime has been on a steady decline for the past twenty years; in the past three years, the number of people living in poverty has increased even as crime has continued to fall. Even if crime had increased, we would still need to show that no other factors were in play. For example, the percentage of foreign-born residents has also steadily risen over recent decades. Is that simply a correlation or is there a causative link? It has to be ruled out in order to indict poverty. (See Crime and the Great Recession by James Q. Wilson for a discussion of the complex mix of causes of crime.)
• Frequency of citation – Is the data set widely cited? This is a proxy for reliability: if a data set is widely used and relied upon, it is likely to have fewer unknown errors, or more accurately, its errors are likely to have already been winnowed out. This is not a foolproof assumption, but in general, the more widely cited the data set, the more reliable it is likely to be.
• Transparency of collection – How rigorous and stable is the data collection methodology, how transparent is it and how publicly available is it? Ideally, there is a clear and stable methodology of data collection and the results are publicly available.
• Independence of source – Is the source of the data independent of external influences? As Donne noted, no man is an island and we are all subject to pressures both subtle and not. The most obvious sources of lack of independence are when the data originates from a source which depends on the data (or its interpretation) for its source of income or when the data is collected by an advocacy group. The ideal is that the data is collected and interpreted by parties who have no stake (financial or reputational) in the data showing one outcome or another.
• Reputation of source – What is the reputation of the individual providing or analyzing the data? This can be difficult to determine and is not infallible (See Fraud Scandal Fuels Debate Over Practices of Social Psychology by Christopher Shea for discussion of fraud by an eminent practitioner in the field of psychology. Also see Why Experts Get It Wrong by James Warren.) However, the more one’s reputation is on the line, the more likely in general one is to hew to the straight and narrow. If the source of the data has an established and good reputation, you are more inclined to rely on the data.
Arguments which are well supported by copious (but pertinent) data from trusted sources, using public methods, with no clear stake in the outcome, and which are frequently cited by others in the field would be rated a 10. Data originating from unknown sources with no established reputation, not relied upon by others in the field, likely to be rewarded should the data reveal one outcome over another, and with unpublished methodology or unavailable source data would be rated a 0.


Refutability

The gold standard of any argument is the degree to which it can make useful, reliable, and non-obvious predictions which can be either refuted or confirmed. It is not that the absence of refutability destroys an argument, merely that it makes the argument more tenuous. Everyone hesitates to make predictions, particularly concrete and specific ones; nobody wishes to set themselves up for failure. The willingness to do so is a measure of the author's confidence (which may be misplaced) in the integrity of their argument. The classic example of refutability is the famous Simon-Ehrlich Wager. Paul Ehrlich is famous for writing The Population Bomb in 1968, which made numerous specific forecasts, virtually all of which have turned out to be wrong.

In making an argument, it is usually the case that the author is more focused on establishing the truth of their argument than they are in the capacity of that truth to generate a testable prediction. Frequently, one has to make an inference based on the author’s argument in order to make a prediction. For example, an economist might make the case that increased gas drilling will be good for the national economy and leave it at that. Using basic economic principles and ceteris paribus, it is fair to infer that gas prices should come down in the future (increased drilling leads to increased production leads to increased supply leads to lower prices). The point of the economist’s article may not be to forecast gas prices but it is a natural inference of the economist’s argument that gas prices should go down. If they do not, it calls into question whether in fact increased drilling is good for the economy. The key point is that sometimes the reader has to fill in the author’s absent forecast on a good faith basis.

There are five elements for assessing refutability.
• Measured phenomenon – Does the author define and measure the pertinent outcomes or variables with any specificity? It is one thing to talk about America publishing many new books each year. It is quite another to report that there were 87,000 new titles published in 2011. What are the key variables and how are they measured?
• Critical path identification – Which critical assumptions or conclusions have to be true for the overall argument to be true and has the author identified those and provided the necessary data?
• Inferences made explicit – Is it clear what the author is concluding from the data they are presenting? The classic example is the Butterfield Paradox, named after a journalist given to writing articles about the supposed paradox of crime falling while the prison population was rising. From these articles it was clear that the journalist was blind to the possibility that increased incarceration was leading to decreased crime, as opposed to his unspoken inference that reduced crime must mean fewer criminals needing to be locked up.
• Concrete forecast – Does the author make an explicit and detailed forecast? Predicting that gas production will increase is marginally useful. Predicting that it will increase by X% in Y time period is actually useful.
• The forecast is useful, reliable and non-obvious – Does the forecast have all three of these attributes?
Arguments that contain explicit and specific forecasts, in measured terms, for each of the critical elements of the argument, and which can be expected to be affirmed or refuted, score highly. Arguments that offer no clear forecast, fail to measure outcomes or mechanisms, and leave it to the reader to make the critical inferences and to determine which elements of the argument are most crucial would be rated 0.


Eminence of Authority

To some extent this is covered in some of the elements above. It is included as an independent category because the logical fallacy of appeal to authority is so frequently used, and indeed misused. (See Untrue: 1 out of Every 10 Wall Street Employees is a Psychopath by John M. Grohol for a fairly typical example of misuse.) When a writer is tightly constrained in the number of words in an article, it is often far easier to allude to an authority than to spell out the details. The challenge is that often the author has not understood what the authority actually said, or misuses the authority; the article cited above is an example where journalists simply did not understand the original study and misreported its findings. The other common abuse is to cite a recognized authority in one field on a topic in a second field in which they are not an authority. A fictitious example of misappropriated authority would be a claim that you ought to drive a Ford because that was Einstein's preferred car. Unless you believe that Einstein was also a preeminent mechanic, the fact that he was the leading thinker in physics does not make his choices in other fields equally authoritative.

There are five elements that contribute to an assessment of authority. Again, as with Quality of Data, there is an element of subjectivity in these elements and the guidelines are definitely more indicative than infallible.
• Affiliated institutions – Where did the authority receive their education, and where do they teach or research? The more competitive and prestigious the institutions, the more likely the individual might be considered an authority. This is a weak predictor: the bulk of innovative thinkers come out of more mundane institutions, and there is a plethora of domain-knowledge leaders outside the small number of world-ranked research institutions. That said, the preeminent institutions do function as a filter for excellence.
• Duration in domain – How long has the person of authority functioned in their particular domain of expertise? With a handful of exceptions (in specific fields) in general there is a correlation between duration and competency/expertise.
• Domain relevancy – Is the authority being quoted really a leading expert in the domain of knowledge pertinent to the issue on which they are being quoted?
• Autonomy – Is the authority shielded from undue influence?
• Awards and citations – Have the individual's contributions been recognized in some fashion by their peers, usually in the form of awards and citations?
Referencing the opinion of an expert on an issue in the domain in which they are expert is a perfectly legitimate tactic in making an argument. Referencing individuals who do not hold a comparable level of competitively achieved expertise is of low value.


Summary

In each of the categories, each element can be rated as 0, 1, or 2. The total potential score across all five categories is 50 points. The aggregate score allows an assessment of the relative value of the article or essay in terms of advancing knowledge.
• 0-5 points – Dog's breakfast. Incoherent, unsupported, and/or ill-supported. Cognitive pollution.
• 6-10 points – Negligible value.
• 11-20 points – Weak value. If it is primarily an essay (an exploration of connected ideas rather than the advancing of an argument), it might still have value.
• 21-30 points – Respectable value. The legitimacy of the argument is established but not necessarily its validity.
• 31-40 points – Strong value. The case for the argument is good but not complete or impervious.
• 41-50 points – Robust. The case is well established and likely to be true given all that is currently known.
As described at the beginning, this approach also allows one to pinpoint the weak points in an article: perhaps the author has not been clear, has not provided enough data, or has not made clear why the data is pertinent. It should serve as a catalyst for improving an argument.
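Continuing the earlier sketch (again, simply an illustration; the verdict function and its name are my own invention, not part of the rubric), the aggregate score maps onto these bands as follows:

    def verdict(total):
        """Map an aggregate 0-50 score onto the summary bands above."""
        if not 0 <= total <= 50:
            raise ValueError("score must be between 0 and 50")
        bands = [(5, "Dog's breakfast"), (10, "Negligible value"),
                 (20, "Weak value"), (30, "Respectable value"),
                 (40, "Strong value"), (50, "Robust")]
        for upper, label in bands:
            if total <= upper:
                return label

So, for example, verdict(score_article(ratings)) yields the summary judgment in a single step.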

Having created this from whole cloth, in the next post I will use it to produce a semi-objective assessment of the original article which served as the catalyst for this train of thought. In doing so, it should also become clear how that article might have been better argued, and even whether there is a different argument to be made entirely.
