The singular of data is anecdote

Amy Perfors

This post started off as little more than some amusing wordplay brought on by the truism that "the plural of anecdote is not data". It's a sensible admonition -- you can't just exchange anecdotes and feel like that's the equivalent of actual scientific data -- but, like many truisms, it's not necessarily true. After all, the singular of data is anecdote: every individual datapoint in a scientific study constitutes an anecdote (though admittedly probably a quite boring one, depending on the nature of your study). A better truism would therefore be more like "the plural of anecdote is probably not data", which of course isn't nearly as catchy.

The post started that way, but then I got to thinking about it more and I realized that the attitude embodied by "the plural of anecdote is not data" -- while a necessary corrective in our culture, where people far more often go too far in the other direction -- isn't very useful, either.

A very important caveat first: I think it's an admirable goal -- definitely for scientists in their professional lives, but also for everyone in our personal lives -- to try, as far as possible, to make choices and draw conclusions informed not by personal anecdote(s) but rather by what "the data" shows. Anecdote is notoriously unreliable; it's distorted by context and memory; because it's emotionally fraught it's all too easy to weight anecdotes that resonate with our experience more highly and discount those that don't; and, of course, the process of anecdote collection is hardly systematic or representative. For all of those reasons, it's my natural temptation to distrust "reasoning by anecdote", and I think that's a very good suspicion to hone.

But... but. It would be too easy to conclude that anecdotes should be discounted entirely, or that there is no difference between anecdotes of different sorts, and that's not the case. The main thing that turns an anecdote into data is the sampling process: if attention is paid to ensuring not only that the source of the data is representative, but also that the process of data collection hasn't greatly skewed the results in some way, then it is more like data than anecdote. (There are other criteria, of course, but I think that's a main one).

That means, though, that some anecdotes are better than others. One person's anecdote about an incredibly rare situation should properly be discounted more than 1000 anecdotes from people drawn from an array of backgrounds (unless, of course, one wants to learn about that very rare situation); likewise, a collection of stories taken from the comments of a highly partisan blog where disagreement is immediately deleted -- even if there are 1000 of them -- should be discounted more than, say, a focus group of 100 people carefully chosen to be representative, led by a trained moderator.
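
A toy simulation makes the comparison concrete. This is only a sketch under made-up assumptions: a hypothetical population in which roughly half of the people hold some opinion, a blog that deletes every dissenting comment, and a focus group idealized as a simple random sample. The names `estimate_share` and `simulate` are invented for illustration.

```python
import random


def estimate_share(sample):
    """Fraction of a sample that holds the opinion (True)."""
    return sum(sample) / len(sample)


def simulate(seed=0, true_share=0.5, population_size=100_000):
    rng = random.Random(seed)

    # Hypothetical population: True means the person holds the opinion.
    population = [rng.random() < true_share for _ in range(population_size)]

    # 1000 "anecdotes" from a blog that deletes all disagreement:
    # only people who hold the opinion ever make it into the sample.
    surviving_commenters = [person for person in population if person]
    partisan = rng.sample(surviving_commenters, 1000)

    # 100 people drawn uniformly at random from the whole population
    # (an idealized stand-in for a carefully chosen focus group).
    representative = rng.sample(population, 100)

    actual = estimate_share(population)
    return actual, estimate_share(partisan), estimate_share(representative)


actual, partisan_est, representative_est = simulate()
print(f"actual share in population:      {actual:.2f}")
print(f"estimate from 1000 blog comments: {partisan_est:.2f}")
print(f"estimate from 100 random people:  {representative_est:.2f}")
```

Under these assumptions the 1000 blog comments always report unanimous agreement, however large the pile grows, while the 100 randomly drawn people land near the true share. The sample size matters less than how the sample was collected.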

I feel like I'm sort of belaboring the obvious, but I think it's also easy for "the obvious" to be forgotten (or ignored, or discounted) if its opposite is repeated enough.

Also, I think the tension between the "focus on data only" philosophy on one hand, and "be informed by anecdote" philosophy on the other, is a deep and interesting one: in my opinion, it is one of the main meta-issues in cognitive science, and of course comes up all the time in other areas (politics and policy, personal decision-making, stereotyping, etc). The main reason it's an issue, of course, is that we don't have data about most things -- either because the question simply hasn't been studied scientifically, or because it has but in an effort to "be scientific" the sample has been restricted enough that it's hard to know how well one can generalize beyond it. For a long time most studies in medicine used white men only as subjects; what then should one infer regarding women, or other genders? One is caught between the Scylla of using possibly inappropriate data, and the Charybdis of not using any data at all. Of course in the long term one should go out and get more data, but life can't wait for "the long term." Furthermore, if one is going to be absolutely insistent on a rigid reliance on appropriate data, there is the reductive problem that, strictly speaking, a dataset never allows you to logically draw a conclusion about anything other than itself. Unless it is the entire population, it will always be different than the population; the real question comes in deciding whether it is too different -- and as far as I can tell, aside from a few simple metrics, that decision is at least as much art as science (and is itself made partly on the basis of anecdote).

Another example, one I'm intimately familiar with, is the constant tension in psychology between ecological and external validity on the one hand, and proper scientific methodology on the other. Too often, increasing one means sacrificing the other: if you're interested in categorization, for instance, you can try to control for every possible factor by limiting your subjects to undergrad students in the same major, testing everyone in the same blank room at the same time of day, creating stimuli consisting of geometric figures with a clear number of equally-salient features, randomizing the order of presentation, etc. You can't be completely sure you've removed all possible confounds, but you've done a pretty good job. The problem is that what you're studying is now so unlike the categorization we do every day -- which is flexible, context-sensitive, influenced by many factors of the situation and ourselves, and about things that are not anything like abstract geometric pictures (unless you work in a modern art museum, I suppose) -- that it's hard to know how it applies. Every cognitive scientist I know is aware of this tension, and in my opinion the best science occurs right on the tightrope -- not at the extremes.

That's why I think it's worth pointing out why the extreme -- even the extreme I tend to err on -- is best avoided, even if it seems obvious.

Posted by Amy Perfors at March 28, 2007 10:06 AM