28 September 2007
I have earlier written about using anchoring vignettes to correct for biases in self-reported measures such as health outcomes (here and here). One issue with self-reports is that respondents may interpret identical questions in different ways. The idea of vignettes is to use controlled scenarios to measure this bias and adjust the self-reports accordingly, so that they become informative about actual health status.
An interesting application of this method is a paper by D'Uva et al (2006 working paper here), who use vignettes from the World Health Surveys to identify and correct reporting heterogeneity in Indonesia, China and India. Their objective is to establish whether the reporting differences affect measures of within-country inequality in several health domains (mobility, self-care, etc.). They find evidence for reporting heterogeneity but also suggest that the bias is not large in their data.
The paper also discusses in more detail two assumptions underlying the vignette method, "response consistency" and "vignette equivalence" (also discussed in King et al 2004).
"Response consistency" requires that respondents assess their own health in the same way that they assess other people's health (i.e., the vignette scenarios). This may fail if there is strategic reporting, for example when one's reported health status could provide access to entitlement programs for which other people's health status is irrelevant. "Vignette equivalence" essentially requires that the scenarios in the vignettes are perceived similarly across respondents; no systematic differences are allowed. The authors suggest that a failure of this latter assumption may underlie their age-related findings, which contrast with those of other studies: elderly people might interpret the vignette scenarios differently since they are more likely to have personal experience with the described health problems.
I am curious whether these assumptions have been tested in detail. This might also stimulate some thinking about what elements of self-reports we want to correct for, and whether the determinants of reporting biases are of their own interest.
27 September 2007
Not to take anything away from David Lazer's presentation today at the Applied Stats workshop, but the star of his talk was the data. The crowd favorite appeared to be a dataset of all cell phone transactions over a several-week period for 7,000,000 subscribers somewhere in Europe (he wouldn't say where). David and his colleagues have built a graph of interpersonal connections based on the call data, and are trying to answer questions like, "How many degrees of separation are there between two randomly selected people in the network?" (Answer: 13.) But to me an even more compelling question came up in the Q&A session: where do you get data like this?
David's answer was basically that you need to know the right people; it sounded as if he or one of his colleagues knew key executives at the phone company who were able to provide the call records. Lee Fleming offered that grad students might find their way to data like this by getting to know scholars like David who have access to it. (How many degrees of separation are there between you and your dream dataset?)
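For those curious about the mechanics rather than the connections: "degrees of separation" is just the shortest-path distance between two people in the call graph, computable with a breadth-first search. A minimal sketch (on a made-up toy call log, since the real records are confidential; none of this is from David's actual code):

```python
from collections import defaultdict, deque
from itertools import combinations

def build_graph(calls):
    """Build an undirected graph from (caller, callee) call records."""
    g = defaultdict(set)
    for a, b in calls:
        g[a].add(b)
        g[b].add(a)
    return g

def degrees_of_separation(g, start, end):
    """Breadth-first search: length of the shortest path, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == end:
            return dist
        for neighbor in g[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Hypothetical toy call log standing in for the (confidential) phone data
calls = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d"), ("d", "e")]
g = build_graph(calls)

# Average degrees of separation over all pairs in the toy network
dists = [degrees_of_separation(g, u, v) for u, v in combinations(g, 2)]
print(sum(dists) / len(dists))
```

On seven million subscribers one would of course need a far more careful implementation (sampling pairs rather than enumerating all of them, for a start), but the underlying computation is this one.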
But the importance of knowing cell phone execs would be the wrong takeaway from David's talk, which after all was basically about how we are all awash in data these days. Yes, to get data on cell phone calls you may need to have friends at the phone company, and yes, to get information on where a group of MIT students spends every hour of the day over a few weeks you will have to launch your own experiment (as described in David's talk today), but for those of us with fewer connections and smaller research budgets there is still an enormous amount of data out there to collect, much of it from the web. I've actually spent a fair amount of time in the past year learning how to collect data from the web, and I look forward to blogging here about web scraping and other data collection approaches in the next few months. But right now I'm going to go check whether David left any tracking devices in my bag.
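As a small taste of the web data collection mentioned above, here is a minimal scraping sketch using only the Python standard library. The URL and page structure are whatever your project needs; real scraping work usually reaches for dedicated parsing libraries, so treat this as an illustration of the idea rather than a recommended toolkit:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect every href attribute from a page's anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def scrape_links(url):
    """Fetch a page and return the list of link targets it contains."""
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

Following the collected links, and extracting tables or text instead of hrefs, is a matter of adding more `handle_starttag`/`handle_data` logic in the same pattern.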
25 September 2007
I was reminded again the other day that the word “data” is plural, since it means more than one “datum”, and thus “data” requires a plural verb. The Economist style guide says so, as does the European Union translation manual. The Oxford English Dictionary doesn’t even have an entry for “data,” subsuming it under “datum,” and it identifies sentences with singular constructions as “irregular or confused usage.”
End of story, right? Maybe, maybe not. There are a couple of problems with the “data is the plural of datum” story. (These have been discussed widely on the web, and I’m drawing freely on those discussions.) First, it is not quite right even in Latin to say that “data” is the plural of the singular count noun “datum”; both are forms of the verb dare, “to give” (datum is its past participle). Second, in English, we hardly ever refer to one piece of data as a datum; at least in political science it is an observation, a case, or perhaps a data point. When the word datum is used, it usually has a specialized meaning and takes the plural form “datums.”
The bigger problem, from my perspective, is that fully adhering to “data” as a plural count noun forces you to choose between constructions like
How many data are enough?
How much data is enough?
The first of these, “How many data are…”, is correct for a plural count noun, while the second, “How much data is…”, is appropriate for a mass noun such as “gold” or “water.” The second sentence sounds much better to me. It also wins on a Google Scholar search by a margin of 10 to 1 (2,120 to 198). There are also about 400 hits for “How much data are…”, no doubt from those who want to treat “data” as a mass noun but have been reminded that “data is plural.” It seems to me that data has come to be like the mass nouns described in this post from Language Log:
A great many M nouns denote collectivities of things, but small things, especially small things whose individual identities are not usually important to us: CORN, RICE, BARLEY, CHAFF, CONFETTI, etc. Some of these contrast minimally with C nouns of similar denotations, like BEAN, PEA, LENTIL. In any case, it would be easy to think of barley in "The barley was almost cooked" as "meaning more than one" in much the same way as lentils in "The lentils were almost cooked" does -- and in fact, every so often someone misidentifies little-thing M nouns as "plural".
I kind of like the idea of data as a collection of small things that aren’t that important to us as individual objects but that are meaningful when taken together.
So, in the end, is “data” a plural count noun or a mass noun? I would certainly prefer the latter, but at least on this side of the Atlantic it looks like it will be both. Here are some usage notes to ponder:
Data leads a life of its own quite independent of datum, of which it was originally the plural. It occurs in two constructions: as a plural noun (like earnings), taking a plural verb and plural modifiers (as these, many, a few) but not cardinal numbers, and serving as a referent for plural pronouns (as they, them); and as an abstract mass noun (like information), taking a singular verb and singular modifiers (as this, much, little), and being referred to by a singular pronoun (it). Both constructions are standard. The plural construction is more common in print, evidently because the house style of several publishers mandates it.
The word data is the plural of Latin datum, “something given,” but it is not always treated as a plural noun in English. The plural usage is still common, as this headline from the New York Times attests: “Data Are Elusive on the Homeless.” Sometimes scientists think of data as plural, as in These data do not support the conclusions. But more often scientists and researchers think of data as a singular mass entity like information, and most people now follow this in general usage. Sixty percent of the Usage Panel accepts the use of data with a singular verb and pronoun in the sentence Once the data is in, we can begin to analyze it. A still larger number, 77 percent, accepts the sentence We have very little data on the efficacy of such programs, where the quantifier very little, which is not used with similar plural nouns such as facts and results, implies that data here is indeed singular.
24 September 2007
Please join us this Wednesday (9/26) when David Lazer, Associate Professor of Public Policy and Director of the Program on Networked Governance at the Kennedy School of Government, will present "Life in the Network: The Coming Era of Computational Social Science". Professor Lazer provided the following summary of his talk:
An increasing fraction of human behavior (especially relational behavior) leaves substantial digital traces-- whether in the form of phone logs, e-mail, instant messaging, etc. Further, increased computational power allows the analysis of these digital traces-- e.g., through natural language processing, statistical analysis of massive (millions of individuals) longitudinal data, etc. These two points suggest that we are on the precipice of dramatic new insights into collective human behavior. I will discuss the potential future of a "computational social science", with reference to four ongoing research projects.
As always, our workshop begins at 12 noon in CGIS-Knafel room N-354. And a free lunch will be provided.
21 September 2007
This is a video worth watching: Hans Rosling: Debunking third-world myths
20 September 2007
I just came across this interesting article by Angus Deaton, who reflects on changing fashions in graduate work in recent years based on the recruiting for junior positions at Princeton's economics department. Princeton had eighteen candidates come to visit this year, and Deaton is impressed by "the breadth of topic that currently falls within the ambit of applied economics." While twenty years ago applied theses mostly focused on "traditional topics such as applied price theory and generally agreed-upon (preferably ‘frontier’) econometric methods", today's candidates seem to use much less theory and simpler econometrics, but work on topics as wide-ranging as HIV/AIDS in Africa, child immunization in India, political bias of newspapers, child soldiering, racial profiling, rain and leisure choices, mosquito nets, malaria, treatment for leukemia, stages of child development, special education, war and democracy, etc. He also observes a trend towards experimental methods in field settings; apparently one candidate even persuaded a Mexican city to pave a random selection of its streets.
I wonder whether other social science disciplines exhibit similar trends. In political science, it seems to me that there is still a strong focus on traditional topics and a reluctance to investigate more "exotic" (but socially important) topics because they apparently have "little to do with political science." However, one could argue that just as economics is everywhere, politics plays a role in most social phenomena. There is also still very little work using field experiments (apart from important exceptions such as here or here), and the same is true for quasi-experimental designs, which, it seems to me, are still rarely used. How about other disciplines?
17 September 2007
The applied statistics workshop begins this Wednesday (9/19) at 12 noon in N-354. The workshop is billed as a tour of the applied statistics community at Harvard University, with scholars from Economics, Political Science, Public Health, Sociology, Statistics, and other fields coming together to present cutting-edge research. We are happy to have Ben Goodrich (Government G-5) presenting his work on Semi-Exploratory Factor Analysis. Below is a summary of his talk:
I develop a new estimator called semi-exploratory factor analysis (SEFA) that is slightly more restrictive than exploratory factor analysis (EFA) and considerably less restrictive than confirmatory factor analysis. SEFA has three main advantages over EFA: the objective function has a unique global optimum, rotation is unnecessary, and hypotheses about models can easily be tested. SEFA represents a very difficult constrained optimization problem with nonlinear inequality constraints that, for all practical purposes, can only be solved with a genetic optimization algorithm, such as RGENOUD (Mebane and Sekhon 2007). This use of new features of RGENOUD is potentially fruitful for difficult optimization problems besides those in factor analysis.
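RGENOUD itself is an R package (Mebane and Sekhon 2007), and the SEFA objective is far more involved than anything shown here. Purely as an illustration of the kind of genetic search the abstract refers to, here is a hedged toy sketch in Python of a bare-bones genetic optimizer (tournament selection, blend crossover, Gaussian mutation) on a one-dimensional bounded problem; every function name and parameter is my own invention, not part of SEFA or RGENOUD:

```python
import random

def genetic_minimize(f, lower, upper, pop_size=50, generations=200, seed=0):
    """Minimize f on [lower, upper] with a bare-bones genetic search."""
    rng = random.Random(seed)
    pop = [rng.uniform(lower, upper) for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        for _ in range(pop_size):
            # Tournament selection: best of three random individuals, twice
            a = min(rng.sample(pop, 3), key=f)
            b = min(rng.sample(pop, 3), key=f)
            # Blend crossover plus Gaussian mutation
            w = rng.random()
            child = w * a + (1 - w) * b + rng.gauss(0, 0.1)
            # Clip the child back into the feasible interval
            nxt.append(min(max(child, lower), upper))
        pop = nxt
    return min(pop, key=f)

# Toy objective: the search should settle near x = 2
best = genetic_minimize(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

The appeal for problems like the one in the abstract is that nothing here requires gradients or a well-behaved objective; the same population-based search pattern extends, with much more machinery, to constrained multidimensional problems.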
We have a preliminary schedule posted on the course website; please contact me (Justin Grimmer, firstname.lastname@example.org) if you are interested in presenting in one of our few remaining open spots. And of course, a light lunch will be provided.