September 16, 2012
The NYT just alerted me to a paper by Joan Serra and coauthors demonstrating what we can learn about popular music with a big data approach. I'll leave it to you to interpret the trends they identify (music is getting louder, also more similar), but it was interesting and gave me a lot of ideas for how I could borrow some of this technology for my own research.
July 2, 2012
Ok, maybe not, but I was just introduced to kaggle, which is sort of like oDesk and a lot like topcoder: people post a problem and other people compete to win a prize by solving it most effectively. Kaggle is devoted to data analysis problems. For example, there is currently a contest to win $3 million if you can "Identify patients who will be admitted to a hospital within the next year, using historical claims data." Another contest is to "Identify people who have a high degree of Psychopathy based on Twitter usage," for $1000. Even at those low stakes, there are 113 teams competing.
If it's like topcoder, it will be a remarkably cheap way to get great solutions to data problems. I'm not sure why this works (maybe competing analysts overestimate their probability of winning?), but that's what Karim Lakhani tells me. I'm already working on a paper using topcoder, but I'm now wondering -- half seriously -- if I could outsource my data analysis as well. I probably couldn't, but only because I enjoy this kind of work myself and I'd miss it.
If you work with data, will Kaggle eventually take your job? Well, it makes it easier for people with data to find the best analysts out there. I'm not directly in competition with these folks, since none of these companies would come to me with their consulting gigs anyway. But it does suggest that the consulting market for quantitative social scientists is on the verge of being restructured. I'm excited about that -- I'm willing to bet that we'll learn more this way.
May 30, 2012
Ever wondered how to get twitter data? I spent much of today listening to a great presentation on this subject by Derek Ruths at the 2012 Computational Social Science workshop hosted at IQSS and conveniently broadcast on the web. I'm not sure if video of the presentation will be available, but the code is all available and is extremely well done. For the most part, it worked "out of the box" while I was sitting there listening to the lecture, which is incredible. More incredible, it seemed like it worked for most of the people there. I've never seen a live scripting demonstration go so smoothly.
May 12, 2012
1) I love seeing other people's first sketches. I sketch first too, and I find that the quality of any graphic can mostly be determined by how good the idea was when I first sketched it.
2) This reminded me that rather than using R to make my final figures, I really need to run them through Illustrator. Nathan Yau's book Visualize This gives some awesome worked examples of how to clean up R graphics in Illustrator. (And for Harvard folks, the book is available online through Widener library!).
March 6, 2012
Every discovery of a plausible instrumental variable sparks a cottage industry of papers all using the same instrument to ask different questions. A working paper by Heather Sarsons, titled "Rainfall and Conflict," calls one of these cottage industries into serious question. From the abstract:
Starting with Miguel, Satyanath, and Sergenti (2004), a large literature has used rainfall variation as an instrument to study the impacts of income shocks on civil war and conflict. These studies argue that in agriculturally-dependent regions, negative rain shocks lower income levels, which in turn incites violence. This identification strategy relies on the assumption that rainfall shocks affect conflict only through their impacts on income. I evaluate this exclusion restriction by identifying districts that are downstream from dams in India. In downstream districts, income is much less sensitive to rainfall fluctuations. However, rain shocks remain equally strong predictors of riot incidence in these districts. These results suggest that rainfall affects rioting through a channel other than income and cast doubt on the conclusion that income shocks incite riots.
It's a short, readable paper -- worth checking out if you're into this kind of thing.
January 17, 2012
November 26, 2011
Using racially charged Google searches as a proxy for racism, this paper by Seth Stephens-Davidowitz shows that Barack Obama lost 3-5 percentage points of the popular vote in 2008 because he is black. I found it very interesting, and the empirical strategy invites imitation and application to other areas. Worth a look.
November 9, 2011
We already knew that scholars who provide replication data get cited more. Now we know that they are also more likely to be right! Paper by Wicherts, Bakker, and Molenaar here. Blog post by Gelman here.
The authors asked for the replication data behind 49 psychology studies. Amazingly, many of the original authors did not comply, even though they were explicitly under contract with the journals to provide the data.
1) Papers whose authors withheld data had more reporting errors, meaning that the reported p-value was different from the correct p-value as calculated from the coefficient and standard error reported in the paper (a simple check to run -- see the sketch just after this list). I'd really like to think that these were all just innocent typos, but in seven papers the typos reversed findings. None of those seven authors shared their data.
2) Papers whose authors withheld data tended to have larger p-values, meaning that their results were not as "strong" in some sense. This interpretation tortures the idea of the p-value a little bit, but it certainly represents how many researchers think about p-values. It's striking that researchers who think their results are "weaker" were less likely to provide data. It also suggests that researchers who are getting a range of p-values from different, plausible models tend to pick the p-value just below 0.05 rather than the one just above. But then, we already knew that.
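To make point 1 concrete, here is a minimal sketch in R of the recomputation check -- the numbers are invented, and a real check would use the test statistic and degrees of freedom actually reported in each paper:

```r
## Recompute a two-sided p-value from a reported coefficient and standard
## error, and compare it to the p-value printed in the paper.
## All numbers below are invented for illustration.
reported_coef <- 0.42
reported_se   <- 0.21
reported_p    <- 0.03                 # what the (hypothetical) paper claims

z <- reported_coef / reported_se
recomputed_p <- 2 * pnorm(-abs(z))    # large-sample normal approximation

cat("recomputed p:", round(recomputed_p, 3), "| reported p:", reported_p, "\n")
## A mismatch like this is what counts as a reporting error; with a t
## reference distribution you would use 2 * pt(-abs(z), df) instead.
```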
This is frightening, not least because most of these were lab experiments, where we tend to think that the results are less sensitive to analyst manipulation because of strong design. Also, these are only the problems that were obvious without access to the replication data.
Most responses to this study include appeals for better data sharing standards, but I don't think that's necessary. As long as we know which authors provide replication data and which don't, we can all update accordingly.
September 29, 2011
Benedict Carey of the New York Times discusses a paper (gated) by Scott Golder and Michael Macy showing that people's moods -- as expressed within the character limit of twitter -- have remarkably predictable patterns. The authors' interpretation is that our moods are fundamentally linked to our circadian rhythms.
First, I really like this paper and I'm glad to see it come out in Science. An earlier version was presented at one of the conferences put on by Arthur Spirling and the Harvard Program on Text Research, and it caught my eye then.
On one hand, it's obviously innovative research that is making great use of the reams of data now sitting in the interwebs somewhere, waiting to be analyzed. The possibilities in this new, data-rich realm are seemingly endless: the culturomics/ngrams project, work on political blogs (Abe Gong), congressional tweeting (Drew Conway), the news cycle (Leskovec, Backstrom, and Kleinberg), and so on.
But we also should spend more time stepping back and asking hard questions about the data. Are tweets really a great measure of sentiment if the decision to tweet isn't random? Who is online and how do they differ from the offline folks? There is basically no discussion of this in the Golder and Macy article. Perhaps the lack of attention to the limitations of "big data" research is just an inevitable part of the fad cycle, but that doesn't mean we should let our standards slide just because someone has cool data.
June 8, 2011
A recent New York Times piece by Gina Kolata summarizes some recent writing in medical journals bemoaning the proliferation of side effects listed on drug labels.
From the FDA:
"extensive lists of rare and minor adverse events for which there are no data to support a causal relationship" are not useful.
I don't know exactly what the FDA means by "data to support a causal relationship," but considering that it takes millions of dollars of randomized experimentation to get evidence of a causal relationship that the drug works, it's going to be slightly expensive to provide any comparable level of evidence about side effects.
Patients weighed in as well:
Jim Murrell, a 54-year-old telecommunications consultant who lives in the Atlanta suburbs, says he wants to know all about adverse drug reactions but he has decided the labels are not helpful; he looks for better sources on the Internet.
"I took a medication that had the side effect of drowsiness," he said. "I read a little further and saw it had another side effect. Insomnia. One medication had diarrhea as a side effect and it also had constipation."
"It makes no sense," Mr. Murrell said.
Of course it's entirely possible for a medication to cause both insomnia and drowsiness, because the drug may interact with patient characteristics to produce different side effects in different people. But wow, given the sheer proliferation of possible effects, it's going to be pretty tough to provide warnings about interactive side effects without lowering our standards of evidence.
May 25, 2011
I got an email a few days ago urging me to try www.zanran.com, a search engine "for finding data and statistics". I'm skeptical of some of these search engines (I can never find anything I want on www.rseek.org for example), but I was pleasantly surprised. I typed in "fatwa" and, lo and behold, uncovered a few papers that are of some relevance to the project I'm doing on Islamic fatwas.
My first reaction is that this is really a search engine for finding figures and tables, not necessarily "data" or "statistics". But it turns out that I really like finding new papers by searching their figures. It's much easier to tell if a paper is going to be relevant to my research by looking at figure 4 than by looking at the title.
Searching through papers by figures has had me thinking about the search engine I'd really like to see: one that searches for images. It'll work like this: you take a picture, upload the picture to the search engine and then it uses image similarity algorithms to generate search results ranked by the similarity of the image. This would alleviate the somewhat rare but annoying situation where you have a picture of something but don't know what it is. There's really no way to google that...
"Who was that random guy in our group photo?"
"Hold on, I'm googling 'guy with brown hair'..."
May 19, 2011
In response to a comment by Chris Blattman, the Givewell blog has a nice post with "customer feedback" for the social sciences. Number one on the wish-list is pre-registration of studies to fight publication bias -- something along the lines of the NIH registry for clinical trials.
I couldn't agree more. I especially like that Givewell's recommendations go beyond the usual call for RCT registration to suggest that we should also be registering observational studies. If we're dreaming about discipline-wide reforms to enhance the credibility of political science, it would be nice if we had reforms that weren't only applicable to the research that is already most credible.
The most thought-provoking reform idea thrown out by the Givewell blog is this:
As food for thought, imagine a journal that accepted only studies for which results were not yet known. Arguably this journal would be more credible as a source of "well-designed studies addressing worthwhile questions, regardless of their results" as opposed to "studies whose results make the journal editors happy."[Thought experiment round two: how would this journal differ from the APSR?]
I've been trying to think of ways to personally implement the principle of preregistration (short of organizing a registry or starting the above journal). The most obvious thing I can think of is to keep a detailed lab notebook (see discussion by Lupia here). Ideally, it would be public so that I couldn't go back and fudge it -- "Oh, I expected all along that the coefficient would be negative." Or maybe I'd keep it private during the research but somehow make deletions impossible.
Actually, even if I never made this public, taking better notes as I do research could have serious benefits. For one thing, it would be incredibly helpful for mitigating the inevitable bit-rot from letting a project sit for a while. And it's nice to be able to remember how a project actually unfolded. As Fox sagely observes, "It is best...not to fool yourself, regardless of what you think about fooling others" (p. 511, in reference to standard errors).
Perhaps there's actually a market for this kind of thing. Would reviewers look more favorably on papers submitted with a time-stamped preregistration? I guess not, or else at least a few people would be doing it already.
Still, I'm tempted to give public lab notes a whirl myself. Suggestions welcome!
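For what it's worth, here is a minimal sketch of the kind of tamper-evident notebook I have in mind -- it assumes the R digest package, and every function name here is just something I made up for illustration. Each entry is time-stamped and chained to the hash of the previous entry, so quietly editing or deleting an old entry breaks the chain.

```r
## A toy append-only lab notebook: each entry stores a timestamp, the note,
## and a hash of (previous hash + timestamp + note). Rewriting history breaks
## the chain. Assumes the 'digest' package is installed.
library(digest)

new_notebook <- function() {
  data.frame(time = character(), note = character(), hash = character(),
             stringsAsFactors = FALSE)
}

add_entry <- function(nb, note) {
  prev  <- if (nrow(nb) == 0) "" else nb$hash[nrow(nb)]
  stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  h     <- digest(paste(prev, stamp, note), algo = "sha256")
  rbind(nb, data.frame(time = stamp, note = note, hash = h,
                       stringsAsFactors = FALSE))
}

verify <- function(nb) {
  prev <- ""
  for (i in seq_len(nrow(nb))) {
    if (nb$hash[i] != digest(paste(prev, nb$time[i], nb$note[i]), algo = "sha256"))
      return(FALSE)
    prev <- nb$hash[i]
  }
  TRUE
}

nb <- add_entry(new_notebook(), "Hypothesis: the coefficient on X will be negative.")
nb <- add_entry(nb, "Ran model 1; the coefficient is positive. Hmm.")
verify(nb)   # TRUE -- and FALSE if any earlier entry is edited after the fact
```

Publishing the running hash (or the whole notebook) somewhere public would be the cheap way to make the time stamps credible.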
May 2, 2011
Every so often, I try to take some time to read something that I should have read ages ago. Tonight's gem was the 2010 draft of "The Industrial Organization of Rebellion: The Logic of Forced Labor and Child Soldiering" by Bernd Beber and Chris Blattman (link). The paper gets a lot of traction out of a formal model and then matches these predictions up to reality using some unique and hard-earned data.
An off-hand comment in the paper caught my attention: Beber and Blattman's assertion that "a single case helps to refine our theory and validate some basic assumptions, but cannot test it" (p 19). Ordinarily I'd say "sure", except that they are referring to their own nuanced analysis of a large number of child soldiers painstakingly tracked down in Uganda. Moreover, they actually have a credible identification strategy -- abduction into the Lord's Resistance Army was essentially random after conditioning on age and location. I was pretty convinced by the analysis; more so than by the regressions on the novel but dubious rebel dataset they introduce at the end of the paper. Maybe the Uganda child-soldier analysis wasn't a "test" but it sure moved my posterior beliefs about their hypotheses.
So, I wonder what they meant. I guess I can just email them and ask, but that would be no fun. Instead, I'll speculate wildly.
One possibility is that the Uganda data was collected and analyzed prior to the development of the model. Or perhaps they are noting that there isn't any variation in rebel groups, so we obviously can't estimate the effect of some of their important moving parts (motivating the turn to other data). On the other hand, their theory does generate a number of observable implications that they do successfully test -- er, "validate" -- with the child-soldiering data. And like I said, it was this evidence that convinced me. I think even they realize this: they spend 8 pages on Uganda and 4 on the cross-rebel-group comparisons.
I think the usefulness of single cases is worth pondering, especially since it's often much easier to get the kind of unique data that allows causal identification for a single case!
March 29, 2011
I just happened across this paper on Steven Levitt's website entitled "What Does Performance in Graduate School Predict? Graduate Economics Education and Student Outcomes." In addition to learning all kinds of fascinating things about the economics profession and professionalization process, I was struck by the non-causal use of the word "effect" when discussing the results of their statistical (er, econometric) models.
I counted 8 uses in all, none of which were actually a believable effect of any kind. To wit:
When admissions rank is excluded from the model, the math GRE has a statistically significant effect on micro, macro, and metrics grades, and the verbal GRE has a statistically significant effect on macro and metrics grades. (page 514, second column).
It's pretty hard for me to believe that GRE scores actually affect grades apart from their effect on grad school admissions (which affects a student's ability to get grades at all). Clearly, they don't mean it causally, especially since they are dropping a post-treatment variable in and out of the model as they talk about it.
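To see why toggling a post-treatment variable matters, here is a hedged little simulation -- every quantity is invented, and the direction of the distortion is just one possibility. GRE scores and unobserved ability are independent; admissions rank depends on both; grades depend only on ability. GRE has no effect on grades, yet conditioning on rank manufactures one:

```r
## Toy simulation of post-treatment (collider) bias; all quantities invented.
set.seed(514)
n       <- 5000
gre     <- rnorm(n)
ability <- rnorm(n)                      # unobserved
rank    <- gre + ability + rnorm(n)      # post-treatment relative to GRE
grades  <- ability + rnorm(n)            # GRE has no causal effect on grades

coef(summary(lm(grades ~ gre)))["gre", ]         # roughly zero, as it should be
coef(summary(lm(grades ~ gre + rank)))["gre", ]  # spuriously negative and "significant"
```

The point isn't the sign; it's that the "effect" of GRE scores moves around for reasons that have nothing to do with causation when a post-treatment control goes in and out of the model.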
Ok, I get it that not everyone is as apoplectic -- er, concerned -- as I am about using the word "effect" to denote non-causal relationships. I realize that the casual (non-causal) lingo of "effects" can just be short-hand among people who know better. And sure, it's just kind of a fun paper. But I would have expected this particular group of economists to have the catechisms of causal inference so well memorized that writing this kind of sentence would give them hives.
February 4, 2011
January 18, 2011
A few months ago, I read this review of instrumental variables in political science by Allison Sovey and Don Green. I enjoyed it tremendously, so I was pleased to see that it just came out in the American Journal of Political Science. The (mis)use of instruments in the political science literature has been driving me crazy, so I'm hoping that Sovey and Green's article will help raise the bar. In some ways, the dataset they created shows that things are already getting better. More and more articles are offering justifications for their choices of instruments and more articles are using "just-identified" models to avoid the "embarrassment of riches" problem that comes from using multiple instruments. On the other hand, a plurality of articles still fail to give any justification for their instruments, so there is a long way to go. Hopefully compliance with the check-list Sovey and Green provide will become the new standard in the literature.
But wait, there's more...
Sovey has another paper (with Peter Aronow) that may be the future of instrumental variables. It is now well known that IV set-ups generally identify a local average treatment effect (LATE), which is rarely the quantity of interest for researchers. Aronow and Sovey show how to recover sample average treatment effects by estimating compliance scores for each unit (even in the face of two-sided non-compliance!) and then using these weights to estimate what the treatment effect would be if every unit in the sample had complied with its assigned treatment. This idea strikes me as very smart. It also strikes me as crazy, but possibly crazy enough that it might just work. If I ever find an instrument I actually believe for a problem I actually care about, I'll be trying this out.
January 13, 2011
I just read an interesting article in the New Yorker (hat tip to John Sheffield) that gives an entertaining introduction to the so-called "decline effect" in scientific discovery. Apparently, at least a few scientists have had trouble reproducing the large effect sizes of initial studies on various topics (I'm shocked, shocked!). Some argue that the trend is general -- the initial studies on any topic will find big effect sizes that will be harder and harder to replicate over time. My guess is that this is simply a story about the problems of searching for "significance," but people interviewed in the article offer other explanations as well, including the possibility that nature is out to get scientists by teasing them with results and then making them go away.
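Here is a quick, hedged illustration of the searching-for-significance story (all parameters invented): if only initial studies that clear p < .05 get published, the published estimates overstate a small true effect, and honest follow-ups look like a mysterious decline.

```r
## Toy illustration of the significance filter; all parameters invented.
set.seed(2011)
true_effect <- 0.2
n_per_study <- 30

one_study <- function() {
  x  <- rnorm(n_per_study, mean = true_effect)
  tt <- t.test(x)
  c(estimate = mean(x), p = tt$p.value)
}

studies   <- t(replicate(5000, one_study()))
published <- studies[studies[, "p"] < 0.05, ]    # only "significant" studies appear

mean(published[, "estimate"])   # well above the true effect of 0.2
mean(studies[, "estimate"])     # all studies, filter off: about 0.2
```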
"But wait," you say, "wouldn't the discovery of the 'decline effect' also be subject to the decline effect?" Is this yet another situation where people focus on apparently confirmatory cases while ignoring cases that don't confirm their hunch? I'm hoping "yes", if only because it would be deliciously ironic.
And if all this doesn't make you depressed about the durability of published findings, try this gem by John Ioannidis.
December 6, 2010
A recent New York Times article highlights recent research by Dan Cohen and Fred Gibbs that uses computational power and statistics* to answer questions about what Victorian literature says about the Victorians. I've known that this kind of thing was in the works in various parts of the humanities, but I haven't been keeping up. I think this kind of analysis will be making more inroads into the humanities and social sciences in the future (a previous NYT article in the series takes up this issue).
* Ok, they are just using word frequencies at the moment, but the data they are collecting in collaboration with Google made me drool at the possibilities for machine learning applications.
One quick observation:
It was interesting to me that the criticisms of quantitative text analysis are the same in literature as they are in political science.
(1) It lets people get away with not reading or interpreting the texts.
(2) It undermines the ability of research to get nuanced meaning out of texts.
(3) It shapes the kinds of questions researchers ask.
My quick thoughts on these:
(1) People who use statistical text analysis in their work generally have to read a lot of texts. I don't think quantification is a substitute for reading.
(2) Quant text analysis does gloss over nuanced meanings, but it can reveal broad trends in a huge body of texts that a close reading of a handful of texts can't.
(3) We tend to only really entertain research questions that we think we have the tools to answer, so until recently, very few people have tried to answer questions where you'd need to read 100,000 documents to get an answer. Putting these types of questions in the realm of possibility is not a bad thing.
November 9, 2010
A recent New York Times piece by Nicholas Wade makes the point that research is an extremely risky enterprise with far more failures than successes.
"Nature yields her secrets with the greatest unwillingness, and in basic research most experiments contribute little to further progress, as judged by the rarity with which most scientific reports are cited by others."
In political science, it seems like most of this risk gets passed on directly to the researcher with possibly detrimental effects on the way we do research. In theory, if a project is a dead end, we should probably just walk away. In practice, however, projects can become "too big to fail". My sense is that the need to squeeze something out of a research project leads to a lot of poor statistical practice -- specification searches for something that is "significant", overly optimistic claims about causal identification, and other shady dealings. On the other hand, attempts to mitigate this risk by avoiding the cost of large data collection projects typically mean that we keep running model after model on the same 5 datasets that everyone else is using.
Is the inherent riskiness of research at the root of these problems? How do you manage these risks?
October 29, 2010
I get asked this question from time to time, but when I got asked this question multiple times on Friday, I guessed that something had gone down.
What went down was Chris Blattman offering a rant (his description, not mine) about the "cardinal sin of matching" -- the belief that matching can single-handedly solve endogeneity problems. Most of the questions I got went something like "Chris says matching can't help with endogeneity. You say it can. What gives?"
First, let me say that I agree with most of Chris' rant, and I think that his blog post should be required reading for anyone using matching right now. There are too many people out there who think that matching is a magical method that fixes endogeneity automatically. It's not, and reading Chris' discussion should be the first step in a 12-step program for those of us who have drunk too deeply of the matching Kool-Aid.
Now for the statistics:
Basically, matching can solve your endogeneity/selection/confounding problem if you can measure the variables that influence treatment assignment. That is a big "if" and measurement is the key here. Matching is generally a pretty smart way to condition on observables, but it doesn't buy you anything if you believe that there are unobserved variables that systematically influence treatment assignment. Thus, if you think your regression is biased because of unobservables, then matching by itself won't help you. What you really need to do is go out and measure the unobserved confounders and condition on them.
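To make that concrete, here is a minimal sketch with invented data (plain base R rather than a matching package): selection into treatment depends on an observed covariate x, and nearest-neighbor matching on the estimated propensity score recovers the treatment effect -- but only because x, the variable driving selection, was measured.

```r
## Toy propensity-score matching example; all quantities invented.
set.seed(42)
n     <- 2000
x     <- rnorm(n)                            # observed confounder
treat <- rbinom(n, 1, plogis(x))             # treatment assignment driven by x
y     <- 1 * treat + 2 * x + rnorm(n)        # true treatment effect = 1

mean(y[treat == 1]) - mean(y[treat == 0])    # naive difference: badly biased

pscore  <- fitted(glm(treat ~ x, family = binomial))
treated <- which(treat == 1)
matched <- sapply(treated, function(i)       # nearest control, with replacement
  which(treat == 0)[which.min(abs(pscore[treat == 0] - pscore[i]))])

mean(y[treated] - y[matched])                # close to the true effect of 1
```

If assignment also depended on something unmeasured, the exact same procedure would stay biased -- which is exactly the point.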
In the end, I think that people who like matching methods (and other conditioning methods) tend to believe that most confounders can be measured (perhaps with a lot of hard work) and that there aren't a lot of lurking unobservables. In contrast, people I talk to who are skeptical of matching almost always argue that there will always be problematic unobservables lurking no matter how hard you try to measure them. In general, these types of people prefer instrumental variables approaches (and tend to be economists rather than statisticians, interestingly enough).
Fair enough -- there may be lurking unobservables. Frankly, there's no way to get empirical traction on how many lurking unobservables are out there (definitionally), so I think it comes down to subjective beliefs about the nature of the world. But what always gets me is that the same people who tell me that lurking unobservables are everywhere tend to be fairly comfortable making the types of exclusion restrictions that make IV approaches work. The crazy thing is that, just like matching, IV relies on assumptions about unobservable causal pathways. The claim that an instrumental variable is valid is the claim that there are no unobserved (or observed) variables linking the instrument to the outcome except through the path of the instrumented variable. So it always puzzles me that the same people who think that lurking unobservables are everywhere in matching somehow think that all these lurking unobservables go away as soon as you call something an instrument and try to defend it as exogenous.
I'm pretty skeptical of most observational IV approaches -- unless you flipped the coins yourself or you can really tell me a plausible story about how nature flipped coins, I probably won't believe your instrument. So why am I falling into the reverse trap: believing that unobservables are more likely to undermine IV than conditioning approaches? Maybe I'm just wrong here and I need to become an even more extreme skeptic of most empirical research than I already am. But my sense is that the conditions for an IV to hold are more knife-edge than the ignorability assumptions. Perhaps that's wishful thinking.
But wishful thinking aside, matching can help solve endogeneity problems if you can measure the variables that influence selection (and if there happens to be sufficient overlap, yadda, yadda). All those people out there who make blanket statements like "matching can't solve endogeneity" are either making the assumption that there are always lurking confounders or else they are just plain wrong.
October 24, 2010
Lately I've been thinking a lot (and writing a little) about ways to combine the qualitative and quantitative empirical traditions in political science, so I was quite interested to read a new post on the philosophy blog at the New York Times written by mathematician John Paulos. He contrasts the logic of story-telling with the logic of statistics to draw out some interesting implications for how each mode of understanding colors the ways we think about the world.
In a sentence that could have come out of a "scope and methods" text, Paulos identifies the fundamental difference between literary and statistical traditions: "The focus of stories is on individual people rather than averages, on motives rather than movements, on point of view rather than the view from nowhere, context rather than raw data." I think this is an accurate description of how two empirical cultures in social science have developed, but I disagree that this divide is inherent.
This may be unorthodox, but I don't see statistics as inherently "quantitative" or focused on the "general" rather than the "particular". I see statistics as a relatively young field attempting to develop answers to the question "how should I go about formulating my beliefs about the world now that I've observed some part of it." Eventually, statistics will need to offer advice on how to update our picture of the world after observing any type of information -- not just information that comes from randomized experiments, fits neatly in rectangular matrices, or involves enough "N" for some central limit theorem to hold.
Narrative research seems ideally suited to work with the types of information that traditional statistics has largely ignored. Why, then, should statistics take up the task? Because narratives are rich with data, but researchers using narrative methods have little guidance on how to make inferences from those data. In the richest of literary narratives this ambiguity enhances the text, allowing the reader to reach many conclusions about the meaning and implications of a work. In empirical social science, this ambiguity can become a liability. If statisticians spent more time developing ways of making appropriate inferences from data in these settings -- frankly, the most common settings that we face -- it might lessen this ambiguity by offering a clear set of rules for mapping complex narrative data to inference.
My hunch is that the people who work with data that lends itself to narrative research already have ideas about the best practices for making valid inferences from these data. Perhaps we should be more interested in learning to speak statisticians' language so that we can suggest these insights to them and they, in turn, can suggest refinements for us. This exchange would help statisticians develop a science of inference and help us develop knowledge of social phenomena.
March 16, 2010
Last Halloween, I alerted readers of the social science statistics blog to cutting edge research suggesting that if zombies attacked, humans faced serious risk of extinction.
It turns out that some of these conclusions may have been premature. Some recent research by Blake Messer suggests that if there is terrain that favors humans in some way, then humans may have a better shot at survival.
But it doesn't end there.
UCLA's Gabriel Rossman points out that Messer's model doesn't account for the possibility of human stupidity/sabotage (always a good thing to include in our models, I guess). Rossman's findings suggest that in the face of a zombie onslaught, small islands of weapons stockpiles might be more favorable for the long-term survival of the human race than a single cache -- perhaps the most important policy implication to come out of this renewed debate.
I think future research in this area will be worth following. First, I hear there is interesting work afoot on the spread of zombification through social networks, although getting the zombies to accurately report who bit who can be difficult. I've also heard rumors of some machine learning research that attempts to classify zombie speech (early results suggest that there is only one category: "BRAINS!"), and I believe some economists are using the apparent exogeneity of zombie outbreaks to finally identify the effect of education on wages.
February 8, 2010
Viri Rios has a great op-ed in the New York Times about mathematical social science and Mexican drug politics.
February 2, 2010
The other week, I read Jared Diamond's Guns, Germs, and Steel, which managed to get me a little worked up about a pet peeve of mine: the term "natural experiment." Just when I had calmed down, the Polmeth listserv alerted me to an entire issue of Political Analysis devoted to natural experiments. Arghhh...
Don't get me wrong -- in my own research I try to use observational data to make causal claims that are probably far more dubious than anything in the special issue of Political Analysis. I'm highly impressed by the research and I'm even more supportive of social scientists who are looking for "natural experiments" in political science. I just wish we could call them something else because I'm skeptical that they are really experimental.
The lead article of the PA special issue urges scholars "to use the language of experimental design in explicating their own research designs and in evaluating those of other scholars." I'm on board with using the language of experiments, but I've also seen more than a few recent papers framed as "natural experiments" that are really just observational studies with no particular claim to special status. The spread of experimental language into observational studies may have downsides as well as benefits.
Until recently, I basically assumed that when people said they had a natural experiment, what they really meant was that they had a credible instrument: a variable that breaks the link between treatment assignment and the potential outcomes for some or all of the units. However, the lead PA article places difference-in-differences, regression discontinuity, and matching methods under the tent of natural experiments. While I like (and use) these techniques and find them compelling, only some of them explicitly rely on an IV-type argument. Maybe I have more to learn.
The problem with any randomization that isn't controlled by the researcher is that extreme skeptics like me can then try to spin complicated stories about how confounding could occur. This is what I found myself doing while reading Guns, Germs, and Steel. An extremely simplified version of Diamond's argument is that geography, not genetics, determines which human societies become dominant and which are conquered or destroyed. He devotes the entirety of chapter 2 to discussing the settlement of Polynesia by people who come from essentially the same genetic stock but experienced different geographies once they settled particular islands. The random variation in geography is interpreted as the cause for significant variation in the trajectories of the peoples of each island or group of islands.
This might be a natural experiment if Diamond could show that people were somehow randomly assigned to different islands. The problem is that different types of people might choose to live on different islands. Although it may be random which islands an exploratory party reaches, the explorers can choose to stay or move on for reasons that might be related to genetic variation. Similarly, explorers and colonists are probably not a random sample of the population, so the types of people who reach a far-off island might have different genetic traits than those who remain in already established population centers. You get the idea.
I should reiterate that these reservations are just my gut reactions rather than a well-thought-out assault on the use of natural experiments. I'm interested to read more: Jared Diamond and our very own James Robinson have a new book out on the subject that I'm excited to read. Thad Dunning has written on the topic, as have others.
Bottom line: I'm thrilled (and jealous) whenever social scientists find some plausibly exogenous variation to exploit for causal inference. I think it should happen more. I just worry that by attaching the "experimental" label to these studies, we endow them with undue credibility.
January 19, 2010
I'm into biking (mostly road-biking these days) so I was interested to read a post on the New York Times' "Freakonomics" blog about a study that uses variation in bike helmet laws across US states to show that helmet laws decrease bike riding among kids and teens. Since I think that most people should ride bikes most of the time AND I have been known to bug people to wear helmets, perhaps I've been working against myself.
A few things came to mind while reading the study. First, the study shows that helmet laws improve bike safety for kids in the same age ranges that see the drop in riding. Unless I missed something, it seems like part of this effect could be due to fewer kids riding bikes (in addition to the obvious safety improvement that comes from actually wearing a helmet). I'd be curious how much the decrease in bike use is driving the increase in safety, especially if kids are simply deciding to do other things, like skateboarding, that are perhaps equally dangerous but don't require helmets (a possibility mentioned by the authors). This may mean that the total effect of helmet laws on child safety is smaller than the effect estimated in the paper, because some of the decrease in bike injuries is counter-balanced by increases in other types of injuries that aren't part of the study.
Second, the authors use some fixed effects and diff-in-diff models, but I think this paper is calling out for the synthetic control method developed by Abadie, Diamond, and Hainmueller. The policy intervention is clean and there are a reasonable number of states that don't have laws, so building synthetic matches might be feasible. There might be some interference problems with states that pass helmet laws later, but those are details...
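For readers who haven't seen it, the simplest version of the diff-in-diff comparison is just an interaction term; here is a hedged two-period sketch with made-up state-level data (the synthetic control approach would instead build a weighted combination of no-law states chosen to track the treated states' pre-law riding trend).

```r
## Minimal two-period difference-in-differences sketch; data invented.
## 'law' marks states that adopt a helmet law; 'post' marks the post-adoption
## period; the interaction is the diff-in-diff estimate of the law's effect.
set.seed(7)
panel <- expand.grid(state = 1:50, year = 1:2)
panel$law    <- as.integer(panel$state <= 25)    # 25 states adopt the law
panel$post   <- as.integer(panel$year == 2)
panel$riding <- 10 + 1 * panel$law + 0.5 * panel$post -
  2 * panel$law * panel$post + rnorm(nrow(panel))  # true effect of the law: -2

summary(lm(riding ~ law * post, data = panel))$coefficients["law:post", ]  # about -2
```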
I'll end this post with a shameless plug: bike more! (and wear a helmet)
December 22, 2009
This morning the New York Times alerted me to a Science piece written by two economists working on measuring happiness. Their basic finding is that objective measures of quality of life (nice climate, etc) are pretty highly correlated with subjective, self-reported measures of how satisfied people are with their lives. They provide a ranking of US states by happiness level, accessible here, which shows Louisiana first and New York last, with Massachusetts falling to 43rd. Go figure -- I like living in MA.
I really want to see some cross-national comparisons but I doubt anyone will be moving on to that unless the World Bank picks up Bhutan's Gross National Happiness measure as one of their development indicators.
Happy holidays to all!
November 28, 2009
Judea Pearl describes his new article Causal inference in statistics: An Overview as "a recent submission to Statistics Survey which condenses everything I know about causality in only 40 pages." That seemed like a bold claim, but after reading it I'm sold. I don't come from Pearl's "camp" per se, but I found this a really impressive overview of his approach to causation. His overtures to folks like me who use the potential outcomes framework were much appreciated, although it is clear throughout that there is still intense debate on some of the issues. The bottom line: if you've ever wondered what the structural equation modeling approach to causal inference is all about, this is your one-stop, must-read introduction (and an insightful, engaging, and thorough one at that).
November 11, 2009
Brandon Stewart pointed me to an interesting blog post by Andrew Gelman that touches on the issue of explaining the "causes of effects." The basic point is that "why" questions are difficult to answer in a potential outcomes framework but often we really care about them. Some folks in political science have gone so far as to argue that researchers using "qualitative" methods are more inclined (and better able) to tackle these "why" questions than their "quantitative" colleagues who mostly focus on "effects of causes."
This has been on my mind lately -- as part of a class in the statistics department, I've had several conversations with Don Rubin about how retrospective "case-control" studies might fit into the potential outcomes framework. The goal of the medical researchers who run these studies is usually a "why" question: why did an outbreak of rare disease X occur, which genes might cause breast cancer, and so on. Case-control studies and their variants are great for searching over a number of possible causes and pulling out the ones that have strong associations with the outcome, but they aren't so great for estimating treatment effects. Rubin suggests that the proper way to proceed is probably to first use a case-control study to search over a number of possible causes and then estimate treatment effects for the most likely causes using a different sampling method (matched sampling for situations where the research has to be observational, experimentation when it's possible). It seems like this already happens to some extent in biostatistics and epidemiology, and it also happens informally in political science.
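A rough sketch of that first, "search" step (everything here is invented): with case-control data, you can screen many candidate exposures with one logistic regression apiece and flag the strongest associations for follow-up under a better sampling design.

```r
## Toy case-control screening; all data invented. 100 cases, 100 controls,
## 20 candidate binary exposures, only the first of which is truly associated.
set.seed(1)
case      <- rep(c(1, 0), c(100, 100))
exposures <- matrix(rbinom(200 * 20, 1, 0.3), nrow = 200)
exposures[, 1] <- rbinom(200, 1, ifelse(case == 1, 0.6, 0.3))  # real association

## Screen each exposure separately; keep the odds ratio and p-value.
screen <- t(sapply(1:20, function(j) {
  fit <- glm(case ~ exposures[, j], family = binomial)
  c(odds_ratio = unname(exp(coef(fit)[2])),
    p          = summary(fit)$coefficients[2, 4])
}))
rownames(screen) <- paste0("exposure_", 1:20)

head(screen[order(screen[, "p"]), ], 3)  # candidates to follow up with a new design
```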
I think this formulation suggests that answering a "why" question requires both "causes of effects" and "effects of causes" approaches; we need to search over a number of possible causes to identify likely causes, but we also need to test the effectiveness of each likely cause before we can say much about the causal effect. We probably still can't answer questions like "what caused World War I" but maybe this gets us somewhere with more tractable types of "why" questions.
October 30, 2009
It made my day when this showed up in my inbox this morning. I'm glad to see someone knows what to do if/when the zombie outbreak occurs.
October 21, 2009
I recently found a paper by Angus Deaton that attempts to discount the usefulness of (1) instrumental variables for making causal inferences in development economics and (2) field experiments. He has definitely stirred the pot a little and is now part of an interesting debate, although the discussion seems to be more focused on Deaton's controversial claims about experiments.
I think Deaton overlooks some of the benefits of experimental research, but his criticism of instrumental variables seems dead on, especially on the use of multiple instruments (see pages 12-13). Intuitively, we might think that having many instruments makes for better causal inference -- if one doesn't work out, then the others will pick up the slack. Following this logic, studies that use multiple instruments and "test" for exogeneity with overidentification tests have become popular in the instrumental variables literature. Essentially, these tests boil down to re-estimating the model with subsets of the instruments and showing that the estimated coefficients don't change dramatically. But agreement across subsets can mean one of two things: (a) all of the instruments are exogenous, or (b) all of the instruments are endogenous in similar ways, so the shared bias never shows up in the comparison. Personally, I think the probability of finding even a single good instrument for a given problem is small, so when shown a research design with multiple instruments, I need some serious convincing that miraculously all of the instruments are valid.
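A small illustration of why the overidentification logic leaves me cold (invented data; ivreg() is from the AER package): two instruments that are contaminated by the same unobserved confounder agree with each other across every subset -- and are both wrong.

```r
## Two instruments that are invalid in the same way pass the "compare the
## subsets" check while giving the wrong answer. All data invented.
library(AER)
set.seed(12)
n  <- 10000
u  <- rnorm(n)                        # unobserved confounder
z1 <- u + rnorm(n)                    # "instrument" 1, contaminated by u
z2 <- u + rnorm(n)                    # "instrument" 2, contaminated the same way
x  <- z1 + z2 + u + rnorm(n)
y  <- 1 * x + u + rnorm(n)            # true effect of x is 1

coef(ivreg(y ~ x | z1))["x"]          # about 1.25
coef(ivreg(y ~ x | z2))["x"]          # about 1.25 -- the subsets "agree"
coef(ivreg(y ~ x | z1 + z2))["x"]     # about 1.25 -- and all of them are biased
```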
I am probably overly skeptical and I am very sympathetic to heroic attempts to solve difficult problems of causal inference to answer important questions. Still, it seems that having multiple instruments can become an embarrassment of riches. A good instrument is so hard to come by that having too many starts to lend evidence against an empirical argument rather than for it.