
January 5, 2010

Discovering Causal Knowledge?

How do we learn about causal relationships when we can't run experiments? In my own work, the answer has been to look around for "natural experiments" in which something important varies for roughly random reasons: for example, the winners of close elections are selected almost at random, which allows you to draw conclusions about the effect of being elected on various outcomes (like the winner's wealth).

I recently read a paper by David Jensen and coauthors from the UMass Knowledge Discovery Laboratory that proposes a systematic way of uncovering causal relationships from databases. Their approach (which they call AIQ -- "Automated Identification of Quasi-experiments") is not to mine the joint density of variables for independencies that can produce a causal graph (as discussed in Jamie Robins' talk last March), but rather to produce a list of feasible quasi-experiments based on a standard database schema that has been augmented with some causal information (e.g. A might cause B, C does not cause A or B) and some temporal information (i.e. ordering and frequency of events). In the paper, the authors provide an overview of the approach as applied to three commonly-used databases, including some candidate quasi-experiments that the algorithm suggests.
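
To make the idea concrete, here is a minimal sketch of the flavor of this approach -- written in Python with an invented schema and assertion format, not the authors' actual AIQ implementation -- in which candidate (treatment, outcome) pairs are enumerated from user-supplied causal and temporal assertions:

    # Toy illustration (not the authors' code): list (treatment, outcome) pairs
    # that the user's assertions leave plausibly free of common causes.
    possible_causes = {
        # outcome -> variables the user says might cause it
        "netflix_rating": ["oscar_win", "budget", "studio"],
        "oscar_win": [],            # asserted: chosen essentially at random among nominees
        "budget": ["studio"],
        "studio": [],
    }
    occurs_before = {
        ("oscar_win", "netflix_rating"),
        ("budget", "netflix_rating"),
        ("studio", "budget"),
    }

    def candidate_quasi_experiments(possible_causes, occurs_before):
        """Yield (treatment, outcome) pairs with the right temporal ordering
        and no variable asserted to cause both."""
        for outcome, causes in possible_causes.items():
            for treatment in causes:
                if (treatment, outcome) not in occurs_before:
                    continue
                common = set(possible_causes[treatment]) & set(causes)
                if not common:
                    yield (treatment, outcome)

    print(list(candidate_quasi_experiments(possible_causes, occurs_before)))
    # [('oscar_win', 'netflix_rating'), ('studio', 'budget')]
    # ('budget', 'netflix_rating') is ruled out because studio is asserted
    # to affect both budget and ratings.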

My impression after reading the paper was that AIQ's discovery potential is pretty limited (at least at this stage), because most users who could provide the inputs AIQ needs could very likely think up the quasi-experimental design themselves. Any valid quasi-experimental design that AIQ can discover at this point appears to come from the user specifying that the treatment and outcome have no common cause or confounding factors, which is a very unusual situation that is either quite obvious (e.g. because there is a lottery or other explicit randomization) or requires significant substantive knowledge to establish. I wonder how commonly a researcher would a) have in mind a causal model that is sufficiently restrictive to produce plausible quasi-experimental designs through AIQ, and b) not have already thought of those designs.

The example of causal discovery the authors provide comes from a combined IMDB/Netflix movie database; they assert that winning an Oscar improves the reviews a movie receives on Netflix. In order for AIQ to suggest this quasi-experiment, the authors had to specify in advance that the Oscar-winning film is chosen from among nominees at random. One can of course criticize that assumption, but the point is that once you make that assumption it should be quite obvious that you have a quasi-experiment with which to study the effect of winning the Oscar on various outcomes; any film-specific, post-awards ceremony outcome should do. AIQ may provide a structured way to go through that exercise, but I'm not convinced there are many circumstances in which it would be useful to a researcher.

Posted by Andy Eggers at 10:18 AM

December 28, 2009

Spirling cited in Nature, underlying pattern of human conflict found

A recent paper in Nature documents power-law patterns (i.e. scale invariance) in the distribution of events within insurgencies: The number of casualties per insurgent event, and the number of insurgent events per day, apparently follow striking regularities across an array of insurgencies. Power laws everywhere!

What makes the paper especially notable around IQSS is that our own Arthur Spirling is cited in the first sentence:

The political scientist Spirling and others have correctly warned that finding common statistical distributions (for example, power laws) in sociological data is not the same as understanding their origin.

The citation is to Arthur's unpublished paper The Next Big Thing: Scale Invariance in Political Science, which provides a breezy overview of scale invariance as a concept and documents a few previously unremarked examples from political science.

Part of the point of Arthur's paper is that political science (and social science more broadly) has mostly ignored research in natural sciences that, like the Nature article, examines emergent patterns in social phenomena. As he points out, it's not how we "do business." The hard scientists chasing power laws attempt to explain an underlying random process starting from the distribution of outcomes; we're more accustomed to starting from the joint density of outcomes and covariates.

In a way, the fact that Arthur's paper was cited at all highlights the lack of interest in this style of work in social science. The authors of the Nature piece wanted to cite social science work on power laws, and they ended up with Arthur's piece, which is, for all its merits, several years old and unpublished.

I admit I've been a bit of a power-law curmudgeon, like other social scientists, but lately I've come to better understand the value of this approach. I don't expect that I'll be focusing on this kind of work myself, but, like Arthur, I believe it is a growth industry.

Posted by Andy Eggers at 9:20 AM

December 11, 2009

Academic Specialization Time and Career Switches -- Malamud Working Paper

Among the new working papers at NBER is this interesting paper by Ofer Malamud, an education economist at Chicago's Harris School.

Malamud is interested in the relative benefits of specializing early or late in one's academic career: specializing early presumably allows you to accumulate more skill in your specialization, but it also probably results in a poorer match between the individual and the specialization. In the new working paper, he compares the rate of switching careers between graduates of English and Scottish universities, who have similar educational backgrounds and enter a fairly integrated labor market but are required to specialize at different points in their educational careers: students in English universities typically must choose a specialization before entering the school, while students in Scottish universities typically specialize after two years of general education.

Malamud does sensible things to address possible differences between the two groups of students and the labor markets they entered, and his placebo tests effectively validate the design. (For example, to confirm that the students attending English and Scottish universities don't differ fundamentally in their propensity to switch careers, he shows similar switching rates for students receiving graduate degrees from English vs Scottish universities.) Ultimately he finds that switching is lower among Scottish university graduates (about 6 percentage points lower, where the mean rate of switching is about 42%), which he takes as confirmation that students are better off in their chosen field when they have a longer time to pick the field (and thus achieve a better match), even if it means they have less time to develop specific skills in that field.

Among working papers I've seen recently, I thought this one was unusually good in applying reasonable identification to a genuinely interesting substantive question.

Posted by Andy Eggers at 7:30 AM

November 30, 2009

Climate change and conflict in Africa

A paper just published in PNAS finds that armed conflict in Africa in recent decades has been more likely in hotter years, and projects that warming in the next twenty years will result in roughly 54% more conflicts and almost 400,000 more battle deaths. This is an important paper and it probably will attract significant attention from the media and policymakers. I think it's a good paper too -- seems fairly solid in the empirics, nice presentation, and admirably forthright about the limitations of the study. I'll explain a bit about what the paper does and what questions it leaves open.

To establish the historical connection between temperature and conflict in Africa, the authors conduct a panel regression with country-level fixed effects, meaning that they are examining whether conflict is more likely in a given country in unusually hot years. Their main model also includes country time trends, so it seems that they are not merely capturing the fact that the 1990s had more conflict than the 1980s due to the end of the Cold War and also happened to be hotter due to the overall warming trend. In a supplement that I was not able to access, they show that the correlation between temperature and conflict is robust to a number of other specifications. So the pattern of more conflict in especially-hot years seems fairly robust. (Arguments to the contrary very welcome.) They then link this model up to a climate model to produce predictions of conflict in the next twenty years, under the assumption that the relationship between temperature and conflict will remain the same in the future.
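
In a stylized form (my own notation, which may simplify the paper's actual specification), the panel model described above is something like

\[
\text{conflict}_{it} = \beta\,\text{temperature}_{it} + \mu_i + \delta_i\, t + \varepsilon_{it},
\]

where \(\mu_i\) is a country fixed effect and \(\delta_i\, t\) a country-specific linear time trend, so that \(\beta\) is identified from within-country deviations of temperature around each country's own trend.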

Given that hot years saw more conflict in the 1980s and 1990s, should we expect a hotter Africa to have more conflict in the future? To some extent this depends on why hot years saw more conflict in the past. The authors note that hotter temps can depress agricultural productivity by evaporating more water and speeding up crop development, and they view this as the most likely channel by which hot temperatures lead to more conflict. They also note that hot weather has been shown in other settings to increase violent crime and make people less productive, and they admit that they can't rule out these channels in favor of the agricultural productivity story. If it's a matter of agricultural productivity, then adjusting farming techniques or improving social safety nets could avoid some of the projected conflict; if the main issue is that especially hot weather makes people rash and violent then there may not be much to do other than reverse the global warming process (and possibly provide air conditioning to potential insurgents). Overall the "policy implication" from the paper seems to be that changes that should happen anyway now seem more urgent.

To be more confident about the paper's projection I'd like to see more detail about the mechanism -- even just anecdotal evidence. I'd also like to hear about whether politics in Africa seems to have changed in any way that would make a model of weather and conflict from the 1980s and 1990s less applicable now and in the future.

Posted by Andy Eggers at 7:31 AM

October 28, 2009

Physics of politics

A physicist recently emailed me asking if I could help him access election data; he sent me one of his papers, which (to my astonishment) began "Most of the empirical electoral studies conducted by physicists . . .", followed by a string of citations. I had no idea physicists were studying elections! I suppose I should have known; from what my biologist friend tells me, physicists have been colonizing his field the way economists have colonized much of social science. So I guess politics was next.

Reading a few articles in the "physics of politics" as a political scientist, one has the sense of observing an alternate universe. For example: a paper on the effect of election results on party membership in Germany that has no references to work outside of physics; that features many exotic (to me at least) terms like Wegscheider potentials, the Sznajd model, and the Kronecker symbol; and that takes a time-series approach to causation that I suspect would be unacceptable to most reviewers in political science and economics these days.

In general, it's clear that physicists doing work on political phenomena (or "sociophysics" more generally) are primarily interested in exploring the individual-level social interactions that might underpin the macro-order we observe in, e.g., regularities in turnout or vote share distributions. As such, political institutions (which are the major preoccupation of political scientists) necessarily disappear from the model and are typically not even mentioned, even when they would seem to be of first-order importance in explaining a particular phenomenon. (Another example of the alternate universe: a paper that argues that party vote shares in Indonesia follow a power law, but which does not describe or mention the electoral system.) These omissions seem foolish on first reading, but it's clear that they reflect a different choice of explanatory variable: physicists seek their explanations in micro-interactions, and we seek them primarily in political institutions. It's probably both of course, but models can only be so complex.

Despite my overall sense of disorientation in reading these papers, there were also somewhat surprising moments of familiarity. Physics heavily influenced economics in an earlier period of colonization, and much of what we read in economics and political science descended from those models. In reading these newer physics papers, there is therefore a sense of distant kinship, the knowledge of a common ancestor several generations back.

I wonder about the scope for collaboration between physicists and social scientists. Based on my admittedly very cursory reading of one area in which physicists have ventured, it's hard to know whether the potential gains from trade are sufficient to overcome the apparent difference in goals. For all I know there already is a lot of productive collaboration going on -- if you know of something interesting, share it in the comments!

Posted by Andy Eggers at 6:58 AM

October 5, 2009

Engaging Data Forum at MIT

There is a lot of discussion around IQSS these days about managing personal data -- both best practices for researchers and what policy should be at the university and government levels. I just heard about a very interesting-looking event at MIT that engages a bunch of these issues; today is apparently the last day for early registration.

First International Forum on the Application and Management of Personal Electronic Information

October 12-13, 2009

Massachusetts Institute of Technology

http://senseable.mit.edu/engagingdata/

The Engaging Data: First International Forum on the Application and Management of Personal Electronic Information is the launching event of the Engaging Data Initiative, which will include a series of discussion panels and conferences at MIT. This initiative seeks to address the issues surrounding the application and management of personal electronic information by bringing together the main stakeholders from multiple disciplines, including social scientists, engineers, manufacturers, telecommunications service providers, Internet companies, credit companies and banks, privacy officers, lawyers, watchdogs, and government officials.

The goal of this forum is to explore the novel applications for electronic data and address the risks, concerns, and consumer opinions associated with the use of this data. In addition, it will include discussions on techniques and standards for both protecting and extracting value from this information from several points of view: what techniques and standards currently exist, and what are their strengths and limitations? What holistic approaches to protecting and extracting value from data would we take if we were given a blank slate?

Posted by Andy Eggers at 8:05 AM

May 20, 2009

Debates on government transparency websites

A few weeks ago my friend Aaron Swartz wrote a blog post called Transparency is Bunk, arguing that government transparency websites don't do what they're supposed to do, and in fact have perversely negative effects: they bury the real story in oceans of pro forma data, encourage apathy by revealing "the mindnumbing universality of waste and corruption," and lull activists into a false sense of accomplishment when occasional successes occur. It's a particularly powerful piece because Aaron uses the platform to announce he's done working on his own government transparency site (watchdog.net). The piece appears to have caused a stir in government transparency/hacktivist circles, where Aaron is pretty well known.

On looking back at it I think Aaron's argument (or rant, more accurately) against the transparency websites is not very strong: indeed, data overload, apathy, and complacency are all dangers these efforts face, but that shouldn't have come as a surprise.

I had two other responses particular to my perch in academia. First, there is some good academic research showing that transparency works, although the evidence on the effectiveness of grassroots watchdogging is less strong than the evidence on auditing from e.g. Ferraz and Finan on Brazilian municipalities (QJE 2008, working paper version) or Olken's field experiment in Indonesia (JPE 2008, working paper version).

Second, my own work and that of other academics benefits greatly from these websites. I have a project right now on the investments of members of Congress (joint with Jens Hainmueller) that is possible only because of websites like the ones Aaron criticizes. I think this paper is going to be useful in helping watchdogs understand how Congress invests and whether additional regulation is a good idea, and it would be a shame if the funders of these sites listened to Aaron and shut them down.

I do agree with Aaron that professional analysis may be better than grassroots citizen activism in achieving the goals of the transparency movement. Sticking with the example of the congressional stock trading data I'm using, I suspect that not much useful watchdogging came out of the web interface that OpenSecrets provides for the investments data. While it may be interesting to know that Nancy Pelosi owns stock in company X, it's hard to get any sense of patterns of ownership across members and how these investments relate to political relationships between members and companies. This is what our paper tries to do. It takes a ton of work, far more than an investigative journalist is going to put in. We do it because of the rewards of publishing interesting and original and careful research, and also because these transparency websites have made it much more manageable: OpenSecrets.org converted the scanned disclosure forms into a database and provided lobbying data, and GovTrack provided committee and bill info, as well as an API linking company addresses to congressional districts. Most of the excitement around these websites seems to center on grassroots citizen activism, but their value to academic research (and the value of academic research to government accountability) should not be overlooked.

Posted by Andy Eggers at 10:53 PM

May 10, 2009

Dobbie and Fryer on Charter Schools in the Harlem Children's Zone

David Brooks wrote a column a few days ago about Will Dobbie and Roland Fryer's working paper on the Harlem Children's Zone charter schools, which the authors report dramatically improved students' performance, particularly in math. Looking at the paper, I think it's a nice example of constructing multiple comparisons to assess the effect of a program and to do some disentangling of mechanisms.

The program they study is enrollment in one of the Promise Academy elementary and middle schools in the Harlem Children's Zone, a set of schools that offer extended class days, provide incentives for teacher and student performance, and emphasize a "culture of achievement." The authors assess the schools' effect on student test scores by comparing the performance of students at the schools with that of other students. The bulk of the paper is concerned with how to define this comparison group, and the authors pursue two strategies:

  • First, they examine cases where too many students applied to the school and slots were handed out by lottery; the comparison of lottery winners and non-winners (and the accompanying IV estimate in which attending the school at some point is the treatment) allows them to estimate the effect of attending these schools under nearly experimental conditions, at least in years when lotteries were held.
  • Second, they compare students who were age-eligible and not age-eligible for the program, and students who were in the schools' recruitment area vs not in the schools' recruitment area. (This boils down to an IV in which the interaction of cohort and address instruments for attendance at the school.)

The estimated effect is very large, particularly for math. Because the estimates are based on comparisons both within the HCZ and between HCZ and non-HCZ students, the authors can speculate somewhat about the relative importance of the schooling itself vs other aspects of the HCZ: they tentatively suggest that the community aspects must not drive the results, because students who attended the schools but lived outside the Zone did just as well as those living inside it.
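
For the lottery comparison in the first bullet above, the IV logic can be written out by hand in a few lines; this is a minimal simulated sketch with hypothetical variable names, not the authors' code or data:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    z = rng.integers(0, 2, n)                                # won the admissions lottery
    d = np.where(z == 1, rng.random(n) < 0.8,                # ever attended a Promise Academy
                 rng.random(n) < 0.1).astype(int)
    y = 0.2 * d + rng.normal(size=n)                         # test-score outcome

    # Wald / IV estimate: effect of lottery assignment on scores,
    # scaled by the effect of assignment on attendance (the first stage).
    itt = y[z == 1].mean() - y[z == 0].mean()
    first_stage = d[z == 1].mean() - d[z == 0].mean()
    print(round(itt / first_stage, 2))                       # recovers roughly the 0.2 used above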

Overall I thought it was a nice example of careful comparisons in a non-experimental situation providing useful knowledge. I don't really know this literature, but it seems like a case where good work could have a big impact.

Posted by Andy Eggers at 9:40 AM

April 13, 2009

Alley-oops as workplace cooperation

Here's a paper for the "high internal, low external validity" file (via Kevin Lewis):

Interracial Workplace Cooperation: Evidence from the NBA

Joseph Price, Lars Lefgren & Henry Tappen
NBER Working Paper, February 2009

Abstract:
Using data from the National Basketball Association (NBA), we examine whether patterns of workplace cooperation occur disproportionately among workers of the same race. We find that, holding constant the composition of teammates on the floor, basketball players are no more likely to complete an assist to a player of the same race than a player of a different race. Our confidence interval allows us to reject even small amounts of same-race bias in passing patterns. Our findings suggest that high levels of interracial cooperation can occur in a setting where workers are operating in a highly visible setting with strong incentives to behave efficiently.

Posted by Andy Eggers at 6:51 PM

March 18, 2009

Breastfeeding Research and Intention to Treat

The current issue of The Atlantic features an interesting story by Hanna Rosin called "The Case Against Breastfeeding." Rosin argues that the health benefits of breastfeeding have been overstated by advocates and professional associations and that, given the costs (in mothers' time and independence), mothers should not be made to feel guilty if they decide not to breastfeed for the full recommended period. One of her key points is that observational studies overstate the benefits of breastfeeding by failing to adequately adjust for background differences between mothers who breastfeed and those who don't. Observable differences, reports Rosin, are considerable: breastfeeding is more common among women who are "white, older, and educated; a woman who attended college, for instance, is roughly twice as likely to nurse for six months." In the course of making her argument Rosin provides a very nice layman's treatment of the difficulties of learning from observational studies; I think the article could be useful in teaching statistical concepts to a non-technical audience, although the politics of the issue might overwhelm the statistical content.

I followed up a bit on one experimental study she mentions in which researchers implemented an encouragement design in Belarus: new mothers in randomly selected clinics who were already breastfeeding were exposed to an intervention strongly encouraging them to nurse exclusively for several months, and the health outcomes of those babies as well as babies in non-selected clinics were tracked for several years. Rosin reports that this study found an effect of breastfeeding on gastrointestinal infection and infant rashes (and possibly IQ), but no effect on a host of other outcomes (weight, blood pressure, ear infections, or allergies).

I read what appears to be the first paper from the study (published in 2001 in JAMA), which reported that the intervention reduced GI infection and rashes. One thing that surprised me was that all health effects were reported in terms of "intention-to-treat," i.e. a raw comparison of outcomes in the treatment group and the control group, irrespective of whether the mother actually breastfed. The intervention increased the proportion of mothers doing any breast-feeding at 6 months from .36 to .5, so we know that whatever effects are found via ITT understate the impact of breastfeeding itself (because they measure the impact of being assigned to treatment, which changes breastfeeding status only for some mothers). (The authors know this too, and they raise the point in a rejoinder.)

The standard approach I learned is to estimate a "complier average treatment effect" by essentially dividing the ITT by the effect of treatment assignment on treatment status, but the study appears not to do this. (The CATE for GI infection, according to my back-of-the-envelope calculation and assuming "no defiers," is about -.3, i.e. about a 30% decrease in the probability of infection for mothers who were induced to breastfeed by the intervention.) I suppose focusing on ITTs could be common in epidemiology because it addresses the policymaker's question of whether it's worth it to implement a similar program, assuming compliance rates would be similar. But for a mother thinking about what to do, the CATE gives much better information about whether or not to breast-feed.
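
In notation, the back-of-the-envelope calculation is just the Wald estimator, using the compliance rates quoted above (the ITT term stands in for whatever effect the JAMA paper reports for a given outcome):

\[
\widehat{\text{CATE}} \;=\; \frac{\text{ITT}}{\Pr(\text{breastfed}\mid \text{assigned to intervention}) - \Pr(\text{breastfed}\mid \text{control})} \;=\; \frac{\text{ITT}}{0.50 - 0.36} \;\approx\; 7 \times \text{ITT}.
\]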

Posted by Andy Eggers at 7:44 AM

March 6, 2009

xkcd on Correlation and Causation

XKCD Pic

Posted by Andy Eggers at 1:09 PM

March 4, 2009

Follow-up on Robins' Talk ("A Bold Vision of Artificial Intelligence and Philosophy")

A few blog readers asked for more information about Jamie Robins' talk today and the "pinch of magic and miracle" he promised in the abstract. I wanted to offer my non-expert report on the presentation, particularly because Jamie and his coauthors don't yet have a paper to circulate.

Jamie organized the talk around a research scenario in which five variables are measured in trillions of independent experiments and the task is to uncover the causal process relating the variables. (His example involved gene expression.) He led us through an algorithm that he claimed could accomplish this feat (in some circumstances) with no outside, substantive input. The algorithm involves looking for conditional independencies in the data, not just in its original form but also under various transformations in which one or more independencies are induced by inverse probability weighting, after which we check whether others exist. For some data generating processes, this algorithm will hit on conditional independencies such that (under a key assumption, which he was coy about until the end of the talk) the causal model will be revealed -- the ordering and all of the effect sizes.

The key assumption is "faithfulness," which states that when two variables are found to be conditionally independent in the data, we can conclude that there is no causal arrow between them (i.e. we can rule out that there is an arrow between them that is perfectly offset by other effects). Without that assumption we can't infer the causal model from a joint density, but with it we can -- and the point of Jamie's talk was that, in the "star worlds" in which independencies have been induced by reweighting, even more information can be gleaned from the joint density than has been recognized.

All of this may seem surprising to people who have followed the debates over causal modeling and "causal discovery," much of which has centered around the work of Spirtes, Glymour, and Scheines. In these debates, Jamie has been (by his own admission) a consistent critic of the faithfulness assumption and has insisted that substantive knowledge, not conditional independence in sampled data, is the way to draw causal models. Rest assured, he has not changed his position. (I think he described the embrace of the faithfulness assumption by mainstream statistics as "probably insane" at one point in the talk.) The point of the talk was not to defend faithfulness, but rather to show that it implies a lot more than was realized by researchers who currently employ it to uncover causal structure from joint densities.

Anyone else who wants to fill in or correct my account, please chime in.

Posted by Andy Eggers at 10:19 PM

February 17, 2009

Social pressure and biased refereeing in Italian soccer

I recently came across a paper by Per Pettersson-Lidbom and Mikael Priks that uses a neat natural experiment in Italian soccer to estimate the effect of stadium crowds on referees' decisions. After a bout of hooliganism in early February 2007, the Italian government began requiring soccer stadiums to fulfill certain security regulations; those stadiums that did not meet the requirements would have to hold their games without spectators. As a result, 25 games were played in empty stadiums that month, allowing Pettersson-Lidbom and Priks to examine game stats (like this) and see whether referees were more disposed toward the home team when the bleachers were filled with fans than when the stadium was empty. Looking at fouls, yellow cards, and red cards, the authors find that referees were indeed more likely to penalize the home team (and less likely to penalize the away team) in an empty stadium. There does not appear to be any effect of the crowd on players' performance, which suggests that the referees were reacting to the crowd and not to the players (and that fans should save their energy for haranguing the refs).

One of the interesting things in the results is that refs showed no favoritism toward the home team in games with spectators -- they handed out about the same number of fouls and cards to the home and away teams in those games. The bias shows up in games without spectators, where they hand out more fouls and cards to the home team. (The difference is not statistically significant in games with spectators but is in games without spectators.) If we are to interpret the empty stadium games as indicative of what refs would do if not subjected to social pressure, then we should conclude from the data that refs are fundamentally biased against the home team and only referee in a balanced way when their bias is balanced by crowd pressure. This would indeed be evidence that social pressure matters, but it seems unlikely that refs would be so disposed against the home team. A perhaps more plausible interpretation of the findings is that Italian refs are generally pretty balanced and not affected by crowds, but in the "empty stadium" games they punished the home team for not following the rules on stadium security. This interpretation of course makes the finding less generally applicable. In the end the example highlights the difficulty of finding "natural experiments" that really do what you want them to do -- in this case, illustrate what would happen if, quite randomly, no fans showed up for the game.

Posted by Andy Eggers at 8:25 AM

February 1, 2009

Visualizing partisan discourse

Burt Monroe, Michael Colaresi, and our own Kevin Quinn have written an interesting paper (forthcoming in Political Analysis) assessing methods for selecting partisan features in language, e.g. which words are particularly likely to be used by Republicans or Democrats on a given topic. They have also provided a dynamic visualization of partisan language in the Senate on defense issues between 1997 and 2004 (screenshot below).

The most striking feature coming out of the visualization is that language on defense went through an unpolarized period leading up to 9/11 and even for several months afterward, but that polarized language blossomed in the leadup to the Iraq War and through the end of the period they examine, with Republicans talking about what they thought was at stake ("Saddam", "Hussein", "oil", "freedom", "regime") and the Democrats emphasizing the process ("unilateral", "war", "reconstruction", "billions"). (Link to visualization, a QuickTime movie.)

fightingwords.png

Posted by Andy Eggers at 8:36 AM

January 22, 2009

Studying the 2008 primaries with prediction markets: Malhotra and Snowberg

With Obama now in office the rest of the country may be about ready to move on from the 2008 election, but political scientists are of course still finding plenty to write about. Neil Malhotra and Erik Snowberg recently circulated a working paper in which they use data from political prediction markets in 2008 to examine two key questions about presidential primaries: whether primaries constrain politicians from appealing to the middle of the electorate and whether states with early primaries play a disproportionately large role in choosing the nominee. It's a very short and preliminary working paper that applies some novel methods to interesting data. Ultimately the paper can't say all that much about these big questions, not just because 2008 was an unusual year but also because of the limitations of prediction market data and the usual problems of confounding. But there is some interesting stuff in the paper and I expect it will improve in revision -- I hope these comments can help.

The most clever insight in the paper is that you can combine data from different prediction markets to estimate an interesting conditional probability -- the probability that a primary candidate will win the general election conditional on winning the nomination. (If p(G) is the probability of winning the general election and p(N) the probability of winning the nomination -- both of which are evident in prediction market contract prices -- then p(G|N), the probability of winning the general election if nominated, can be calculated as p(G)/p(N).) In the first part of the paper, the authors focus on how individual primaries in the 2008 election affected this conditional probability for each candidate. This is interesting because classic theories in political science posit that primary elections force candidates to take positions that satisfy their partisans but hurt their general election prospects by making it harder for them to appeal to the electoral middle. If that is the case, then ceteris paribus one would expect that the conditional election probabilities would have gone down for Obama and Clinton each time it looked like the primary season would become more drawn out -- which is what happened as the results of several of the primaries rolled in.
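
As a toy example (with made-up contract prices, not the actual 2008 market data), the calculation and the quantity Malhotra and Snowberg track -- the movement in this ratio around a primary -- look like this:

    # Hypothetical prediction-market prices (interpreted as probabilities)
    # for one candidate, before and after a given primary.
    p_general_before, p_nomination_before = 0.30, 0.55
    p_general_after,  p_nomination_after  = 0.33, 0.65

    # p(G|N) = p(G) / p(N), since winning the general requires winning the nomination.
    cond_before = p_general_before / p_nomination_before
    cond_after = p_general_after / p_nomination_after

    print(round(cond_before, 3), round(cond_after, 3), round(cond_after - cond_before, 3))
    # 0.545 0.508 -0.038 -> the market now rates the candidate as weaker if nominated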

As it turns out, p(G|N) didn't move much in most primaries; if anything, it went up when the primary season seemed likely to extend longer (e.g. for Obama in New Hampshire). Perhaps this was because of the much-talked-about positive countervailing factors -- i.e. the extended primary season actually sharpened the candidates' electoral machines and increased their free media exposure. Of course, Malhotra and Snowberg have no way of knowing whether the binding effect of primaries exists and was almost perfectly counterbalanced by these positive factors, or whether none of these factors really mattered very much.

There is yet another possibility, which is that conditional probabilities did not move much for most primaries because most primaries did not change the market's view of how long the primary season would be. Knowing how the conditional probability changed during a particular primary only tells us something about whether having more primaries helps or hurts candidates' general election prospects if that primary changed people's expectations about how long the primary season would be. There were certainly primaries where this was the case (New Hampshire and Ohio/Texas come to mind) but for most of the primaries there was very little new information about how many more primaries would follow. Malhotra and Snowberg proceed as if they were looking for an average effect of a primary taking place on a candidate's conditional general election prospects, but if they want to talk about how having more primaries affects candidates' electability in the general election, they need to focus more squarely on cases where expectations about the length of the primary season actually changed (and, ideally, not much else changed). I would say the March Ohio/Texas primary was the best case of that, and at that time Barack Obama's p(G|N) dropped by 3 points -- a good indication that the market assumed that the net effect of a longer season on general election prospects was negative. (Although of course that primary also presumably revealed new information about whether Obama would be able to carry Ohio in the general election -- it's hard to disentangle these things.)

The second part of the paper explicitly considers the problem of assessing how "surprised" the prediction markets were in particular primaries (without explaining why this was not an issue in the first part), and employs a pretty ad hoc means of upweighting effect estimates for the relatively unsurprising contests. Some kind of correction makes sense but it seemed to me that the correction was so important in producing their results that it should be explained more fully in further revisions of the paper.

So to sum up, I liked the use of prediction markets to estimate the conditional general election probability for a candidate at a point in time, and I think it's worth getting some estimates of how particular events moved this probability. I think at this stage the conclusions are a bit underdeveloped and oversold, considering how many factors are at play and how unclear it is what information each primary introduced. But I look forward to future revisions.

Posted by Andy Eggers at 10:18 AM

January 16, 2009

Amazon Mechanical Turk for Data Entry Tasks

Yesterday I tried using Amazon's Mechanical Turk service for the first time to save myself from some data collection drudgery. I found it fascinating. For the right kind of task, and with a little bit of setup effort, it can drastically reduce the cost and hassle of getting good data compared to other methods (such as using RAs).

Quick background on Mechanical Turk (MTurk): The service acts as a marketplace for jobs that can be done quickly over a web interface. "Requesters" (like me) submit tasks and specify how much they will pay for an acceptable response; "Workers" (known commonly as "Turkers") browse submitted tasks and choose ones to complete. A Requester could ask for all sorts of things (e.g. write me a publishable paper), but because you can't do much to filter the Turkers and they aren't paid for unacceptable work, the system works best for tasks that can be done quickly and in a fairly objective way. The canonical tasks described in the documentation are discrete, bite-sized tasks that could almost be done by a computer -- indicating whether a person appears in a photo, for example. Amazon bills the service as "Artificial Artificial Intelligence," because to the Requester it seems as if a very smart computer were solving the problem for you (while in fact it's really a person). This is also the idea behind the name of the service, a reference to an 18th century chess-playing automaton that actually had a person inside (known as The Turk).

The task I had was to find the full text of a bunch of proposals from meeting agendas that were posted online. I had the urls of the agendas and a brief description of each proposal, and I faced the task of looking up each one. I could almost automate the task (and was sorely tempted), but it would require coding time and manual error checking. I decided to try MTurk.

The ideal data collection task on MTurk is the common situation where you have a spreadsheet with a bunch of columns and you need someone to go through and do something pretty rote to fill out another column. That was my situation: for every proposal I have a column with the url and a summary of what was proposed, and I wanted someone to fill in the "full text" column. To do a task like this, you need to design a template that applies to each row in the spreadsheet, indicating how the data from the existing columns should appear and where the Turker should enter the data for the missing column. Then you upload the spreadsheet and a separate task is created for each row in the spreadsheet. If everything looks good you post the tasks and watch the data roll in.

To provide a little more detail: Once you sign up to be a Requester at the MTurk website, you start the process of designing your "HIT" (Human Intelligence Task). MTurk provides a number of templates to get you started. The easiest approach is to pick the "Blank Template," which is very poorly named, because the "Blank Template" is in fact full of various elements you might need in your HIT; just cut out the stuff you don't need and edit the rest. (Here it helps to know some html, but for most tasks you can probably get by without knowing much.) The key thing is that when you place a variable in the template (e.g. ${party_id}), it will be filled by an entry from your spreadsheet, based on the spreadsheet's column names. So a very simple HIT would be a template that says

Is this sentence offensive? ${sentence}

followed by buttons for "yes" and "no" (which you can get right from the "Blank Template"). If you then upload a CSV with a column entitled "sentence" and 100 rows, you will generate 100 HITs, one for each sentence.
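
Conceptually, the substitution works like a mail merge; here is a tiny Python illustration of that logic (just an analogy for how MTurk fills the template -- it is not MTurk's actual code or API):

    import csv, io, string

    # A HIT template with a ${sentence} placeholder, as in the example above.
    template = string.Template("Is this sentence offensive? ${sentence}")

    # A CSV whose column name matches the placeholder; each row becomes one HIT.
    data = io.StringIO("sentence\nYou are a genius.\nYour code is terrible.\n")
    for row in csv.DictReader(data):
        print(template.substitute(row))
    # Is this sentence offensive? You are a genius.
    # Is this sentence offensive? Your code is terrible.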

It was pretty quick for me to set up my HIT template, upload a CSV, and post my HITs.

Then the real fun begins. Within two minutes the first responses started coming in; I think the whole job (26 searches -- just a pilot) was done in about 20 minutes. (And prices are low on MTurk -- it cost me $3.80.) I had each task done by two different Turkers as a check for quality, and there was perfect agreement.

One big question people have is, "Who are these people who do rote work for so little?" You might think it was all people in developing countries, but it turns out that a large majority are bored Americans. There's some pretty interesting information out there about Turkers, largely from Panos Ipeirotis's blog (a good source on all things MTurk in fact). Most relevant for understanding Turkers is a survey of Turkers he conducted via (of course) MTurk. For $.10, Turkers were asked to write why they complete tasks on MTurk. The responses are here. My takeaway was that people do MTurk HITs to make a little money when they're bored, as an alternative to watching TV or playing games. One man's drudgery is another man's entertainment -- beautiful.

Posted by Andy Eggers at 9:49 AM

December 11, 2008

About those scatterplots . . .

Amanda Cox from the NYT graphics department gave a fun talk yesterday about challenges she and her colleagues face.

One of the challenges she discussed is statistical uncertainty -- how to represent confidence intervals on polling results, for example, while not sacrificing too much clarity. Amanda provided a couple of examples where the team had done a pretty poor job of reporting the uncertainty behind the numbers; in some cases doing it properly would have made the graphic too confusing for the audience and in others there may have been a better way.

She also talked about "abstraction," by which I think she meant the issue of how to graphically represent multivariate data. She showed some multivariate graphics the NYT had produced (the history of oil price vs. demand, growth in the CPI by categorized component) that I thought were quite successful, although some in audience disagreed about the latter figure.

Amanda also showed the figure that I reproduced and discussed in an earlier post, in which I reported that the NYT graphics people think that the public can't understand scatterplots. Amanda disagrees with this (she said it annoys her how often people mention that point to her) and showed some scatterplots the NYT has produced. (She did say she thinks people understand scatterplots better when there is an upward slope to the data, which was interesting.)

The audience at the talk, much of which studies the media in some capacity and nearly all of which reads the NYT, seemed hungry for some analysis of the economics behind the paper's decision to invest so much in graphics. (Amanda said the paper spends $500,000 a month on the department.) Amanda wasn't really able to shed too much light on this, but said she felt very fortunate to be at a paper that lets her publish regression trees when, at many papers, the graphics team is four people who have their hands full producing "fun facts" sidebars and illustrations of car crash sites.

Posted by Andy Eggers at 8:37 AM

November 17, 2008

Interest in computer science is volatile

Reading an NYT article about the dearth of women in computer science, I was struck by this figure, which shows the percentage of college freshmen who say they might major in computer science. The article focuses on the fact, clearly visible from the figure, that women have become increasingly underrepresented in computer science since the 1970s and early 1980s, when computer science really started taking off as a discipline.

computer_science.png

What also struck me, however, was how volatile the baseline interest in the field has been. I was in college in the late 1990s, when majoring in CS was definitely viewed as a practical and lucrative thing to do, and I'm not surprised to see that interest has fallen off since then. But the fall-off shown here was much steeper than I would have imagined. Have enrollments declined at that rate as well?

Even more surprising to me was that there had been an earlier, equally dramatic boom-and-bust cycle. I knew from watching Triumph of the Nerds that PC sales really took off around that time, and I know about movies like Tron and WarGames, which came at the peak of the earlier wave shown here. But I didn't know there was such a steep drop-off in interest then either. Was that one because of the collapse of a tech bubble too?

Two more questions:

Does anyone want to chime in on why women have become less and less represented in CS since the early 1980s? My thought was that professionalization of education in general, and hardening of ideas about who works in the IT profession, would be leading causes. There were a few theories in the NYT article (subtle messages from families, the rise of a very male gaming culture) but it seemed like there was a lot more to be said.

Do any other disciplines have enrollments this volatile?

Posted by Andy Eggers at 5:00 PM

October 29, 2008

Bafumi and Herron on whether the US government is representative

Amid the name-calling, insinuation and jingoism of this political season it is easy to get a bit depressed about the democratic process. Joe Bafumi and Michael Herron have an interesting working paper that is cause for some comfort. The paper, entitled "Preference Aggregation, Representation, and Elected American Political Institutions," assesses the extent to which our federal political institutions are representative, in the sense that elected officials have similar views to those of their constituents. They do this by lining up survey questions from the Cooperative Congressional Elections Study (recently discussed in our weekly seminar by Steve Ansolabehere) alongside similar roll call votes recorded for members of Congress, as well as President Bush's positions on a number of pieces of legislation. There are enough survey questions to be able to place the survey respondents on an ideological scale (using Bayesian ideal point estimation), enough pieces of legislation to place the members of Congress and the President on an ideological scale, and enough survey questions that mirrored actual roll call votes to bring everyone together on a unified scale.

Overall, the authors find that the system is pretty effective at aggregating and representing voters' preferences. Members of Congress are more extreme than the constituencies they represent (perhaps because they represent partisans in their own districts), but the median member of a state's delegation is usually pretty close to the median voter in that state. Since the voters were surveyed in 2006, the paper is able to look at how the election affected the ideological proximity of government to the voters, and as one would hope Bafumi and Herron find that government moved somewhat closer to the voters as a result of the legislative reshuffling.

Below is one of the interesting figures from the paper. The grey line shows the density of estimated ideal points among the voters (i.e. CCES survey respondents); the green and purple solid lines are the densities of estimated ideal points among members of the current House and Senate. The arrows show the location of the median member of the current and previous House and Senate, the median American at the time of the 2006 election (based on the survey responses), and President Bush. As you can see, before the 2006 election the House and Senate were both to the right of the median American (as was President Bush); after the Democratic sweep Congress moved closer to the median American. Members of Congress are more partisan than the voters throughout, although this seems to be more the case on the right than the left.

herron_bafumi.png

Posted by Andy Eggers at 9:45 AM

October 22, 2008

Useful metric for comparing two distributions?

In reading Bill Easterly's working paper "Can the West Save Africa?," I came across an interesting metric Easterly uses to compare African nations with the rest of the world on a set of development indicators. The metric is, "Given that there are K African nations, what percent of the K lowest scoring countries were African?" I don't think I've ever seen anyone use that particular metric, but maybe someone has. Does it have a name? Does it deserve one?

Generally, looking at the percent of units below (or above) a certain percentile that have some feature is a way of describing the composition of that tail of the distribution. What's interesting about using a cutoff corresponding to the total number of units with that feature is that it produces an intuitive measure of overlap of two distributions: it gives us a rough sense of how many countries would have to switch places before all the worst countries were African or, put differently, before all of the African countries were in the worst group. It reminds me a bit of measures of misclassification in machine learning, where here the default classification is, "All the worst countries are African."
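
Here is a minimal version of the computation in Python, with toy values standing in for Easterly's actual indicators and country list:

    def share_of_worst_that_are_african(scores, african, lower_is_worse=True):
        """Of the K lowest-scoring countries, where K is the number of African
        countries, what share are African?"""
        k = sum(1 for country in scores if country in african)
        ranked = sorted(scores, key=scores.get, reverse=not lower_is_worse)
        return sum(1 for country in ranked[:k] if country in african) / k

    # Toy data: three African countries (A1-A3) and three others (B1-B3).
    life_expectancy = {"A1": 50, "A2": 55, "A3": 62, "B1": 58, "B2": 70, "B3": 75}
    print(share_of_worst_that_are_african(life_expectancy, {"A1", "A2", "A3"}))
    # 0.666... -- two of the three lowest scores belong to African countries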

Needless to say, the numbers were bleak -- 88% for life expectancy, 84% for percent of population with HIV, 75% for infant mortality.

Posted by Andy Eggers at 11:02 PM

October 15, 2008

Alfred Marshall, apologist for blog readers

Like many people I know, I often find it hard to stay on task and avoid the temptations of the internet while I work. Email, blogs, news of financial meltdown -- I find myself turning to these distractions in between spurts of productivity, knowing that I would get more done if I just turned off the wireless and kept on task for longer stretches of time.

Well, those of us who have trouble giving up our blogs and other internet distractions may have an unlikely enabler in Alfred Marshall, the great economist. When he was seventeen, Marshall observed an artist who took a lengthy break after drawing each element of a shop window sign. As he later recounted, the episode shaped his own productivity strategy into something that sounds vaguely similar to my own routine:

That set up a train of thought which led me to the resolve never to use my mind when it was not fresh, and to regard the intervals between successive strains as sacred to absolute repose. When I went to Cambridge and became full master of myself, I resolved never to read a mathematical book for more than a quarter of an hour at a time without a break. I had some light literature always by my side, and in the breaks I read through more than once nearly the whole of Shakespeare, Boswell's Life of Johnson, the Agamemnon of Aeschylus (the only Greek play I could read without effort), a great part of Lucretius and so on. Of course I often got excited by my mathematics, and read for half an hour or more without stopping, but that meant that my mind was intense, and no harm was done.

Now, somehow I doubt that Marshall would consider the NYT op-ed pages to be "light literature" on par with Boswell, or that he would agree that watching incendiary political videos at TalkingPointsMemo.com qualifies as "absolute repose." But never mind that. Alfred Marshall told me I shouldn't work for more than fifteen minutes without distractions!

Posted by Andy Eggers at 8:06 AM

October 8, 2008

Information and accountability -- Snyder and Stromberg

Jim Snyder and David Stromberg have produced a very interesting working paper called "Press Coverage and Political Accountability." It's a big paper and I haven't processed the whole thing, but I think it is an important and clever paper that speaks to big issues about the media and democratic accountability.

The goal of the paper is to trace the cycle of political accountability: politicians go about their jobs, the media reports on the politicians, voters consume the news and become informed about the politicians, and politicians shape their behavior to respond to or anticipate pressure from voters. It is a difficult thing to measure any of the effects implied by this cycle (e.g. how much do politicians respond to voter pressure? how much does media coverage respond to actual politician behavior? how much do voters learn from the news?) for the usual endogeneity reasons endemic in social science. It usually takes a very careful research design to say something convincing about any part of this cycle. Here, the cleverness comes in the observation that the amount of news coverage devoted to a member of Congress depends to some extent on the congruence between congressional district boundaries and media market boundaries. This congruence is high if most people in a congressional district read newspaper X, and most of paper X's readers are in that congressional district. It can be low in bigger cities, particularly cities located on state boundaries, and in areas with a lot of gerrymandering.

The innovation of the paper is to use the degree of fit between congressional districts and media markets as an exogenous source of variation in how much political news voters are exposed to. The authors look to see whether their measure of congruence is correlated with how much media coverage is devoted to the member of Congress, how much voters know about their member of Congress, and how energetic and effective members of Congress appear to be in carrying out their jobs. The correlations are surprisingly strong at each point in the cycle.
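
One natural way to formalize the congruence idea (my own stylized version, based on the description above rather than the paper's exact definition) is, for a district d and the newspaper markets m that overlap it,

\[
\text{Congruence}_d \;=\; \sum_m \big(\text{share of district } d\text{'s residents in market } m\big)\times\big(\text{share of market } m\text{'s readers who live in district } d\big),
\]

which is near 1 when one paper blankets the district and that district accounts for most of the paper's readers, and near 0 when readership is split across papers whose audiences mostly live elsewhere.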

I kept expecting to see an instrumental variables regression, where congruence would serve as an instrument for, e.g., voter information in its effect on member discipline. Instead they kept providing the reduced form regression for everything, which is fine. In a sense there are more IV regressions here than you would know what to do with, since congruence could be thought of as an instrument in estimating any subsequent effect.

Here's the part of their abstract where they describe their findings:

Exploring the links in the causal chain of media effects -- voter information, politicians' actions and policy -- we find statistically significant and substantively important effects. Voters living in areas with less coverage of their U.S. House representative are less likely to recall their representative's name, and less able to describe and rate them. Congressmen who are less covered by the local press work less for their constituencies: they are less likely to stand witness before congressional hearings, to serve on constituency-oriented committees (perhaps), and to vote against the party line. Finally, this congressional behavior affects policy. Federal spending is lower in areas where there is less press coverage of the local members of congress.

Posted by Andy Eggers at 10:58 PM

October 4, 2008

Guest post: Politics kills!

A guest post from Marc Alexander of Harvard's Gov department, who blogs at Politics and Health:

Politics kills! A new study on traffic fatalities on election day...

A brilliant research report published in the Oct 2 issue of JAMA found that driving fatalities increase significantly on election day in the US. Donald Redelmeier of the University of Toronto and Robert Tibshirani of Stanford found that the hazard of being hurt or dying in a traffic accident rises on the day of the Presidential election. While the effect seems to be bipartisan (or non-partisan?), the risk is higher for men, for those in the Northeast, and for those who vote early in the day. To my knowledge, this is the best systematic evidence of the dark side of political participation in the US; despite all the benefits of active participation in keeping democracy alive, there also seem to be significant costs. Remember to vote, but be careful when driving or crossing the street this election season! The article was covered by Reuters and the New York Times here.
The original research report is available from JAMA here and is titled "Driving Fatalities on US Presidential Election Days." Here is the free excerpt from JAMA:

The results of US presidential elections have large effects on public health by their influence on health policy, the economy, and diverse political decisions. We are unaware of studies testing whether the US presidential electoral process itself has a direct effect on public health. We hypothesized that mobilizing approximately 50% to 55% of the population, along with US reliance on motor vehicle travel, might result in an increased number of fatal motor vehicle crashes during US presidential elections.

Posted by Andy Eggers at 12:35 PM

October 1, 2008

Timely research: Hopkins on the Wilder Effect

I first saw IQSS's own Dan Hopkins' paper on the Wilder effect this summer at the PolMeth conference. Jens and I agreed that, of all the research presented at the conference, this was probably the work most likely to interest journalists. It directly addresses the speculation that, because survey respondents are afraid to appear racist, polls overstate Barack Obama's level of support. Here's the abstract:

The 2008 election has renewed interest in the Wilder effect, the gap between the share of survey respondents expressing support for a candidate and the candidate's vote share. Using new data from 133 gubernatorial and Senate elections from 1989 to 2006, this paper presents the first large-sample test of the Wilder effect. It demonstrates a significant Wilder effect only through the early 1990s, when Wilder himself was Governor of Virginia. Although the same mechanisms could affect female candidates, this paper finds no such effect at any point in time. It also shows how polls' over-estimation of front-runners' support can exaggerate estimates of the Wilder effect. Together, these results accord with theories emphasizing how short-term changes in the political context influence the role of race in statewide elections. The Wilder effect is the product of racial attitudes in specific political contexts, not a more general response to under-represented groups.

In the last couple of weeks, I have twice found myself in conversations where someone brought up the idea that Obama will do worse than the polls suggest because of the "Wilder effect." It's nice to have some research at hand to speak to this.

Googling around I notice that Dan's paper has been covered by a ton of blogs, as well as the Washington Post and some other papers. Nice work, Dan.

Posted by Andy Eggers at 5:41 PM

September 24, 2008

Government as API provider

The authors of "Government Data and the Invisible Hand" provide some interesting advice about how the next president can make the government more transparent:

If the next Presidential administration really wants to embrace the potential of Internet-enabled government transparency, it should follow a counter-intuitive but ultimately compelling strategy: reduce the federal role in presenting important government information to citizens. Today, government bodies consider their own websites to be a higher priority than technical infrastructures that open up their data for others to use. We argue that this understanding is a mistake. It would be preferable for government to understand providing reusable data, rather than providing websites, as the core of its online publishing responsibility.

I've blogged here a couple of times about the role transparency-minded programmers and other private actors are playing in opening up access to government data sources. This paper draws the logical policy conclusion from what we've seen in the instances I blogged about: that third parties often do a better job of bringing important government data to the people than the government does. (For example, compare govtrack.us/opencongress.org with http://thomas.loc.gov.) The upshot of the paper is that the government should make it easier for those third parties to make the government websites look bad. By focusing on providing structured data, the government will save web developers some of the hassle involved in parsing and combining data from unwieldy government sources and reduce the time between the release of a clunky government site and the release of a private site that repackages the underlying data and combines it with new sources in an interesting way.

Of course, to the extent that government data is made available in more convenient formats, our work as academic researchers gets easier too, and we can spend more time on analysis and less on data wrangling. In fact, for people doing social science stats, it's really the structured data and not the slick front-end that is important (although many of the private sites provide both).

I understand that this proposal has been circulating for a while (anyone want to fill me in on the history?), and apparently both campaigns have been listening. It will be interesting to see whether these ideas lead to any change in the emphasis of government info policy.

Posted by Andy Eggers at 9:09 AM

September 18, 2008

Call for papers: the Midwest Poli Sci conference gets interdisciplinary

From Jeff Segal via Gary King, we get the following call for papers for the Midwest Political Science Conference. An interesting bit of news here is that the conference is introducing a registration discount for people outside of the discipline.

Ask your favorite political scientist what the biggest political science conference is, and she'll tell you it's the American Political Science Association. Ask her what the best political science conference is and she'll tell you it's the Midwest Political Science Association meeting, held every April in the beautiful Palmer House in Chicago.

The Midwest Political Science Association, like most academic associations, charges higher conference registration rates to nonmembers than to members. Hoping to continue to increase attendance by people outside of political science and related fields at its annual meeting, the Association will begin charging the lower (member) rate to registrants who 1) have academic appointments outside of political science or related fields (policy, public administration and political economy) and 2) do not have a PhD in political science or the same related fields.

In addition, the Association grants, on request, a substantial number of conference registration waivers for first time participants who are outside the discipline.

The call for papers for the 2009 meeting, due October 10, is at http://www.mpsanet.org/~mpsa/index.html.

Hope to see you in Chicago.

Sincerely,

Jeffrey Segal, President
Midwest Political Science Association

Posted by Andy Eggers at 6:41 AM

September 15, 2008

Are journal editors biased against certain kinds of authors?

In a working paper entitled "Can We Test for Bias in Scientific Peer Review?", Andrew Oswald proposes a method of detecting whether journal editors (and the peer review process generally, I suppose) discriminate against certain kinds of authors. His approach, in a nutshell, is to look for discrepancies between the editor's comparison of two papers and how those papers were ultimately compared by the scholarly community (based on citations). In tests he runs on two high-ranking American economics journals, he doesn't find a bias by QJE editors against authors from England or Europe (or in favor of Harvard authors), but he does find that JPE editors appear to discriminate against their Chicago colleagues.

While publication politics is of course interesting to me and other academics, I bring up this paper not so much for the results as for the technique. Since the most important decision an editor makes is whether to publish an article or not, the obvious way of trying to determine whether editors are biased would be to look at that decision -- perhaps look at whether editors are more likely to reject articles by a certain type of author, controlling for article quality. One could imagine a controlled experiment of this type, but otherwise this is an unworkable design: there is no good general way to "control for quality," and at any rate the record of what was submitted where would be impossible to piece together. Oswald's design neatly addresses both of these problems. Instead of looking at the untraceable accept/reject decision, he looks at the decision to accept two articles and place them next to each other in an issue; not only does this convey information about the editor's judgment of the articles' relative quality, but it means that citations to those articles provide a plausible comparison of their quality, uncomplicated by differences in the relative impact of different journals.

Oswald's approach of course rests on the assumption that citations provide an unbiased measure of the quality of a paper (at least relative to other papers published in the same volume), which is probably not true: any bias we might expect among journal editors would likely be common among scholars as a whole and would thus be reflected in citations. Oswald's test therefore really compares the bias of editors against the bias of the scholarly community as a whole: if everyone is biased in the same way, the test would never be able to reject the null hypothesis of no bias.

This issue aside, it seems like the general approach could be useful in other settings where we want to assess whether some selection process is biased. I haven't had much of a chance to think about it -- anyone have suggestions about where this kind of approach could be, or has been, applied to other topics?

Posted by Andy Eggers at 8:13 AM

September 11, 2008

Pigskin and Politics

Here's a paper in which the authors devised a clever way to gather data and answer an interesting question:

Pigskins and Politics: Linking Expressive Behavior and Voting

David Laband, Ram Pandit, Anne Laband & John Sophocleus
Journal of Sports Economics, October 2008, Pages 553-560

Abstract:
In this article, the authors use data collected from nearly 4,000 single-family residences in Auburn, Alabama to investigate empirically whether nonpolitical expressiveness (displaying support for Auburn University's football team outside one's home) is related to the probability that at least one resident voted in the national/state/local elections held on November 7, 2006. Controlling for the assessed value of the property and the length of ownership, the authors find that the likelihood of voting by at least one person from a residence with an external display of support for Auburn University is nearly 2 times greater than from a residence without such a display. This suggests that focusing narrowly on voting as a reflection of political expressiveness may lead researchers to overstate the relative importance of expressiveness in the voting context and understate its more fundamental and encompassing importance in a variety of contexts, only one of which may be voting.

What I think is clever here is the way the project uses observable factors (stuff in your yard, whether you vote, how much your house costs) to shed light on a fairly interesting aspect of behavior (why do people vote?). According to a working paper version, the authors simply drove around the city of Auburn, recording whether houses displayed political signs and Auburn paraphernalia (ranging from flying an AU flag to "placing an inflated figure of Aubie (AU's school mascot) in one's yard"). They later linked this up with voter rolls and data on home prices to get their correlations.

Of course, there are some problems with using football paraphernalia as a measure of "nonpolitical expressiveness." I don't know Auburn, but are Auburn fans more likely to be Republicans (controlling for the value of the house)? I am guessing they are. If more enthusiastic Auburn fans are also more enthusiastic Republicans (not just more expressive Republicans, but more ideologically committed ones), then these estimates would indicate too large a role for "expressiveness," particularly since the authors don't even record the party affiliation of people living in the houses, let alone strength of party identification. But their measure may be less confounded with political commitment itself than measures of expressiveness you could find in other communities. If you were to do this in Cambridge, I suppose you could use upkeep of the garden as a measure of expressiveness and community orientation (you wouldn't get far using Harvard football signs), but attention to the garden is of course correlated with wealth, which means there would be all the more difficulty in extracting the pure economic/political factors.

Anyway, I applaud the authors for devising a measure of expressiveness that works pretty well and is so easily observable.

Posted by Andy Eggers at 8:50 AM

August 12, 2008

Running and aging

The BBC reports on new findings, published in the Archives of Internal Medicine, from a longitudinal study of running and health. In 1984, the authors recruited members of a national club of over-50 runners and a control group of similar non-runners. Based on health outcomes observed in the years since, the authors conclude that "Vigorous exercise (running) at middle and older ages is associated with reduced disability in later life and a notable survival advantage." And -- more good news for runners -- the runners did not disproportionately suffer from knee and ankle injuries.

With studies like this, of course, I approach the paper wondering how much of the observed difference is due to the "treatment" (running) and how much is due to other differences between the treatment and control groups. Here I think that unmeasured confounding is significant; I would guess that selection accounts for maybe half of the difference in outcomes between the runners and non-runners. People who are in a running club at age 55 are unusual for many reasons that are not easily observed and controlled for. For one thing, they've usually run for a long time, meaning that the study really compares lifelong exercise regimens and not just whether the subjects run late in life. More importantly, older runners tend to have unusually fortunate genetic inheritances; anyone who didn't would be unable or unwilling to keep up a running regimen at that age. In my experience dedicated older runners also tend to be people with a fierce determination to conquer challenges and stick to a regimen. These are people who eat well and see their doctor and have friends and have all sorts of other health advantages. It's not surprising that people with such genetic and lifestyle advantages live longer, but attributing their longevity to running -- ie suggesting that anyone could start running and have similar outcomes -- would be to overlook a lot of these confounding factors.

The authors of the study recognize some of the selection problems they face, and they do what they can to address them, conditional on the study design. I was reminded in reading the paper that public health and medical researchers tend to be considerably more diligent than researchers in other social sciences in choosing words that differentiate association and causation. These distinctions tend to be lost in media coverage of the research but in this case the authors do a careful job of recognizing the obstacles to drawing inferences about the effects of running based on their study.

Posted by Andy Eggers at 7:36 AM

August 5, 2008

Gans and Leigh on the "Baby Bump"

In Born on the First of July: An (Un)natural Experiment in Birth Timing, forthcoming in the Journal of Public Economics, Joshua Gans and Andrew Leigh examine "introduction effects" (the extent to which people change their behavior in response to new policies) in the context of a baby bonus introduced in Australia in 2004. In May of that year, the government announced that families of babies born on or after July 1 would receive a $3000 cash bonus. Mothers with due dates around that time made special arrangements (mostly delaying Caesarean and other planned deliveries) to get the prize. The authors estimate that over 1000 births were moved; July 1, 2004, witnessed more births than any other day in the period since 1975 for which the authors have data.

The authors note two implications of the study. First, policies can provoke not only long-run distortions (e.g. an increase in the number of babies born) but also short-run distortions from gaming of the system. Second, the "baby bump" constituted a large disruption to regular procedures for maternity hospitals and their staff; the authors don't find effects on infant mortality, but they suggest that the event could be useful for studying the effects of under-staffing in hospitals.

My first thought in reading the paper was that it was a cautionary tale for regression discontinuity design. The setup of the study has the flavor of David Card et al's paper "Does Medicare Save Lives?," discussed on this blog by the intrepid John Graves, in which the authors examine the outcomes of patients who need medical procedures right around the time when they become eligible to receive Medicare benefits; they find that patients who were barely old enough to receive the benefits do considerably better than the ones who were too young. I figured this study was probably a failed attempt to do something similar, ie to study the effect of extra income on child mortality or other outcomes by comparing kids born just before and after the benefit was handed out. This sort of thing doesn't work when the subjects are able to sort around the threshold. In this case, the parents who gave birth just after the cutoff may have been more desperate for money, or had more power with the doctors, or perhaps were generally more in tune with political events, such that differences in outcomes between recipients and non-recipients of the bonus could be due to these factors and not the bonus itself. In Card et al's case, they focused on emergency procedures that could not have been delayed; this study shows that for many people childbirth is quite postponable. So in addition to the implications Gans and Leigh draw in their conclusion, I would add that this is another case where an RDD-style approach is complicated because subjects can effectively sort.

I do think you could examine the effect of the bonus on child outcomes if you looked at kids born at least 3 weeks before and after the cutoff date, a point at which the sorting is probably not that big of a deal. And date of birth is probably not itself a strong confounder for whatever you want to study, so there are limited advantages to focusing in on the threshold anyway.

At any rate, it appears my initial impression -- that the paper is the artifact of a failed RDD project -- was wrong: the authors have done other examinations of how events affect birth patterns (the effect of the millennium on conceptions, births and deaths, and the ability of parents to move births from inauspicious days like Feb 29 and April 1).

Posted by Andy Eggers at 8:02 AM

May 13, 2008

Data sets and data interfaces at datamob.org

I recently came across Datamob.org, a site featuring public datasets and interfaces that have been built to help the public explore them.

From datamob's about page:

Our listings emphasize the connection between data posted by governments and public institutions and the interfaces people are building to explore that data.

It's for anyone who's ever looked at a site like MAPLight.org and wondered, "Where did they get their data?" And for anyone who ever looked at THOMAS and thought, "There's got to be a better way to organize this!"

I continue to wonder how the types of interfaces featured on datamob will affect the dissemination of information in society. The dream of a lot of these interface builders is to disintermediate information provision -- ie, to make it possible for citizens to do their own research, produce their own insights, publish their findings on blogs and via data-laden widgets. (We welcomed Fernanda and Martin from Many Eyes, two prominent participants in this movement, earlier this year at our applied stats workshop.) At the same time, the new interfaces make it cheaper for professional analysts -- academics, journalists, consultants -- to access the data and, as they have always done, package it for public consumption. It makes me wonder to what extent the source of our data-backed insights will really change, ie, how much more common will "I was playing around with data on this website and found out that . . . " become relative to "I heard about this study where they found that . . ."?

My hunch is that, just as blogging and internet news have democratized political commentary, the new data resources will make it possible for a new group of relatively uncertified people to become intermediaries for data analysis. (I think FiveThirtyEight is a good example in political polling, although since the site's editor is anonymous I can't be sure.) People will overwhelmingly continue to get data insights as packaged by intermediaries rather than through new interfaces to raw data, but the intermediaries (who will use these new services) will be quicker to use data in making their points, will become much larger in number, and will on average become less credentialed.

Posted by Andy Eggers at 9:48 AM

April 29, 2008

Common sense and research design

Last week on the New York Times' "Well" blog, Tara Parker-Pope blogged about a study that appeared to show that a mother's diet can affect the sex of her child. Yes, the father's sperm determines the gender of a particular embryo, but the story is that the mother's nutritional intake can affect how likely a given embryo is to go to term. At any rate the study is based on survey data in which mothers of boys report eating more around the time of conception than mothers of girls.

I can't really pass judgment on the study itself -- I haven't had time to read the thing -- but as someone who is pretty obsessed with (and professionally involved in) criticizing causal inferences drawn from observational studies, I found it pretty entertaining to read the comments. I admit I did not read all 409 of them. But on the whole they fell into five categories:

1. Credulous, prepared to integrate conclusions into own understanding of the world: "Interesting article…so maybe my eating all that 'crap and vitamins' will help me conceive a boy!!"

2. Generally dismissive: "Unmitigated rubbish! Another 'scientific study' that will be repudiated in two years."

3. Skeptical based on measurement error: "Surveying the diets of women who are 14 weeks pregnant and asking them to 'recall' what they had eaten earlier in pregnancy or preconception will not yield accurate data."

4. Skeptical based on unrealized observable implications: "Are there more daughters born to women in developing countries?" and "Seems to me that the obvious answer lies in genders of children born to diabetic mothers - whose bloodsugars are usually higher than the average nondiabetic woman."

5. Skeptical because of possibility of reverse causation: "Shouldn’t we be interested in the fact that the gender of the baby seems to be affecting the eating habits of the mother? That seems much more interesting to me."

For all I know, many of the perceptive comments in categories 4 and 5 came from professional statisticians, but my guess is that many of these people have never been involved in research in any serious way. In that sense I find it heartening to see so much careful public deliberation about research findings. My experience is that, while a sharp eye for research design can be taught and learned, most of the issues that occupy me and other members of what Jim Snyder affectionately calls the "identification Taliban" -- the statisticians and social scientists who maraud around academia trying to put burkas on those who would interpret a cross-country regression causally -- are quite simple and widely understood. It seems like the most dangerously misguided people are the ones with one semester of econometrics and a working knowledge of Stata. It's as if you lose the common sense your mother taught you once you learn how to run a regression. (Disclosure: I was certainly a danger to myself and others at that stage.)

I find that the mass participation aspect of the web alternately exhilarates (StumbleUpon!) and depresses me (inane, racist YouTube comments!); reading the comments on that NYT blog entry was one of the happier experiences.

Posted by Andy Eggers at 4:36 PM

April 15, 2008

Google Charts from R: Maps

A few weeks ago I wrote a post sharing some code to generate sharp-looking PNG scatterplots from R using the Google Chart API. I think there are some nice uses of that (for example, as suggested by a commenter, to send a quick plot over IM), but here's something that I think could be much more useful: maps from R using Google Charts.

So, suppose you have data on the proportion of people who say "pop" (as opposed to "soda" or "coke") in each US state. (I got this data from Many-Eyes.) Once you get my code, you enter a command like this in R

googlemap(x = pct_who_say_pop, codes = state_codes, location = "usa", file = "pop.png")

and this image is saved locally as "pop.png":

To use this, first get the code via
source("http://people.fas.harvard.edu/~aeggers/googlemap.r")
which loads in a function named googlemap, to which you pass


  • x: a vector of data

  • codes: a vector of state/country codes (see the list of standard state and country codes),

  • and location: a region of the world ("africa", "asia", "europe", "middle_east", "south_america", "usa") or the whole world ("world")


and you get back a url that you can embed in html, send over IM, etc. If you pass a file argument, as I did above, the PNG is also saved locally.

For optional parameters to affect the scale of the figure and its colors, see the source.

Another quick example:

Suppose you wanted to make a little plot of Germany's colonial possessions in Africa. This code

googlemap(x = c(1,1,1,1), location = "africa", codes = c("CM", "TZ", "NA", "TG"), file = "germans_in_africa.png")

returns this url

"http://chart.apis.google.com/chart?cht=t&chtm=africa . . . etc.

and saves this PNG on your hard drive:

The scatterplot thing before was something of a novelty, but I think this mapping functionality could actually be useful for generating quick maps in R, since the existing approaches are pretty annoying in my (limited) experience. The Google Charts API is not very flexible about labels and whatnot, so you probably won't be publishing any of these figures. But I expect this will serve very well for quick exploratory work, and I hope others find it useful too.
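
For anyone curious about what the function is doing before diving into the source, the core is just pasting together a chart url along the lines of the one above and (optionally) downloading the result. Here is a stripped-down sketch, not the actual googlemap code: the cht, chtm, chs, and chd parameters appear in the url above, while chld (the region codes) and chco (the colors) are my recollection of the API documentation, so check them against the docs.

# A stripped-down sketch of the url-building, not the real googlemap() source.
simple.encode <- function(x, lo = min(x), hi = max(x)) {
  chars <- c(LETTERS, letters, 0:9)            # Google's "simple encoding": 0-61
  scaled <- floor(61 * (x - lo) / max(hi - lo, 1e-9))
  paste(chars[scaled + 1], collapse = "")
}

googlemap.sketch <- function(x, codes, location = "usa", size = "440x220", file = NULL) {
  url <- paste("http://chart.apis.google.com/chart?cht=t",
               "&chtm=", location,
               "&chs=", size,
               "&chld=", paste(codes, collapse = ""),
               "&chd=s:", simple.encode(x),
               "&chco=ffffff,edf0d4,13390a",    # default fill plus a light-to-dark gradient
               sep = "")
  if (!is.null(file)) download.file(url, file, mode = "wb")
  url
}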

I'd love it if someone wanted to help roll this into a proper R package . . . .

Posted by Andy Eggers at 3:01 PM

April 2, 2008

Google Charts from R

Late last year Google released the Google Chart API, which gives programmers access to a web service that renders charts. So this url

http://chart.apis.google.com/chart?cht=p3&chd=t:60,40&chs=250x100&chl=Hello|World

will produce a chart like this:

Try it yourself -- copy the url into your browser; change the text from "Hello World" to something else, etc. And the API supports bar plots, line charts, Venn diagrams (!) and even, recently, maps.

People have written libraries in various languages to provide interfaces to the API (here's a list of them), and tonight I hacked together a little R interface to the scatterplot charts. It's quite rough, but I'd be curious if anyone wants to extend it or can show anything cool with it.

From R, all you have to do is:

> source("http://people.fas.harvard.edu/~aeggers/code/googleplot.r")

And then where you might say

> plot(1:9, c(4,2,4,3,6,4,7,8,5), cex = 1:9, xlim = c(0, 10), ylim = c(1,10))

you can use the same syntax with the googleplot function

> googleplot(1:9, c(4,2,4,3,6,4,7,8,5), cex = 1:9, xlim = c(0, 10), ylim = c(1,10))

and get back a long url encoding those parameters

"http://chart.apis.google.com/chart?cht=s&chd=s:GMSYekqw2,SGSMeSkqY,
GNUbhov29&chxt=x,y&chxl=0:|0|2|5|7|10|1:|1|3|5|7|10&chs=250x200"

which, when entered into an address bar or embedded in an img tag in a web page, gives you a figure like this:

It seems like this approach could provide a convenient way to publish a figure on the web in some circumstances, but setting aside the insufficiency of my R function, the API isn't quite flexible enough graphically yet (eg you can't pass an axis label, ie xlab in R). In most cases it seems like you'd just want to create a nice PNG in R or whatever and then publish that. But I'd love to hear if anyone finds a way to use this or thinks it's worth extending further.
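
Incidentally, the chd=s: piece of that url is Google's "simple encoding," which maps each value onto the 62 characters A-Z, a-z, 0-9 (that is, onto 0-61 after scaling by the axis limits). A couple of lines of R -- my own decoding of the format, so treat it as a sketch -- recover the first data series from the url above:

# Decode Google's "simple encoding": A-Z, a-z, 0-9 correspond to 0..61
simple.decode <- function(s) {
  chars <- c(LETTERS, letters, 0:9)
  match(strsplit(s, "")[[1]], chars) - 1
}
simple.decode("GMSYekqw2")
# 6 12 18 24 30 36 42 48 54  -- i.e. floor(61 * (1:9) / 10), the x data scaled against the xlim of c(0, 10)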

Posted by Andy Eggers at 1:07 AM

March 18, 2008

Games That Produce Data

In a conversation with Kevin Quinn this week I was reminded of a fascinating lecture given at Google in 2006 by Luis von Ahn, an assistant professor in computer science at Carnegie Mellon. Von Ahn gives a very entertaining and thought-provoking talk on ingenious ways to apply human intelligence and judgment on a large scale to fairly small problems that computers still struggle with.

(Or watch video on Google video.)

Von Ahn devises games that produce data, the best-known example being the ESP Game, which Google acquired and developed as Google Image Labeler. In the game, you are paired with another (anonymous) player and shown an image. Each of you feverishly types in words describing the image (eg, "Spitzer", "politician", "scandal", "prostitution"); you get points and move to the next image when you and your partner agree on a label. The game is fun, even addictive, and of course Google gets a big, free payoff -- a set of validated keywords for each image.

I'm curious about how these approaches can be applied to coding problems in social science. A lot of recent interesting work has involved developing machine learning techniques to teach computers to label text, but there are clearly cases where language is just too subtle and complex for automated methods to extract meaning accurately, and we need real people to read the text and make judgments. Mostly we hire RAs or do it ourselves; could we devise games instead?

Posted by Andy Eggers at 9:37 AM

February 23, 2008

Publication Bias in Drug Trials

A study published in the New England Journal of Medicine last month showed that widely-prescribed antidepressants may not be as effective as the published research indicates. After reading about the study in the NYT, I recently read the article and was struck by how well the authors were able to document the somewhat elusive phenomenon of publication bias.

Researchers in most fields can document publication bias only by pointing out patterns in published results. A jump in the density of t-stats around 2 is one strong sign that null reports are not being published; an inverse relationship between average reported effect size and sample size in studies of the same phenomenon is another strong sign (because the only small studies that could be published are the ones with large estimated effects). These meta-analysis procedures are clever because they infer something about unpublished studies from what we see in published studies.
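
To make the second signature concrete, here is a quick simulation of my own (not from the NEJM article): when only studies with t-statistics above 2 get published, the smaller published studies end up reporting systematically larger effects than the bigger ones, even though every study estimates the same true effect.

# Simulate two-arm studies with a common true effect but varying sample sizes,
# "publish" only those with t > 2, and compare published effect sizes by study size.
set.seed(42)
true.effect <- 0.2
n.studies <- 2000
n.per.arm <- sample(20:400, n.studies, replace = TRUE)
est <- se <- numeric(n.studies)
for (i in 1:n.studies) {
  treat <- rnorm(n.per.arm[i], mean = true.effect)
  control <- rnorm(n.per.arm[i], mean = 0)
  est[i] <- mean(treat) - mean(control)
  se[i] <- sqrt(var(treat) / n.per.arm[i] + var(control) / n.per.arm[i])
}
published <- est / se > 2
# Mean published effect by sample size: the smallest studies overstate the effect the most
tapply(est[published], cut(n.per.arm[published], c(0, 50, 100, 200, 400)), mean)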

As the NEJM article makes clear, publication bias is more directly observable in drug trials because we have very good information about unpublished trials. When a pharmaceutical company initiates clinical trials for a new drug, the studies are registered with the FDA; in order to get FDA approval to bring the drug to market, the company must submit the results of all of those trials (including the raw data) for FDA review. All trials conducted on a particular drug are therefore reviewed by the FDA, but only a subset of those trials is published in medical journals.

The NEJM article uses this information to determine which antidepressant trials made it into the journals:

Among 74 FDA-registered studies, 31%, accounting for 3449 study participants, were not published. Whether and how the studies were published were associated with the study outcome. A total of 37 studies viewed by the FDA as having positive results were published; 1 study viewed as positive was not published. Studies viewed by the FDA as having negative or questionable results were, with 3 exceptions, either not published (22 studies) or published in a way that, in our opinion, conveyed a positive outcome (11 studies). According to the published literature, it appeared that 94% of the trials conducted were positive. By contrast, the FDA analysis showed that 51% were positive. Separate meta-analyses of the FDA and journal data sets showed that the increase in effect size ranged from 11 to 69% for individual drugs and was 32% overall.

One complaint -- I thought it was too bad that the authors did not determine whether the 22 "negative or questionable" studies that went unpublished were never submitted ("the file drawer problem") or were rejected by the journals. But otherwise very thorough and interesting.

Posted by Andy Eggers at 2:05 AM

January 27, 2008

Opening the government's books at FedSpending.org

FedSpending.org has been on my list of new data resources to check out for a while now. This is a project of budget watchdog OMB Watch that brings government data on contracts and awards together in a single searchable database at www.fedspending.org. After looking into it this week I can report that it looks like a very useful resource, interesting not only for the data it provides but as yet another example of the phenomenon of private groups repackaging government data for public use in the name of transparency and accountability.

First, on the data: you can use FedSpending.org to get pretty good detail on contracts handed out by the federal government, including the contracting agency, the company to which the contract was awarded, the size and purpose of the contract, the location of the company, and the location where the contract was carried out. You can subset it in interesting ways, searching by congressional district or by contractor. You can even specify contractor characteristics. For example, this search will return contracts awarded to minority-owned businesses in Massachusetts. Finally, you can output search results in various formats (including CSV), and there is even an API that allows you to generate your queries programmatically so that, with a little bit of code to parse the XML, you could systematically build an interesting dataset without spending time clicking and downloading by hand.
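
I won't reproduce the query syntax here (see the API documentation on the site), but the parsing end really is only a few lines with R's XML package. A toy illustration, with made-up field names standing in for whatever the real query returns:

library(XML)   # install.packages("XML") if you don't have it

# A toy response in roughly the shape such an API might return (field names invented)
xml.text <- "<results>
  <contract><agency>DOD</agency><dollars>150000</dollars></contract>
  <contract><agency>HHS</agency><dollars>42000</dollars></contract>
</results>"

doc <- xmlTreeParse(xml.text, asText = TRUE, useInternalNodes = TRUE)
contracts <- data.frame(
  agency  = xpathSApply(doc, "//contract/agency", xmlValue),
  dollars = as.numeric(xpathSApply(doc, "//contract/dollars", xmlValue)))
contracts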

I view this site as part of a phenomenon where data released by the government is being repackaged and publicly released in a useful format by private actors. Another intriguing example is GovTrack, a project built by a Princeton undergrad several years ago (he's now a linguistics grad student) that parses the congressional record and a bunch of other public information and provides email notifications so that you can track any legislator or issue. A more professional recent attempt to bring transparency to Capitol Hill is OpenCongress.

I'm interested in whether academics and other researchers find useful ways to exploit these new resources. I suspect the pretty presentation that these sites provide will be of limited use to researchers, who are used to grappling with the raw data, but other aspects of cleaning up the data for public consumption (for example, tagging elements of the congressional record based on which members are mentioned) seem like they could save us some steps. And the reach and effectiveness of this kind of transparency effort is itself a topic worthy of research in my view.

One of the interesting things about FedSpending is that last month OMB released its own website providing public access to data on federal contracts and awards, at USASpending.gov. On the surface, USASpending looks pretty different from FedSpending -- the government site's design is clean and patriotic, prominently featuring a US flag and the White House facade, while FedSpending's design stays true to the site's watchdog roots with a green color scheme and a washed-out dollar bill as a header. But look more closely and you find that USASpending is basically a clone of FedSpending, identical down to the examples in the documentation. It turns out that a piece of 2006 legislation (the bill co-sponsored by Obama that was held up by the notorious "secret hold") required OMB to build a website to publicly disclose data on federal contracts and awards. But at the time OMB Watch was already almost done with a site of its own that would do essentially everything specified in the legislation. So in a bizarre twist, OMB decided to hire OMB Watch (which, as the name suggests, usually makes its bread by criticizing OMB) to lend its technology to the government project. Based on the outcome, it looks like OMB is basically running a clone of the OMB Watch site. So much for legislating transparency.

Incidentally, can anyone use FedSpending (or USASpending) to find the contract in which OMB hired OMB Watch? I couldn't. The Washington Post article where I learned about the arrangement says there was an intermediary contractor, and the contract appears not to have been sourced from OMB, so I wasn't able to locate it either by the contractor or the contracting agency. Oh well.

Posted by Andy Eggers at 2:49 PM

December 5, 2007

Holiday Gifts for the Data-Addicted

The infosthetics blog offers its "shopping guide for the data-addicted." I was intrigued by the chumby and nabaztag, two devices that offer the charms of the internet divorced from the keyboard/mouse/monitor setup. For the urban planner on your list, don't miss the fly swatter whose mesh is a street map of Milan. For the social science stats crowd, though, the best gift on the list has to be the Death and Taxes poster, depicting the US federal discretionary budget in remarkable detail and clarity. Click on the image below to get a close-up look at the poster.
[Image: dat.jpg]

Posted by Andy Eggers at 8:52 AM

November 21, 2007

A case for the case-control method

In his opinion piece in this weekend's New York Times, Henry Louis Gates presents a taste of the research behind his forthcoming book, In Search of Our Roots:

I have been studying the family trees of 20 successful African-Americans, people in fields ranging from entertainment and sports (Oprah Winfrey, the track star Jackie Joyner-Kersee) to space travel and medicine (the astronaut Mae Jemison and Ben Carson, a pediatric neurosurgeon). And I’ve seen an astonishing pattern: 15 of the 20 descend from at least one line of former slaves who managed to obtain property by 1920 — a time when only 25 percent of all African-American families owned property.

The question is, how astonishing is the pattern Gates points out?

Whether we should be impressed that 15 of 20 successful African-Americans had landowning ancestors depends a lot on what we assume about patterns of intermarriage among landowning and non-landowning African-Americans. Let's consider two extreme possibilities. First, assume that landowners never married non-landowners. In that case, one's grandparents would have been either all landowners or all non-landowners (more precisely, all from landowning families or all from non-landowning families). Given that 25% of all African-American families were landowners in 1920, 25% of African Americans of Oprah's generation would have had landowning grandparents (assuming that landowners and non-landowners had equally sized families, and also assuming that African-Americans had only African-American grandparents). The fact that 75% of Gates's successful African-Americans had landowning grandparents would indeed be remarkable, supporting his claim that having landowning ancestors helped them succeed.

At the other extreme, assume that landowners intermarried perfectly freely with non-landowners. In that case, there is an almost 70% chance that a randomly-selected person of Oprah's generation would have at least one landowning grandparent. (To see this: given that 3/4 of all grandparents were not landowners, the probability that a randomly selected person would have no landowning grandparents is (3/4)^4 = 31.6%.) If that were true, we would be quite likely to see as many as 15 people with landowning ancestors in Gates's sample even if landowning had nothing to do with success (p-value = .19). The pattern he observes would thus provide only very weak evidence for his argument about the importance of landowning.

Gates's book appears typical of a genre of case study that focuses on remarkable people (or companies, or countries) in order to determine how they became that way. The problem with these studies (as pointed out in KKV, among many other places) is that they assume too much about the characteristics of unremarkable people (or companies, or countries). In the above example, Gates implicitly assumes that far fewer than 75% of unsuccessful African-Americans had a landowning ancestor in 1920. Instead of relying on this (possibly erroneous) assumption, he could have explicitly compared the family histories of his sample of remarkable African-Americans with those of another sample of unremarkable African-Americans. (This research design is known as "case-control" in epidemiology.) I doubt it would help with book sales to include a few chapters about thoroughly unfamous people, but it would make his arguments more convincing.

Posted by Andy Eggers at 12:40 AM

November 8, 2007

Beyond scatterplots

Are scatterplots confusing? Turns out the graphics people at the New York Times, who I think have been putting out some outstanding work in the past few years, think so.

Matthew Ericson, Deputy Graphics Editor at the NYT, gave a talk recently at the Infovis conference in which he described some of the techniques his staff uses to communicate information to readers. I wasn't there, but I looked through his slides (70 MB zip file), which provide both highlights from the NYT's recent graphics and some indication of the process by which they arrive at a final product. Particularly interesting are slides 35-62, in which he shows how they developed a graphic depicting partisan shifts between the 2004 and 2006 Congressional elections. Early in the sequence (at page 38), you see a draft of what seems like an adequate approach -- a scatterplot depicting 2004 vote margin vs 2006 vote margin. It turns out (I'm basing this on Fernanda Viegas' description of the talk on the infosthetics blog) that the NYT graphics staff has found that lay readers don't really understand scatterplots, in part because they are so used to seeing time on the x-axis. So Ericson and his staff went back to the drawing board and developed something different (shown on page 61 of the slides; you can also see a one-page pdf here or by clicking on the thumbnail below). Their new graphic orders the districts vertically by their 2006 vote share and shows the vote outcome on the horizontal axis, depicting the 2004-2006 shift by a horizontal arrow originating at the 2004 vote margin and ending at the 2006 margin. This approach conveys the information much less compactly (for one thing, all of the information in the y-axis is also in the x-axis) but communicates the partisan shift in a more intuitive way, while also giving a better sense of the distribution of partisanship across districts than did the scatterplot. Even though I'm used to seeing scatterplots I think I get a lot more out of this figure, especially with the extra summary stats they are able to depict by continuing the theme of horizontal arrows depicting 2004-2006 shifts.
[Image: ericson_beyond_scatterplots.JPG]

Posted by Andy Eggers at 10:55 AM

October 25, 2007

Visualizing UK Politicians

Since I saw Fernanda Viegas and Martin Wattenberg's presentation on Many Eyes a few weeks ago in our Applied Stats workshop, I've been itching to use their visualization tools on some of my own data. Tonight I made a treemap of the dataset of UK politicians that Jens Hainmueller and I have been developing. (The data consists of over 6000 candidates who ran for the House of Commons between 1950 and 1970.) I set up the visualization such that each box in the treemap is sized to indicate the number of campaigns for each combination of party and occupation (eg Conservative barristers) and the color reflects "proportion attending Oxbridge." But you can play with it via the menus at the bottom of the visualization and cut the data the way you want: you can make the size reflect "proportion female" and the color reflect "proportion elected," and you can make it show party by occupation instead of occupation by party. I've embedded the visualization here; you can make comments tied to a particular view of the data on the many-eyes site. I'm eager to hear reactions, whether on the visualization or on the brand-new data.

Posted by Andy Eggers at 1:50 AM

October 12, 2007

Visualization for data cleaning

Speaking of Fernanda Viegas and Martin Wattenberg's excellent presentation on visualization, I recently came across a data cleaning problem where visualization was a big help. Data cleaning is all about having powerful ways of finding mistakes quickly. Much of the time, clever scripting is the best way to detect errors, but in this case a simple data visualization turned out to be the best tool. Screenshot after the jump.

First, a little background on the project, which is a collaboration with Jens Hainmueller. The Times of London published election guides throughout the 20th century including voting results and candidate bios for every constituency in every election to the House of Commons. We scanned and OCR'd seven volumes of this series and wrote scripts to extract information about each constituency race, including the name, vote total, and short bio of each candidate. The challenge then was to determine which appearances belonged to the same individual. For example, when "P G Agnew" runs in 1950 and "Peter Agnew" runs in 1955, are they the same person? We trained a clustering algorithm to do this matching based on name similarity, year of birth, party, and gender, and wrote some scripts to catch likely errors. When we thought we had done as well as we could, we decided to produce a little visualization to admire our perfectly cleaned data. To our surprise, the visualization revealed a number of hard-to-catch remaining errors.

As can be seen in the screenshot below, we listed the candidates alphabetically by surname and depicted their election career graphically with a colored rectangle for each appearance in a race. We selected the colors to reflect the margin in the race, with deep green indicating an easy victory and deep red indicating a resounding defeat.
[Image: thc_screenshot3.JPG]
Depicting the candidates' campaign history in this way helped us see patterns that suggested that a single candidate had been incorrectly coded as separate candidates. Brian Batsford, shown at the top of the screen shot, was one such case: the Brian Batsford who ran in 1959, 1964, and 1970 was very likely to be the same person as the Brian Batsford who ran in 1966. Indeed, it turned out that they were the same person; our clustering algorithm had mistakenly separated him in two because the year of birth had been miscoded as 1928 in his 1966 appearance.

The key point here is that the pattern that allowed us to see this mistake is easier to see than it is to articulate and, perhaps more importantly, than it is to write in a script. (OK, I'll try: "Find pairs of candidates who have similar names and did not appear in the same elections, especially if they appeared in contiguous elections and had similar results.") I prefer the pretty colors.
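
For what it's worth, here is roughly what the naive version of that rule would look like in R -- a sketch only, assuming a data frame cands with hypothetical columns id, surname, and year (one row per appearance), and leaving out the "similar results" part:

# Flag pairs of distinct candidate ids with similar surnames who never appear
# in the same election year -- the naive version of the rule above.
flag.possible.duplicates <- function(cands, max.dist = 1) {
  ids <- unique(cands$id)
  flagged <- NULL
  for (i in seq_along(ids)) {
    for (j in seq_len(i - 1)) {
      a <- cands[cands$id == ids[i], ]
      b <- cands[cands$id == ids[j], ]
      similar.names <- length(agrep(a$surname[1], b$surname[1], max.distance = max.dist)) > 0
      disjoint.years <- length(intersect(a$year, b$year)) == 0
      if (similar.names && disjoint.years)
        flagged <- rbind(flagged, data.frame(id1 = ids[i], id2 = ids[j]))
    }
  }
  flagged
}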

Posted by Andy Eggers at 12:35 PM

September 27, 2007

How do you get 7,000,000 cell phone records?

Not to take anything away from David Lazer's presentation today at the Applied Stats workshop, but the star of his talk was the data. The crowd favorite appeared to be a dataset of all cell phone transactions over a several-week period for 7,000,000 subscribers somewhere in Europe (he wouldn't say where). David and his colleagues have built a graph of interpersonal connections based on the call data, and are trying to answer questions like, "How many degrees of separation are there between two randomly selected people in the network?" (Answer: 13.) But to me an even more compelling question came up in the Q&A session: where do you get data like this?

David's answer was basically that you need to know the right people; it sounded as if he or one of his colleagues knew key executives at the phone company who were able to provide the call records. Lee Fleming offered that grad students might find their way to data like this by getting to know scholars like David who have access to it. (How many degrees of separation are there between you and your dream dataset?)

But the importance of knowing cell phone execs would be the wrong takeaway from David's talk, which after all was basically about how we are all awash in data these days. Yes, to get data on cell phone calls you may need to have friends at the phone company, and yes, to get information on where a group of MIT students spends every hour of the day over a few weeks you will have to launch your own experiment (as described in David's talk today), but for those of us with fewer connections and smaller research budgets there is still an enormous amount of data out there to collect, much of it from the web. I've actually spent a fair amount of time in the past year learning how to collect data from the web, and I look forward to blogging here about web scraping and other data collection approaches in the next few months. But right now I'm going to go check whether David left any tracking devices in my bag.

Posted by Andy Eggers at 12:44 AM