Authors' Committee


Matt Blackwell (Gov)


Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship

March 2, 2011

R Graph Cookbook

I've been waiting for this kind of R book for a while. Packt Publishing, which releases technical and information technology books, has just published The R Graph Cookbook. The premise is simple: there is a need for a book that clearly presents "recipes" of R graphs in one comprehensive volume. Indeed, many researchers switch to R (from Stata or SAS) in part because of the enormous flexibility and power of R in creating graphs.

This book is perhaps most useful for beginners, but even experienced R users should find the clarity of the presentation and discussion of advanced graphics informative. In particular, I found the presentation of how to create heatmaps and geographic maps useful. I'll certainly use these examples when teaching data visualization. Another enormous benefit of the book is that the author has released all the R code used to create the graphs. You can download the R code here.

I have two quibbles, however. First, while the use of color in the graphs is pretty, I would've liked more examples with black-and-white templates. Although color graphs will be the norm many decades from now (when most research might conceivably be published exclusively online), most research is currently published in journals that do not print in color. Second, like nearly all books I've seen on graphics using statistical packages, the author doesn't present graphics for regression coefficients and cross-tabs. (For information on graphing these, I recommend the excellent article on using graphs instead of tables, published in Perspectives on Politics.) Nonetheless, these are minor issues, and most R users, regardless of skill level, should find this book very useful for teaching and reference.

Posted by Ethan Fosse at 6:41 PM

January 9, 2010

Sequential Ideal Points

Simon Jackman puts together a plot of how the estimation of ideal points of the 111th U.S. Senate changes as he adds each roll call. Every Senator starts the term at 0 and then branches out. It illustrates an interesting feature of these IRT models:

The other thing is that there doesn't seem to be any obvious "vote 1" update for ideal points. That is, there is no simple mapping from the ideal point estimate based on m roll call to ideal point estimates based on m+1 roll calls. You have to start the fitting algorithm from scratch each time (and hence the appeal of exploiting multiple cores etc), although the results from the previous run giving pretty good start values.

Posted by Matt Blackwell at 3:57 PM

October 14, 2009

The Fundamental Regret of Causal Inference

Tim Kreider at the New York Times has a short piece on what he dubs "The Referendum" and how it plagues us:

The Referendum is a phenomenon typical of (but not limited to) midlife, whereby people, increasingly aware of the finiteness of their time in the world, the limitations placed on them by their choices so far, and the narrowing options remaining to them, start judging their peers' differing choices with reactions ranging from envy to contempt. ...Friends who seemed pretty much indistinguishable from you in your 20s make different choices about family or career, and after a decade or two these initial differences yield such radically divergent trajectories that when you get together again you can only regard each other's lives with bemused incomprehension.

Those familiar with causal inference will recognize this as stemming from the Fundamental Problem of Causal Inference: we cannot observe, for one individual, both their response to treatment and their response to control. The article is an elegant look at how we grow to worry about those mysterious missing potential outcomes--the paths we didn't choose--and how we use our friends' lives to impute those missing outcomes. Kreider goes on to make this point exactly, with a beautiful quote from a novel:

The problem is, we only get one chance at this, with no do-overs. Life is, in effect, a non-repeatable experiment with no control. In his novel about marriage, "Light Years," James Salter writes: "For whatever we do, even whatever we do not do prevents us from doing its opposite. Acts demolish their alternatives, that is the paradox." Watching our peers' lives is the closest we can come to a glimpse of the parallel universes in which we didn't ruin that relationship years ago, or got that job we applied for, or got on that plane after all. It's tempting to read other people's lives as cautionary fables or repudiations of our own.

Perhaps the only response is that, while so close to us in so many respects, friends may be poor matches for gauging these kinds of effects. In any case, "Acts demolish their alternatives, that is the paradox" is the best description of the problem of causal inference that I have seen.

Posted by Matt Blackwell at 4:19 PM

June 8, 2009

Was there Really a Hawthorne Effect at the Hawthorne Plant?

The idea of the Hawthorne effect is that individuals may change their behavior because they are being studied, in addition to any real effects of the intervention. Steven Levitt and John List have revisited the illumination experiments at the Hawthorne plant that gave the effect its name, and argue that many of the original conclusions do not hold up to scrutiny. There's an Economist article on the paper here, but its subtitle, "Being watched may not affect behavior, after all," is misleading: even if the earlier research was sloppy by today's standards, its contribution was to point out the possibility of these effects. A better subtitle could have commended replication as an important part of the scientific method.

Levitt and List (2009) "Was there Really a Hawthorne Effect at the Hawthorne Plant? An Analysis of the Original Illumination Experiments" NBER Working Paper #15016.

The Economist (June 4, 2009) "Light work: Questioning the Hawthorne effect".

Posted by Sebastian Bauhoff at 8:53 AM

May 23, 2009

Distribution of Swine Flu Cases by Weekday

How would you expect swine flu cases to be distributed by weekday? More specifically, would you expect more cases on weekdays or on weekends? My first reaction is that there will be more cases when there are more social gatherings.

Following this logic, one reason to expect more cases on weekdays is that the susceptible population has more contact with the infected population then, through school, work, and so on. In addition, people are more likely to travel on weekends, which means more contact with infected people while traveling; but because the virus takes around two days to take effect, those cases will not be identified until a couple of days later. Could the pattern also arise because fewer clinical services are provided on weekends and people are less likely to visit clinics then?

Here is an old graph I made based on the swine flu updates (04/26/2009 - 05/21/2009) published on the WHO's website. To be more accurate, I drew a new graph using the number of newly confirmed cases rather than the cumulative number of confirmed cases.

Because the reporting times for newly confirmed cases vary (some at 18:00, others at 6:00, and so on), I kept only the records between 05/01 and 05/21 that were reported at 6:00 and redrew the graph. Weekdays are redefined as well: for example, Thursday 6:00 to Friday 6:00 counts as Thursday. Can you still see any salient patterns, such as a difference between weekdays and weekends? And why is Friday so spiky this time?
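The two data steps described here (differencing cumulative counts into new cases, then assigning each 6:00 report to the day its 24-hour window started) can be sketched roughly as follows. The counts are made-up placeholders rather than the WHO figures, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical cumulative confirmed-case counts, each reported at 6:00.
# Placeholder numbers for illustration only, not WHO data.
reports = [
    (datetime(2009, 5, 1, 6), 100),
    (datetime(2009, 5, 2, 6), 130),
    (datetime(2009, 5, 3, 6), 150),
    (datetime(2009, 5, 4, 6), 165),
    (datetime(2009, 5, 5, 6), 210),
]

def new_cases_by_weekday(reports):
    """Difference cumulative counts and bin the new cases by weekday.

    A report at 6:00 on day d covers the window from 6:00 on day d-1 to
    6:00 on day d, so its new cases are assigned to day d-1.
    """
    totals = defaultdict(int)
    for (_, prev_count), (stamp, count) in zip(reports, reports[1:]):
        window_start = stamp - timedelta(days=1)
        totals[WEEKDAYS[window_start.weekday()]] += count - prev_count
    return dict(totals)

by_weekday = new_cases_by_weekday(reports)
print(by_weekday)
```

With real data, several weeks of reports would accumulate in each weekday bin, so dividing by the number of windows per weekday would give a fairer comparison than raw totals.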

Posted by Weihua An at 12:38 AM

May 20, 2009

Debates on government transparency websites

A few weeks ago my friend Aaron Swartz wrote a blog post called Transparency is Bunk, arguing that government transparency websites don't do what they're supposed to do, and in fact have perversely negative effects: they bury the real story in oceans of pro forma data, encourage apathy by revealing "the mindnumbing universality of waste and corruption," and lull activists into a false sense of accomplishment when occasional successes occur. It's a particularly powerful piece because Aaron uses the platform to announce he's done working on his own government website. The piece appears to have caused a stir in government transparency/hacktivist circles, where Aaron is pretty well known.

On looking back at it I think Aaron's argument (or rant, more accurately) against the transparency websites is not very strong: indeed, data overload, apathy, and complacency are all dangers these efforts face, but that shouldn't have come as a surprise.

I had two other responses particular to my perch in academia. First, there is some good academic research showing that transparency works, although the evidence on the effectiveness of grassroots watchdogging is less strong than the evidence on auditing from e.g. Ferraz and Finan on Brazilian municipalities (QJE 2008, working paper version) or Olken's field experiment in Indonesia (JPE 2008, working paper version).

Second, my own work and that of other academics benefits greatly from these websites. I have a project right now on the investments of members of Congress (joint with Jens Hainmueller) that is possible only because of websites like the ones Aaron criticizes. I think this paper is going to be useful in helping watchdogs understand how Congress invests and whether additional regulation is a good idea, and it would be a shame if the funders of these sites listened to Aaron and shut them down.

I do agree with Aaron that professional analysis may be better than grassroots citizen activism in achieving the goals of the transparency movement. Sticking with the example of the congressional stock trading data I'm using, I suspect that not much useful watchdogging came out of the web interface that OpenSecrets provides for the investments data. While it may be interesting to know that Nancy Pelosi owns stock in company X, it's hard to get any sense of patterns of ownership across members and how these investments relate to political relationships between members and companies. This is what our paper tries to do. It takes a ton of work, far more than an investigative journalist is going to put in. We do it because of the rewards of publishing interesting and original and careful research, and also because these transparency websites have made it much more manageable: OpenSecrets converted the scanned disclosure forms into a database and provided lobbying data, and GovTrack provided committee and bill info, as well as an API linking company addresses to congressional districts. Most of the excitement around these websites seems to center on grassroots citizen activism, but their value to academic research (and the value of academic research to government accountability) should not be overlooked.

Posted by Andy Eggers at 10:53 PM

May 13, 2009

Natural Languages

The social sciences have long embraced the idea of text-as-data, but in recent years, increasing numbers of quantitative researchers are investigating how to have computers find answers to questions in texts. This task might appear easy at the outset (as it apparently did to early researchers in machine translation), but, as we know, natural languages are incredibly complicated. In most of the applications in social science, analysts end up making a "bag of words" assumption--the relevant parts of a document are the actual words, not their order (this is not an unreasonable assumption, especially given the questions being asked).
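To make the "bag of words" assumption concrete, here is a minimal sketch: each document is reduced to its word counts, so two documents with the same words in a different order become indistinguishable. The tokenizing rule and the example sentences are illustrative assumptions, not taken from any particular application.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for a document, discarding word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc_a = "The senator opposed the bill."
doc_b = "The bill opposed the senator."

print(bag_of_words(doc_a))
# Same words, different order: identical under the bag-of-words assumption.
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```

The second comparison shows what the assumption throws away: who opposed whom is lost, even though the vocabulary is fully preserved.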

When I see applications of natural language processing (NLP) in the social sciences, my thoughts typically turn very quickly to its future. Computers are making strides at being able to understand, in some sense, what they are reading. Two recent articles, however, give a good overview of the challenges that NLP faces. First, John Seabrook of the New Yorker had an article last summer, Hello, Hal, which states the problem clearly:

The first attempts at speech recognition were made in the nineteen-fifties and sixties, when the A.I. pioneers tried to simulate the way the human mind apprehends language. But where do you start? Even a simple concept like "yes" might be expressed in dozens of different ways--including "yes," "ya," "yup," "yeah," "yeayuh," "yeppers," "yessirree," "aye, aye," "mmmhmm," "uh-huh," "sure," "totally," "certainly," "indeed," "affirmative," "fine," "definitely," "you bet," "you betcha," "no problemo," and "okeydoke"--and what's the rule in that?

The article is mostly about speech recognition, but it definitely hits the main points about why human-generated language is so tricky. The second article, in the New York Times recently, is a short story about Watson, the computer that IBM is creating to compete on Jeopardy! IBM is trying to push the field of Question Answering quite a bit forward with this challenge. The goal is to create a computer that you can ask a natural language question and get the correct answer. A quick story in the article indicates that they may have a bit to go:

In a demonstration match here at the I.B.M. laboratory against two researchers recently, Watson appeared to be both aggressive and competent, but also made the occasional puzzling blunder.

For example, given the statement, "Bordered by Syria and Israel, this small country is only 135 miles long and 35 miles wide," Watson beat its human competitors by quickly answering, "What is Lebanon?"

Moments later, however, the program stumbled when it decided it had high confidence that a "sheet" was a fruit.

This whole Watson enterprise makes me wonder if there are applications for this kind of technology within the social sciences. Would this only be useful as a research aid, or are there empirical discoveries to be made with this? I suppose it comes down to this: if a computer could answer your question, what would you ask?

Posted by Matt Blackwell at 9:43 AM

May 10, 2009

Dobbie and Fryer on Charter Schools in the Harlem Children's Zone

David Brooks wrote a column a few days ago about Will Dobbie and Roland Fryer's working paper on the Harlem Children's Zone charter schools, which the authors report dramatically improved students' performance, particularly in math. Looking at the paper, I think it's a nice example of constructing multiple comparisons to assess the effect of a program and to do some disentangling of mechanisms.

The program they study is enrollment in one of the Promise Academy elementary and middle schools in Harlem Children's Zone, a set of schools that offer extended class days, provide incentives for teacher and student performance, and emphasize a "culture of achievement." The authors assess the schools' effect on student test scores by comparing the performance of students at the schools with that of other students. The bulk of the paper is concerned with how to define this group of comparable students who did not attend, and the authors pursue two strategies:

  • First, they examine cases where too many students applied to the school and slots were handed out by lottery; the comparison of lottery winners and non-lottery winners (and the accompanying IV estimate in which attending the school at some point is the treatment) allows them to estimate the effect of attending these schools under nearly experimental conditions, at least in years when lotteries were held.
  • Second, they compare students who were age-eligible and not age-eligible for the program, and students who were in the schools' recruitment area vs not in the schools' recruitment area. (This boils down to an IV in which the interaction of cohort and address instruments for attendance at the school.)
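As a rough illustration of the lottery-based IV logic in the first strategy, the sketch below computes a simple Wald estimate: the winner/loser gap in test scores divided by the winner/loser gap in attendance. The records are invented for illustration, and this is only the textbook binary-instrument estimator, not necessarily the specification Dobbie and Fryer estimate.

```python
from statistics import mean

# Invented records for illustration: (won_lottery, attended, test_score).
applicants = [
    (1, 1, 82), (1, 1, 78), (1, 0, 70), (1, 1, 80),
    (0, 0, 69), (0, 0, 71), (0, 1, 79), (0, 0, 73),
]

def wald_estimate(records):
    """IV (Wald) estimate with a binary instrument.

    Divides the winner/loser gap in outcomes (the intention-to-treat
    effect) by the winner/loser gap in attendance (the first stage).
    """
    winners = [r for r in records if r[0] == 1]
    losers = [r for r in records if r[0] == 0]
    itt = mean(r[2] for r in winners) - mean(r[2] for r in losers)
    first_stage = mean(r[1] for r in winners) - mean(r[1] for r in losers)
    return itt, first_stage, itt / first_stage

itt, first_stage, iv = wald_estimate(applicants)
print(itt, first_stage, iv)
```

Because not every winner attends and some losers do, the raw winner/loser gap understates the effect of attending; scaling by the first stage corrects for that under the usual IV assumptions.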

The estimated effect is very large, particularly for math. Because the estimates are based on comparisons both within the HCZ and between HCZ and non-HCZ students, the authors can speculate somewhat about the relative importance of the schooling itself vs other aspects of the HCZ: they tentatively suggest that the community aspects must not drive the results, because non-HCZ students did just as well.

Overall I thought it was a nice example of careful comparisons in a non-experimental situation providing useful knowledge. I don't really know this literature, but it seems like a case where good work could have a big impact.

Posted by Andy Eggers at 9:40 AM

April 13, 2009

Alley-oops as workplace cooperation

Here's a paper for the "high internal, low external validity" file (via Kevin Lewis):

Interracial Workplace Cooperation: Evidence from the NBA

Joseph Price, Lars Lefgren & Henry Tappen
NBER Working Paper, February 2009

Using data from the National Basketball Association (NBA), we examine
whether patterns of workplace cooperation occur disproportionately
among workers of the same race. We find that, holding constant the
composition of teammates on the floor, basketball players are no more
likely to complete an assist to a player of the same race than a
player of a different race. Our confidence interval allows us to
reject even small amounts of same-race bias in passing patterns. Our
findings suggest that high levels of interracial cooperation can occur
in a setting where workers are operating in a highly visible setting
with strong incentives to behave efficiently.

Posted by Andy Eggers at 6:51 PM

April 4, 2009

Can Nonrandomized Experiments Yield Accurate Answers?

Here is some recent progress (recent at least to me) on causal inference. William R. Shadish, M. H. Clark, and Peter M. Steiner published a paper in JASA (December 1, 2008, 103(484): 1334-1344) based on "a randomized experiment comparing random and nonrandom assignments." Basically, "In the randomized experiment, participants were randomly assigned to mathematics or vocabulary training; in the nonrandomized experiment, participants chose their training." As the authors acknowledge, the randomized and nonrandomized experiments unsurprisingly produced different estimates of the training effects, very likely because of selection bias driven by math phobia. The key finding is that statistical adjustments, including propensity score stratification, weighting, and covariance adjustment, can reduce estimation bias by about 58-96%.

Here is a link to the PPT of the paper. The comments on the paper are also very insightful.

Posted by Weihua An at 10:31 PM

March 25, 2009

How to teach methods

Over on the polmeth mailing list there is a small discussion brewing about how to teach undergraduate methods classes. Much of the discussion is on how to manage the balance between computation and statistics. A few posters are using R as their main data analysis tool, which provoked others to comment that this might push a class too far away from its original intent: to learn research methods (although one teacher of R indicated that a bigger problem was the relative inability to handle .zip files). This got me thinking about how research methods, computing and statistics fit into the current education framework.

As a gross and unfair generalization, much of college is about learning how to take a set of skills and use them to make effective and persuasive arguments. In a literature class, for instance, one might use the skills of reading and writing to critically engage a text. In mathematics, one might take the "skill" of logic and use it to derive a proof.

The issue with introductory methods classes is that many undergraduates come into school without a key skill: computing. It is becoming increasingly important to have proficient computing skills in order to make cogent arguments with data. I wonder if it is time to rethink how we teach computing at lower levels of education to adequately prepare students for the modern workplace. There is often emphasis on using computers to teach students, but I think it will become increasingly important to teach computers to students. This way courses on research methods can focus on how to combine computing and statistics in order to answer interesting questions. We could spend more time matching tools to questions and less time simply explaining the tool.

Of course, my argument reeks of passing the buck. A broader question is this: where do data analysis and computing fit in the education model? Is this a more fundamental skill that we should build up in children earlier? Is it perfectly fine where it is, being taught in college?

Posted by Matt Blackwell at 3:08 PM

March 7, 2009

How to Take Log of Zero Income

I ran into a problem when using a lognormal distribution to model the income distribution. A number of people in my dataset report zero income, perhaps because of unemployment, and I am wondering how to take the log of those zero incomes. I have noticed that some researchers simply drop the observations with zero income, while others assign them a small amount of income so that the logarithm is defined. Obviously, we can try both and see whether the results hold up. But I am wondering whether any experts on this topic can clarify the pros and cons of these and other approaches to handling zero incomes.

A related question is which model you think fits the income distribution best: a lognormal, a power distribution, a mixture model of a Normal and a point mass at zero, and so on.
I look forward to your thoughts on these questions.
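For concreteness, the options mentioned above can be compared side by side in a small sketch. The incomes are hypothetical, the added constants are arbitrary choices, and the third option pairs a point mass at zero with a lognormal for positive incomes, which is one possible reading of the mixture idea rather than a recommended model.

```python
import math
from statistics import mean, stdev

# Hypothetical incomes for illustration; zeros stand for no reported income.
incomes = [0, 0, 12000, 25000, 40000, 58000, 90000]

# Option 1: drop the zeros, then take logs.
log_positive = [math.log(y) for y in incomes if y > 0]

# Option 2: add a small constant c before taking logs. The result depends
# on the arbitrary choice of c, which is the main drawback.
def log_shifted(values, c):
    return [math.log(y + c) for y in values]

# Option 3: a two-part summary, with a point mass at zero plus a
# lognormal described by the mean and sd of the positive log incomes.
share_zero = sum(1 for y in incomes if y == 0) / len(incomes)
mu, sigma = mean(log_positive), stdev(log_positive)

print(mean(log_positive))
print(mean(log_shifted(incomes, 1)), mean(log_shifted(incomes, 100)))
print(share_zero, mu, sigma)
```

Comparing the two shifted means shows how sensitive the small-constant approach is: moving the constant from 1 to 100 changes the average log income noticeably, even though no data changed.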

Lastly, here is an interesting animation of the income distribution in the USA.

Posted by Weihua An at 6:07 PM

February 25, 2009

Missingness Maps and Cross Country Data

I've been doing some work on diagnostics for missing data issues and one that I have found particularly useful and enlightening has been what I've been calling a "missingness map." In the last few days, I used it on some World Bank data I downloaded to see what missingness looks like in a typical comparative political economy dataset.


[Figure: missingness map of the World Bank data, with country-years as rows and variables as columns]

The y-axis shows country-years and the x-axis shows variables. We draw a red square where the country-year-variable cell is missing and a light green square where the cell is observed. We can see immediately that a whole set of variables in the middle columns is almost always unobserved. These are variables measuring income inequality, and they are known to have extremely poor coverage. This plot very quickly shows us how listwise deletion will affect our analyzed sample and how the patterns of missingness occur in our data. For example, in these data, it seems that if GDP is missing, then many of the other variables, such as imports and exports, are also missing. I think this is a neat way to get a quick, broad view of missingness.
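The idea behind the map can be sketched in a few lines: mark each country-year-variable cell as missing or observed, then summarize by variable and by row. The tiny dataset below is hypothetical, the output is a text grid rather than a colored plot, and none of this is the Amelia implementation.

```python
# Hypothetical country-year data; None marks a missing cell.
variables = ["gdp", "imports", "exports", "gini"]
rows = [
    ("A", 1980, [None, None, None, None]),
    ("A", 1990, [5.1, 1.2, 1.0, None]),
    ("B", 1980, [3.3, 0.8, None, None]),
    ("B", 1990, [4.0, 0.9, 1.1, 0.41]),
]

def missingness_map(rows):
    """One string per country-year: 'X' if a cell is missing, '.' if observed."""
    return ["".join("X" if v is None else "." for v in values)
            for _, _, values in rows]

def share_missing(rows, n_vars):
    """Fraction of rows in which each variable is missing."""
    return [sum(values[j] is None for _, _, values in rows) / len(rows)
            for j in range(n_vars)]

def complete_cases(rows):
    """Country-years that survive listwise deletion."""
    return [(c, y) for c, y, values in rows
            if all(v is not None for v in values)]

text_map = missingness_map(rows)
for (country, year, _), line in zip(rows, text_map):
    print(country, year, line)
print(dict(zip(variables, share_missing(rows, len(variables)))))
print(complete_cases(rows))

# Re-sorting the rows by year shows how missingness changes over time.
by_year = sorted(rows, key=lambda r: r[1])
```

Even in this toy example, listwise deletion keeps a single country-year, and the inequality column drives most of the loss.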


We can also change the ordering of the rows to give a better sense of missingness. For the World Bank data, it is wise to re-sort the data by time and see how missingness changes over time.


[Figure: missingness map of the same data, with rows sorted by time]

A clear pattern emerges that the World Bank has better and better data as we move forward in time (the map becomes more "clear"). This is not surprising, but it is an important point when, say, deciding the population under study in a comparative study. Clearly, listwise deletion will radically change the sample we analyze (the answers will be biased toward more recent data, at the very least). The standard statistical advice of imputation or data augmentation is tricky as well here because we need to choose what to impute. Should we carry forth with imputation given that income inequality measures seem to be completely unavailable before 1985? If we remove observations before this, how do we qualify our findings?

Any input on the missingness map would be amazing, as I am trying to add it as a diagnostic to a new version of Amelia. What would make these plots better?

Posted by Matt Blackwell at 2:58 PM

February 21, 2009

My Basketball Friend

I met one of my friends on the basketball court. This is selection: I chose him as a friend because he plays good basketball and is an avid player. We have been friends for almost three years. When either of us wants to play, we usually call the other and meet on the court. Without knowing him I would still play basketball, but not as often. So we influence each other. Sometimes we eat Vietnamese noodles together at Le's right after a game. Contextual factors matter, but it is he who makes me eat noodles more often than I would by myself. Our friendship probably has some effect on both of our weights and may make them change more synchronously. Similarly, if you are a runner, you will surely like running with your friends and may run more because you have a runner as a friend. So the empirical question is whether you do in fact play more basketball when you gain a friend who likes playing basketball, and run more when you gain a friend who runs. It is also possible that because you play or run more, you eat more, which offsets the weight loss from the extra exercise.

Given only observational data, it is hard to disentangle the effects of selection, induction, and contextual factors on weight changes. We would have to assign you friends (roommates) randomly and check whether you and your friends gain or lose weight together, possibly because the two of you play more basketball, run more, eat similar things, have similar lifestyles, share similar standards about what constitutes a normal weight, etc.

It is interesting to see that the effects of friendship seem to be directional, or asymmetric. Only people you consider friends can induce you to lose weight. You cannot induce a person who does not consider you a friend to lose weight, even if you consider him your friend. This is a kind of opinion-leader effect.

The directionality of friendship effects also counters the contextual-factors hypothesis: if contextual factors mattered, you would expect friends' weight changes to correlate regardless of direction. You would also expect your neighbors' weight changes to synchronize with yours, and the weight of a friend who lives hundreds of miles away not to correlate with yours. But neither expectation is corroborated by the data.

Hence selection should be the largest concern in this case. The questions now are whether using changes in weight or in obesity status removes the selection effect, and how we could control for it better.

One of my friends told me two weeks ago that he did not buy the points in "The Spread of Obesity in a Large Social Network over 32 Years" until he read the real paper. I confessed, "Same here." Read the real paper, not the popular press. But you are absolutely not obligated to buy the points. Here is more reading.

K.P. Smith and N.A. Christakis, "Social Networks and Health," Annual Review of Sociology 34: 405-429 (August 2008)

Journal of Health Economics, Volume 27, Issue 5, September 2008

Ethan Cohen-Cole, Jason M. Fletcher, "Is obesity contagious? Social networks vs. environmental factors in the obesity epidemic", Pages 1382-1387.

Justin G. Trogdon, James Nonnemaker, Joanne Pais, "Peer effects in adolescent overweight", Pages 1388-1399.

J.H. Fowler, N.A. Christakis, "Estimating peer effects on health in social networks: A response to Cohen-Cole and Fletcher; and Trogdon, Nonnemaker, and Pais", Pages 1400-1405.

P.S. My friend and I have successfully induced several of our friends who did not originally play basketball to play more. Hopefully they will gain some weight rather than lose it, so that we can all play stronger and better.

Posted by Weihua An at 9:01 AM

February 17, 2009

Social pressure and biased refereeing in Italian soccer

I recently came across a paper by Per Pettersson-Lidbom and Mikael Priks that uses a neat natural experiment in Italian soccer to estimate the effect of stadium crowds on referees' decisions. After a bout of hooliganism in early February 2007, the Italian government began requiring soccer stadiums to fulfill certain security regulations; those stadiums that did not meet the requirements would have to hold their games without spectators. As a result, 25 games were played in empty stadiums that month, allowing Pettersson-Lidbom and Priks to examine game stats (like this) and see whether referees were more disposed toward the home team when the bleachers were filled with fans than when the stadium was empty. Looking at fouls, yellow cards, and red cards, the authors find that referees were indeed more likely to penalize the home team (and less likely to penalize the away team) in an empty stadium. There does not appear to be any effect of the crowd on players' performance, which suggests that referees were reacting to the crowd and not the players (and that fans should save their energy for haranguing the refs).

One of the interesting things in the results is that refs showed no favoritism toward the home team in games with spectators -- they handed out about the same number of fouls and cards to the home and away teams in those games. The bias shows up in games without spectators, where they hand out more fouls and cards to the home team. (The difference is not statistically significant in games with spectators but is in games without spectators.) If we are to interpret the empty stadium games as indicative of what refs would do if not subjected to social pressure, then we should conclude from the data that refs are fundamentally biased against the home team and only referee in a balanced way when their bias is balanced by crowd pressure. This would indeed be evidence that social pressure matters, but it seems unlikely that refs would be so disposed against the home team. A perhaps more plausible interpretation of the findings is that Italian refs are generally pretty balanced and not affected by crowds, but in the "empty stadium" games they punished the home team for not following the rules on stadium security. This interpretation of course makes the finding less generally applicable. In the end the example highlights the difficulty of finding "natural experiments" that really do what you want them to do -- in this case, illustrate what would happen if, quite randomly, no fans showed up for the game.

Posted by Andy Eggers at 8:25 AM

February 15, 2009

Bayesian Propensity Score Matching

Many people have realized that the conventional propensity score matching (PSM) method does not take into account the uncertainty in estimating propensity scores. In other words, for each observation, PSM assumes that there is only one fixed propensity score. In contrast, Bayesian methods can generate a sample of propensity scores for any observation, either by monitoring the posterior distributions of the estimated propensity scores directly or by predicting propensity scores from the posterior samples of the parameters of the propensity score model.

Matching on propensity scores obtained this way, we should expect to get a distribution of estimated treatment effects, which also provides an estimate of the standard error of the treatment effect. The Bayesian S.E. will be larger than the S.E. based on the PSM estimate, as it takes more uncertainty into account. This conjecture is confirmed by a recent paper by Lawrence C. McCandless, Paul Gustafson and Peter C. Austin, "Bayesian propensity score analysis for observational data," which appears in Statistics in Medicine (2009; 28:94-112). The authors show that the Bayesian 95% credible interval for the treatment effect is 10% wider than the conventional propensity score C.I.
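A rough sketch of the idea in the paragraphs above: draw many sets of propensity model coefficients, recompute the scores and redo the matching for each draw, and look at the spread of the resulting estimates. Everything here is an assumption for illustration: the data are simulated, the "posterior draws" come from an assumed normal approximation rather than an actual MCMC fit, and the one-to-one nearest-neighbor matching is a stand-in. It is not the method of McCandless, Gustafson and Austin.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)

# Simulated data for illustration: covariates x1 and x2, treatment t,
# outcome y. The true treatment effect is 2.0 by construction.
data = []
for _ in range(200):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    p = 1 / (1 + math.exp(-(0.8 * x1 - 0.5 * x2)))
    t = 1 if random.random() < p else 0
    y = 1.0 + 2.0 * t + 1.5 * x1 + 1.0 * x2 + random.gauss(0, 1)
    data.append((x1, x2, t, y))

def att_by_matching(data, b0, b1, b2):
    """ATT from matching each treated unit to the control unit with the
    nearest propensity score (with replacement), given logit coefficients."""
    def score(x1, x2):
        return 1 / (1 + math.exp(-(b0 + b1 * x1 + b2 * x2)))
    treated = [(score(x1, x2), y) for x1, x2, t, y in data if t == 1]
    controls = [(score(x1, x2), y) for x1, x2, t, y in data if t == 0]
    diffs = []
    for s, y in treated:
        _, y0 = min(controls, key=lambda c: abs(c[0] - s))
        diffs.append(y - y0)
    return mean(diffs)

# Stand-in for posterior draws of the coefficients (b0, b1, b2). A real
# analysis would take these from a Bayesian fit of the logit model.
draws = [(random.gauss(0.0, 0.15), random.gauss(0.8, 0.2),
          random.gauss(-0.5, 0.2)) for _ in range(100)]

estimates = [att_by_matching(data, *d) for d in draws]
print(mean(estimates), stdev(estimates))
```

The standard deviation of `estimates` reflects only the variation induced by uncertainty in the propensity model; it would still need to be combined with ordinary sampling variability to give a full standard error.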

It seems that we should expect Bayesian propensity score matching (BPSM) to perform better than PSM in cases where there is a lot of uncertainty in estimating the propensity scores. Before running any simulations, however, the question is: what are the sources of uncertainty in estimating propensity scores? From my point of view, there is at least one source, the uncertainty due to omitted variables. I do not think BPSM can do any better than PSM in solving this issue. But maybe BPSM can model the error terms and so provide better estimates of the propensity scores? The above authors argue that when the association between treatment and covariates is weak (i.e., when the betas are smaller), the uncertainty in estimating propensity scores is higher. Weak association means a smaller R-square or a larger AIC, etc. Is this equivalent to larger bias due to omitted variables?

Another type of uncertainty related to BPSM, but not to propensity scores, is the uncertainty due to the matching procedure. This is avoidable or negligible. More radically, we can just abandon matching and resort to a linear regression model to predict the outcomes. Or we can neglect the bias from the matching procedure, because when we only care about the ATT and there is a sufficient number of control cases, the bias is negligible, according to Abadie and Imbens 2006. ("Large Sample Properties of Matching Estimators for Average Treatment Effects." Econometrica 74 (1): 235 - 267.)

Of course, the logit model for the propensity scores could be wrong as well. But this can be manipulated in the simulations. Now my question is: how should we do the simulations to evaluate the performance of BPSM vs. that of conventional PSM?

Posted by Weihua An at 12:06 AM

February 5, 2009

Deaton on use of randomized trials in development economics

A new NBER paper by Angus Deaton takes on the trendiness of randomized trials, instrumental variables and natural experiments in development economics. One of the main points: well-designed experiments are most useful when they help uncover general mechanisms (i.e. inform theory) and can support real-life policy-making outside their narrow context. A good if lengthy read.

Deaton, A (2009) Instruments of development: Randomization in the tropics, and the search for the elusive keys to economic development, NBER Working Paper 14690.

Harvard users click here.

There is currently much debate about the effectiveness of foreign aid and about what kind of projects can engender economic development. There is skepticism about the ability of econometric analysis to resolve these issues, or of development agencies to learn from their own experience. In response, there is movement in development economics towards the use of randomized controlled trials (RCTs) to accumulate credible knowledge of what works, without over-reliance on questionable theory or statistical methods. When RCTs are not possible, this movement advocates quasi-randomization through instrumental variable (IV) techniques or natural experiments. I argue that many of these applications are unlikely to recover quantities that are useful for policy or understanding: two key issues are the misunderstanding of exogeneity, and the handling of heterogeneity. I illustrate from the literature on aid and growth. Actual randomization faces similar problems as quasi-randomization, notwithstanding rhetoric to the contrary. I argue that experiments have no special ability to produce more credible knowledge than other methods, and that actual experiments are frequently subject to practical problems that undermine any claims to statistical or epistemic superiority. I illustrate using prominent experiments in development. As with IV methods, RCT-based evaluation of projects is unlikely to lead to scientific progress in the understanding of economic development. I welcome recent trends in development experimentation away from the evaluation of projects and towards the evaluation of theoretical mechanisms.

Posted by Sebastian Bauhoff at 8:12 AM

February 3, 2009

What is Japan doing at 2:04pm?

You can now answer that question and so many more. The Japanese Statistics Bureau conducts a survey every five years called the "Survey on Time Use and Leisure Activities" where they give people journals to record their activities throughout the day. Thus, they have a survey of what people in Japan are doing at any given time of the day. This is fun data in and of itself, but it was made downright addictive by Jonathan Soma, who created a slick Stream Graph based on the data. (via kottke)

There are actually three Stream Graphs: one for the various activities, another for how the current activity differs between sexes and a final one for how the current activity breaks down by economic status. Thus, the view contains not only information about daily routines, but also how those routines vary across sex and economic status. For instance, gardening tends to happen in the afternoon and evening at around equal intensity and is fairly evenly distributed between men and women. Household upkeep, on the other hand, is done mostly by women and mostly in the morning. This visualization is so compelling, I think, because it allows for deep exploration of rich and interesting data (to be honest, though, I find the economic status categories a little strange and not incredibly useful).

Two points come to mind when seeing this. First, it would be fascinating to see how these would look across countries, even if it was just one other country. The category of this survey on the website for the Japanese Bureau of Statistics is "culture." Seeing the charts actually makes me wonder how different this culture is from other countries. Soma does point out, though, that Japanese men are rather interested in "productive sports" which is perhaps unique to the island.

Second, I think that Stream Graphs might be useful for other time-based data types. Long term survey projects, such as the General Social Survey, track respondent spending priorities. It seems straightforward to use a Stream Graph to capture how priorities shift over time. Other implemented Stream Graphs are the NYT box-office returns data and Lee Byron's playlist data. This graph type seems best suited for showing how different categories change over time and how rapidly they grow and how quickly they shrink. They also seem to require some knowledge of Processing. There are still some open questions here: What other types of social science data might these charts be useful for? How or should we incorporate uncertainty? (Soma warns that the Japan data is rather slim on the number of respondents)

Also: October 18th is Statistics Day in Japan. There are posters. And a slogan: "Statistical Surveys Owe You and You Owe Statistical Data"!

Posted by Matt Blackwell at 5:37 PM

February 1, 2009

Visualizing partisan discourse

Burt Monroe, Michael Colaresi, and our own Kevin Quinn have written an interesting paper (forthcoming in Political Analysis) assessing methods for selecting partisan features in language, e.g. which words are particularly likely to be used by Republicans or Democrats on a given topic. They have also provided a dynamic visualization of partisan language in the Senate on defense issues between 1997 and 2004 (screenshot below).

The most striking feature coming out of the visualization is that language on defense went through an unpolarized period leading up to 9/11 and even for several months afterward, but that polarized language blossomed in the leadup to the Iraq War and through the end of the period they examine, with Republicans talking about what they thought was at stake ("Saddam", "Hussein", "oil", "freedom", "regime") and the Democrats emphasizing the process ("unilateral", "war", "reconstruction", "billions"). (Link to visualization, a QuickTime movie.)


Posted by Andy Eggers at 8:36 AM

January 22, 2009

Studying the 2008 primaries with prediction markets: Malhotra and Snowberg

With Obama now in office the rest of the country may be about ready to move on from the 2008 election, but political scientists are of course still finding plenty to write about. Neil Malhotra and Erik Snowberg recently circulated a working paper in which they use data from political prediction markets in 2008 to examine two key questions about presidential primaries: whether primaries constrain politicians from appealing to the middle of the electorate and whether states with early primaries play a disproportionately large role in choosing the nominee. It's a very short and preliminary working paper that applies some novel methods to interesting data. Ultimately the paper can't say all that much about these big questions, not just because 2008 was an unusual year but also because of the limitations of prediction market data and the usual problems of confounding. But there is some interesting stuff in the paper and I expect it will improve in revision -- I hope these comments can help.

The most clever insight in the paper is that you can combine data from different prediction markets to estimate an interesting conditional probability -- the probability that a primary candidate will win the general election conditional on winning the nomination. (If p(G) is the probability of winning the general election and p(N) is the probability of winning the nomination (both of which are evident in prediction market contract prices), p(G|N) -- the probability of winning the general election if nominated -- can be calculated as p(G)/p(N).) In the first part of the paper, the authors focus on how individual primaries in the 2008 election affected this conditional probability for each candidate. This is interesting because classic theories in political science posit that primary elections force candidates to take positions that satisfy their partisans but hurt their general election prospects by making it harder for them to appeal to the electoral middle. If that is the case, then ceteris paribus one would expect that the conditional election probabilities would have gone down for Obama and Clinton each time it looked like the primary season would become more drawn out -- which is what happened as results of several of the primaries rolled in.
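The calculation in the parenthetical above can be sketched in a few lines of Python. The contract prices below are made up for illustration and are not taken from the paper. The identity p(G|N) = p(G)/p(N) also relies on the assumption that a candidate cannot win the general election without winning the nomination, so that p(G and N) = p(G).

```python
def conditional_win_prob(p_general, p_nomination):
    """p(G|N) = p(G) / p(N), assuming winning the general implies nomination."""
    if not 0 < p_nomination <= 1:
        raise ValueError("p_nomination must be in (0, 1]")
    if not 0 <= p_general <= p_nomination:
        raise ValueError("p_general must be between 0 and p_nomination")
    return p_general / p_nomination

# Hypothetical contract prices, read as probabilities.
p_nomination = 0.60   # price of a "wins the nomination" contract
p_general = 0.36      # price of a "wins the general election" contract

p_g_given_n = conditional_win_prob(p_general, p_nomination)
print(round(p_g_given_n, 3))
```

Tracking how this ratio moves from just before to just after a primary is the quantity the authors focus on. In practice the two prices come from different contracts that may be thinly traded, so small movements in the ratio should be read with caution.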

As it turns out, p(G|N) didn't move much in most primaries; if anything, it went up when the primary season seemed likely to extend longer (e.g. for Obama in New Hampshire). Perhaps this was because of the much talked about positive countervailing factors -- i.e. the extended primary season actually sharpened each candidate's electoral machines and increased their free media exposure. Of course, Malhotra and Snowberg have no way of knowing whether the binding effect of primaries exists and was almost perfectly counterbalanced by these positive factors, or whether none of these factors really mattered very much.

There is yet another possibility, which is that conditional probabilities did not move much for most primaries because most primaries did not change the market's view of how long the primary season would be. Knowing how the conditional probability changed during a particular primary only tells us something about whether having more primaries helps or hurts candidates' general election prospects if that primary changed people's expectations about how long the primary season would be. There were certainly primaries where this was the case (New Hampshire and Ohio/Texas come to mind) but for most of the primaries there was very little new information about how many more primaries would follow. Malhotra and Snowberg proceed as if they were looking for an average effect of a primary taking place on a candidate's conditional general election prospects, but if they want to talk about how having more primaries affects candidates' electability in the general election, they need to focus more squarely on cases where expectations about the length of the primary season actually changed (and, ideally, not much else changed). I would say the March Ohio/Texas primary was the best case of that, and at that time Barack Obama's p(G|N) dropped by 3 points -- a good indication that the market assumed that the net effect of a longer season on general election prospects was negative. (Although of course that primary also presumably revealed new information about whether Obama would be able to carry Ohio in the general election -- it's hard to disentangle these things.)

The second part of the paper explicitly considers the problem of assessing how "surprised" the prediction markets were in particular primaries (without explaining why this was not an issue in the first part), and employs a pretty ad hoc means of upweighting effect estimates for the relatively unsurprising contests. Some kind of correction makes sense but it seemed to me that the correction was so important in producing their results that it should be explained more fully in further revisions of the paper.

So to sum up, I liked the use of prediction markets to estimate the conditional general election probability for a candidate at a point in time, and I think it's worth getting some estimates of how particular events moved this probability. I think at this stage the conclusions are a bit underdeveloped and oversold, considering how many factors are at play and how unclear it is what information each primary introduced. But I look forward to future revisions.

Posted by Andy Eggers at 10:18 AM

January 16, 2009

Amazon Mechanical Turk for Data Entry Tasks

Yesterday I tried using Amazon's Mechanical Turk service for the first time to save myself from some data collection drudgery. I found it fascinating. For the right kind of task, and with a little bit of setup effort, it can drastically reduce the cost and hassle of getting good data compared to other methods (such as using RAs).

Quick background on Mechanical Turk (MTurk): The service acts as a marketplace for jobs that can be done quickly over a web interface. "Requesters" (like me) submit tasks and specify how much they will pay for an acceptable response; "Workers" (known commonly as "Turkers") browse submitted tasks and choose ones to complete. A Requester could ask for all sorts of things (e.g. write me a publishable paper), but because you can't do much to filter the Turkers and they aren't paid for unacceptable work, the system works best for tasks that can be done quickly and in a fairly objective way. The canonical tasks described in the documentation are discrete, bite-sized tasks that could almost be done by a computer -- indicating whether a person appears in a photo, for example. Amazon bills the service as "Artificial Artificial Intelligence," because to the Requester it seems as if a very smart computer were solving the problem for you (while in fact it's really a person). This is also the idea behind the name of the service, a reference to an 18th century chess-playing automaton that actually had a person inside (known as The Turk).

The task I had was to find the full text of a bunch of proposals from meeting agendas that were posted online. I had the urls of the agendas and a brief description of each proposal, and I faced the task of looking up each one. I could almost automate the task (and was sorely tempted), but it would require coding time and manual error checking. I decided to try MTurk.

The ideal data collection task on MTurk is the common situation where you have a spreadsheet with a bunch of columns and you need someone to go through and do something pretty rote to fill out another column. That was my situation: for every proposal I have a column with the url and a summary of what was proposed, and I wanted someone to fill in the "full text" column. To do a task like this, you need to design a template that applies to each row in the spreadsheet, indicating how the data from the existing columns should appear and where the Turker should enter the data for the missing column. Then you upload the spreadsheet and a separate task is created for each row in the spreadsheet. If everything looks good you post the tasks and watch the data roll in.

To provide a little more detail: Once you sign up to be a Requester at the MTurk website, you start the process of designing your "HIT" (Human Intelligence Task). MTurk provides a number of templates to get you started. The easiest approach is to pick the "Blank Template," which is very poorly named, because the "Blank Template" is in fact full of various elements you might need in your HIT; just cut out the stuff you don't need and edit the rest. (Here it helps to know some html, but for most tasks you can probably get by without knowing much.) The key thing is that when you place a variable in the template (e.g. ${party_id}), it will be filled by an entry from your spreadsheet, based on the spreadsheet's column names. So a very simple HIT would be a template that says

Is this sentence offensive? ${sentence}

followed by buttons for "yes" and "no" (which you can get right from the "Blank Template"). If you then upload a CSV with a column entitled "sentence" and 100 rows, you will generate 100 HITs, one for each sentence.
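To preview locally how one template plus one CSV turns into many tasks, here is a small Python sketch. It does not call MTurk at all; it only mimics the substitution step using the standard library's string.Template, which happens to use the same ${name} placeholder syntax. The three example sentences are invented.

```python
import csv
import io
from string import Template

# The template from the example above; ${sentence} is filled from the CSV
# column with the same name.
hit_template = Template("Is this sentence offensive? ${sentence}")

# Stand-in for the uploaded spreadsheet (invented rows).
csv_text = """sentence
The meeting starts at noon.
That referee is a disgrace.
Please pass the salt.
"""

# One rendered task per spreadsheet row.
rows = csv.DictReader(io.StringIO(csv_text))
hits = [hit_template.substitute(row) for row in rows]

for hit in hits:
    print(hit)
```

Rendering a few rows this way before uploading is a cheap check that the column names in the spreadsheet match the variable names in the template: `substitute` raises a `KeyError` if a placeholder has no matching column.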

It was pretty quick for me to set up my HIT template, upload a CSV, and post my HITs.

Then the real fun begins. Within two minutes the first responses started coming in; I think the whole job (26 searches -- just a pilot) was done in about 20 minutes. (And prices are low on MTurk -- it cost me $3.80.) I had each task done by two different Turkers as a check for quality, and there was perfect agreement.

One big question people have is, "Who are these people who do rote work for so little?" You might think it was all people in developing countries, but it turns out that a large majority are bored Americans. There's some pretty interesting information out there about Turkers, largely from Panos Ipeirotis's blog (a good source on all things MTurk in fact). Most relevant for understanding Turkers is a survey of Turkers he conducted via (of course) MTurk. For $.10, Turkers were asked to write why they complete tasks on MTurk. The responses are here. My takeaway was that people do MTurk HITs to make a little money when they're bored, as an alternative to watching TV or playing games. One man's drudgery is another man's entertainment -- beautiful.

Posted by Andy Eggers at 9:49 AM

January 13, 2009

Multiple comparisons and the "Axe" effect

Like many of us, I'm always on the lookout for good examples to use in undergraduate methods courses. My high school chemistry teacher (a former nun) said that the best teaching examples involved sex, food, or money, and that seems like reasonable advice for statistics as well. In that vein, I noted a recent article on the "Axe effect" in Metro:

'Axe effect' really works, a new study swears

Researchers in the U.K. asked women to rate the attractiveness of men wearing Axe's British counterpart, Lynx, against those who were wearing an odorless placebo.

On a 7-point scale, men wearing Lynx scored a 4.2, 0.4 point higher than those wearing the placebo.

But here's the catch: The women did not meet the men face-to-face. They watched them on video.

So what explains the discrepancy in ratings? Men wearing Lynx reported feeling more confident about themselves. So the difference in attitude appears more responsible for getting you lucky than the scent itself.

This story was not just reported in a subway tabloid; a long article appeared in the Economist. (Although at least the Metro story reported an effect size, unlike the Economist).

Is there an Axe effect? The news stories are reporting on a study in the International Journal of Cosmetic Science, "Manipulation of body odour alters men's self-confidence and judgements of their visual attractiveness by women". The researchers recruited male students and staff members from the University of Liverpool, randomly assigned some of them to use deodorant or a placebo. They then took photographs of the men as well as videos of them pretending to chat up an attractive woman. The photos and videos of the men were evaluated by "a panel of eight independent female raters" for attractiveness and self-confidence.

Medium             Attractiveness     Confidence
Photo              Not significant    (not asked)
Video, no sound    Significant!       Not significant
Video w/ sound     Not significant    Not significant

There may be an Axe effect on women's perception of men's attractiveness (but not self-confidence) if they see them on video if they can't hear them. Or it might be a fluke. This seems like a classic multiple comparison problem. With five tests, it is not that unlikely that one of them would be (barely) statistically significant. The proposed mechanism for the one "effect" (which attracted all of the media attention) was increased self-confidence on the part of the male subjects, so it seems a little odd that an effect would be found on perceived attractiveness and not on self-confidence. We might be more confident that something is going on if the effect sizes were reported for the non-significant results, but they don't appear in the paper. So, the Axe effect may be for real, but only if you keep your mouth shut.
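To put a rough number on "not that unlikely", here is a back-of-the-envelope calculation in Python. It assumes the five tests are independent and each uses a 0.05 significance level; neither assumption is confirmed by the paper, and since the same men and the same raters appear in every test, independence is surely only an approximation.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent true-null tests is significant."""
    return 1 - (1 - alpha) ** n_tests

# Five tests, as in the table above.
p_any = prob_at_least_one_false_positive(5)
print(round(p_any, 3))

# A Bonferroni-style threshold that keeps the overall rate at or below 0.05.
print(0.05 / 5)
```

Under these assumptions, a single "significant" result out of five would turn up by chance alone in roughly one study out of four or five.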

Posted by Mike Kellermann at 8:19 PM

January 6, 2009

NYT pays tribute to R

Today's New York Times has an article about the increasing popularity of R and what it means for commercial packages. See here for "Data Analysts Captivated by Power of R".

Posted by Sebastian Bauhoff at 11:09 PM

December 11, 2008

About those scatterplots . . .

Amanda Cox from the NYT graphics department gave a fun talk yesterday about challenges she and her colleagues face.

One of the challenges she discussed is statistical uncertainty -- how to represent confidence intervals on polling results, for example, while not sacrificing too much clarity. Amanda provided a couple of examples where the team had done a pretty poor job of reporting the uncertainty behind the numbers; in some cases doing it properly would have made the graphic too confusing for the audience and in others there may have been a better way.

She also talked about "abstraction," by which I think she meant the issue of how to graphically represent multivariate data. She showed some multivariate graphics the NYT had produced (the history of oil price vs. demand, growth in the CPI by categorized component) that I thought were quite successful, although some in the audience disagreed about the latter figure.

Amanda also showed the figure that I reproduced and discussed in an earlier post, in which I reported that the NYT graphics people think that the public can't understand scatterplots. Amanda disagrees with this (she said it annoys her how often people mention that point to her) and showed some scatterplots the NYT has produced. (She did say she thinks people understand scatterplots better when there is an upward slope to the data, which was interesting.)

The audience at the talk, much of which studies the media in some capacity and nearly all of which reads the NYT, seemed hungry for some analysis of the economics behind the paper's decision to invest so much in graphics. (Amanda said the paper spends $500,000 a month on the department.) Amanda wasn't really able to shed too much light on this, but said she felt very fortunate to be at a paper that lets her publish regression trees when, at many papers, the graphics team is four people who have their hands full producing "fun facts" sidebars and illustrations of car crash sites.

Posted by Andy Eggers at 8:37 AM

October 29, 2008

Bafumi and Herron on whether the US government is representative

Amid the name-calling, insinuation and jingoism of this political season it is easy to get a bit depressed about the democratic process. Joe Bafumi and Michael Herron have an interesting working paper that is cause for some comfort. The paper, entitled "Preference Aggregation, Representation, and Elected American Political Institutions," assesses the extent to which our federal political institutions are representative, in the sense that elected officials have similar views to those of their constituents. They do this by lining up survey questions from the Cooperative Congressional Elections Study (recently discussed in our weekly seminar by Steve Ansolabehere) alongside similar roll call votes recorded for members of Congress, as well as President Bush's positions on a number of pieces of legislation. There are enough survey questions to be able to place the survey respondents on an ideological scale (using Bayesian ideal point estimation), enough pieces of legislation to place the members of Congress and the President on an ideological scale, and enough survey questions that mirrored actual roll call votes to bring everyone together on a unified scale.

Overall, the authors find that the system is pretty effective at aggregating and representing voters' preferences. Members of Congress are more extreme than the constituencies they represent (perhaps because they represent partisans in their own districts), but the median member of a state's delegation is usually pretty close to the median voter in that state. Since the voters were surveyed in 2006, the paper is able to look at how the election affected the ideological proximity of government to the voters, and as one would hope Bafumi and Herron find that government moved somewhat closer to the voters as a result of the legislative reshuffling.

Below is one of the interesting figures from the paper. The grey line shows the density of estimated ideal points among the voters (i.e. CCES survey respondents); the green and purple solid lines are the density of estimated ideal points among members of the current House and Senate. The arrows show the location of the median member of the current and previous House and Senate, the median American at the time of the 2006 election (based on the survey responses), and President Bush. As you can see, before the 2006 election the House and Senate were both to the right of the median American (as was President Bush); after the Democratic sweep Congress has moved closer to the median American. Members of Congress are more partisan than the voters throughout, although this seems to be more the case on the right than the left.

Posted by Andy Eggers at 9:45 AM

October 25, 2008

A General Inequality Parameter

There is an interesting paper by Guillermina Jasso and Samuel Kotz in Sociological Methods & Research in which they analyze the mathematical connections between two kinds of inequality: inequality between persons and inequality between subgroups. They show that a general inequality parameter (a shape parameter c of a two-parameter continuous univariate distribution), or a deep structure of inequality, governs both types of inequality. More concretely, they demonstrate that convenient measures of personal inequality, like the Gini coefficient, Atkinson's measure, Theil's MLD and Pearson's coefficient of variation, and measures of inequality between subgroups are nothing but functions of this general inequality parameter c. The c parameter, according to the authors, also governs the shape of the Lorenz curve, a conventional graphical tool for expressing inequality.

Given the unitary operation of this inequality parameter, the authors conclude that there is a monotonic connection between personal inequality and between-group inequality: as personal inequality increases, so does between-group inequality. This conclusion is somewhat surprising, and even contradicts the intuition that personal inequality can plausibly change through within-group transfers while between-group inequality stays the same. The authors acknowledge that their conclusion holds only under a certain set of conditions. For example, the derived relation between the two types of inequality assumes a two-parameter distribution and non-intersecting Lorenz curves. You may consult the full article for more technical details if interested.

Jasso, Guillermina and Samuel Kotz. 2008. "Two Types of Inequality: Inequality Between Persons and Inequality Between Subgroups." Sociological Methods & Research 37: 31-74.

click here to get a working paper version of that from IDEAS

Posted by Weihua An at 2:40 PM

October 22, 2008

Useful metric for comparing two distributions?

In reading Bill Easterly's working paper "Can the West Save Africa?," I came across an interesting metric Easterly uses to compare African nations with the rest of the world on a set of development indicators. The metric is, "Given that there are K African nations, what percent of the K lowest scoring countries were African?" I don't think I've ever seen anyone use that particular metric, but maybe someone has. Does it have a name? Does it deserve one?

Generally, looking at the percent of units below (or above) a certain percentile that have some feature is a way of describing the composition of that tail of the distribution. What's interesting about using a cutoff corresponding to the total number of units with that feature is that it produces an intuitive measure of overlap of two distributions: it gives us a rough sense of how many countries would have to switch places before all the worst countries were African or, put differently, before all of the African countries are in the worst group. It reminds me a bit of measures of misclassification in machine learning, where here the default classification is, "All the worst countries are African."

Needless to say, the numbers were bleak -- 88% for life expectancy, 84% for percent of population with HIV, 75% for infant mortality.
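For anyone who wants to try the metric on their own data, here is a minimal sketch in Python. The function name and the toy scores are invented for illustration and are not Easterly's data. It assumes that lower scores are worse; for an indicator where higher is worse (such as infant mortality), negate the scores first.

```python
def share_of_bottom_k(scores, in_group):
    """Share of the K lowest-scoring units that belong to the group,
    where K is the number of units in the group."""
    k = sum(in_group)
    if k == 0:
        raise ValueError("the group must contain at least one unit")
    # Sort units from lowest to highest score; ties keep their input order.
    ranked = sorted(zip(scores, in_group), key=lambda pair: pair[0])
    return sum(flag for _, flag in ranked[:k]) / k

# Toy example: 8 countries, 3 of which belong to the group of interest.
scores = [40, 45, 50, 52, 60, 70, 75, 80]
in_group = [True, True, False, True, False, False, False, False]

overlap = share_of_bottom_k(scores, in_group)
print(overlap)
```

In the toy data two of the three lowest scores belong to the group, so one pair of countries would have to swap places before the bottom three were all group members. That is the "switching places" reading of the metric described above.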

Posted by Andy Eggers at 11:02 PM

October 15, 2008

Alfred Marshall, apologist for blog readers

Like many people I know, I often find it hard to stay on task and avoid the temptations of the internet while I work. Email, blogs, news of financial meltdown -- I find myself turning to these distractions in between spurts of productivity, knowing that I would get more done if I just turned off the wireless and kept on task for longer stretches of time.

Well, those of us who have trouble giving up our blogs and other internet distractions may have an unlikely enabler in Alfred Marshall, the great economist. When he was seventeen, Marshall observed an artist who took a lengthy break after drawing each element of a shop window sign. As he later recounted, the episode shaped his own productivity strategy, towards something that sounds vaguely similar to my own routine:

That set up a train of thought which led me to the resolve never to use my mind when it was not fresh, and to regard the intervals between successive strains as sacred to absolute repose. When I went to Cambridge and became full master of myself, I resolved never to read a mathematical book for more than a quarter of an hour at a time without a break. I had some light literature always by my side, and in the breaks I read through more than once nearly the whole of Shakespeare, Boswell's Life of Johnson, the Agamemnon of Aeschylus (the only Greek play I could read without effort), a great part of Lucretius and so on. Of course I often got excited by my mathematics, and read for half an hour or more without stopping, but that meant that my mind was intense, and no harm was done.

Now, somehow I doubt that Marshall would consider the NYT op-ed pages to be "light literature" on par with Boswell, or that he would agree that watching incendiary political videos qualifies as "absolute repose." But never mind that. Alfred Marshall told me I shouldn't work for more than fifteen minutes without distractions!

Posted by Andy Eggers at 8:06 AM

October 7, 2008

DOL visa data reveals salaries for academic jobs

With many of my friends preparing for the annual job market song and dance, one question they will soon have is which salary expectations are appropriate for which position and institution.

It seems hard to know. Fortunately (and somewhat incredibly) the Department of Labor Foreign Labor Certification Data Center not only collects employer petitions for H-1B visas for foreign professionals, but also posts them online. The data goes back to 2001; information for other visa types is sometimes available for earlier years. Overall this seems like a great source for labor economics studies, for example of the effects of visa restrictions. (Let us know if you use it!)

But the data is also good for a quick reality check on salary expectations. You can search by institution on the DOL website or type in a keyword in this search engine.

For example, looking for "assistant professor economics harvard" will reveal two visa petitions from the university, with a proposed salary of $115,000 in 2005. Stanford proposed to pay $120,000 in early 2006. The data is not just limited to academic jobs of course. You can also see that Morgan Stanley proposed to pay $85,000 for an analyst in New York in 2006. Or that a taxi company in Maryland proposed $11.41 per hour.

Naturally the data is limited since it only covers a specific group of job applicants. Maybe they'll take a lower salary in exchange for help with the visa, or they get paid more to leave their home countries. But the relative scales across institutions could be similar and it's better than no idea at all. Good luck on your job hunts and negotiations!

Posted by Sebastian Bauhoff at 2:40 PM

September 26, 2008

Recommend a Book for Probability Theory

For those of you who want to do some exercises or solve typical problems in probability theory and random processes, I strongly recommend a book by Geoffrey Grimmett and David Stirzaker, One Thousand Exercises in Probability. As the authors say in the preface, there are over three thousand problems in the book, since many exercises include several parts. Personally, I find this book very useful, partly because all exercises come with solutions, which makes it much more readable than many of its counterparts, and partly because I have noticed that some faculty here tend to adopt its exercises for class assignments and exams. (Am I the first person here to notice this?) So I recommend this book to you; hopefully it will help you deepen your understanding of those daunting proofs in probability theory and random processes. With luck, you may even learn to get used to them in von Neumann's sense.

In mathematics you don't understand things, you just get used to them.

John von Neumann

Posted by Weihua An at 7:59 PM

September 25, 2008

New NBER paper charts history and future of field experiments in economics

The NBER just posted a new working paper by Steven Levitt and John List, "Field Experiments in Economics: The Past, The Present, and The Future." I have only glanced at it, but the paper looks like an easy-to-read history of field experiments in economics and a (short) summary of their limitations. Levitt and List also suggest that partnerships with private institutions could be the future of this field. It seems like a natural conclusion. Collaborating with the private sector should create more opportunities for good research, and the money and infrastructure will be attractive to researchers. And anyway what other sector is left to be conquered? But maybe such partnerships are only useful for certain areas of research (Levitt and List suggest the setting could be a useful laboratory for the field of industrial organization). And firms, like any institution, must have an interest in participating. This might be fine for learning about fundamental economic behavior, but will we see more declarations of interest on experiments related to policy?

Levitt, S and List, J (2008) "Field Experiments in Economics: The Past, The Present, and The Future." NBER Working Paper 14356.

Harvard users click here for PIN access.

This study presents an overview of modern field experiments and their usage in economics. Our discussion focuses on three distinct periods of field experimentation that have influenced the economics literature. The first might well be thought of as the dawn of "field" experimentation: the work of Neyman and Fisher, who laid the experimental foundation in the 1920s and 1930s by conceptualizing randomization as an instrument to achieve identification via experimentation with agricultural plots. The second, the large-scale social experiments conducted by government agencies in the mid-twentieth century, moved the exploration from plots of land to groups of individuals. More recently, the nature and range of field experiments has expanded, with a diverse set of controlled experiments being completed outside of the typical laboratory environment. With this growth, the number and types of questions that can be explored using field experiments has grown tremendously. After discussing these three distinct phases, we speculate on the future of field experimental methods, a future that we envision including a strong collaborative effort with outside parties, most importantly private entities.

Posted by Sebastian Bauhoff at 7:30 AM

September 24, 2008

Government as API provider

The authors of "Government Data and the Invisible Hand" provide some interesting advice about how the next president can make the government more transparent:

If the next Presidential administration really wants to embrace the potential of Internet-enabled government transparency, it should follow a counter-intuitive but ultimately compelling strategy: reduce the federal role in presenting important government information to citizens. Today, government bodies consider their own websites to be a higher priority than technical infrastructures that open up their data for others to use. We argue that this understanding is a mistake. It would be preferable for government to understand providing reusable data, rather than providing websites, as the core of its online publishing responsibility.

I've blogged here a couple of times about the role transparency-minded programmers and other private actors are playing in opening up access to government data sources. This paper draws the logical policy conclusion from what we've seen in the instances I blogged about: that third parties often do a better job of bringing important government data to the people than the government does. The upshot of the paper is that the government should make it easier for those third parties to make the government websites look bad. By focusing on providing structured data, the government will save web developers some of the hassle involved in parsing and combining data from unwieldy government sources, and reduce the time between the release of a clunky government site and the release of a private site that repackages the underlying data and combines it with new sources in an interesting way.

Of course, to the extent that government data is made available in more convenient formats, our work as academic researchers gets easier too, and we can spend more time on analysis and less on data wrangling. In fact, for people doing social science stats, it's really the structured data and not the slick front-end that is important (although many of the private sites provide both).

I understand that this policy proposal is an idea that's been circulating for a while (anyone want to fill me in on the history?) and apparently both campaigns have been listening. It will be interesting to see whether these ideas lead to any change in the emphasis of government info policy.

Posted by Andy Eggers at 9:09 AM

September 18, 2008

Call for papers: the Midwest Poli Sci conference gets interdisciplinary

From Jeff Segal via Gary King, we get the following call for papers for the Midwest Political Science Conference. An interesting bit of news here is that the conference is introducing a registration discount for people outside of the discipline.

Ask your favorite political scientist what the biggest political science conference is, and she'll tell you it's the American Political Science Association. Ask her what the best political science conference is and she'll tell you it's the Midwest Political Science Association meeting, held every April in the beautiful Palmer House in Chicago.

The Midwest Political Science Association, like most academic associations, charges higher conference registration rates for nonmembers than to members. Hoping to continue to increase attendance by people outside of political science and related fields at its annual meeting, the Association will begin charging the lower (member) rate to registrants who 1) have academic appointments outside of political science or related fields (policy, public administration and political economy) and 2) do not have a PhD in political science or the same related fields.

In addition, the Association grants, on request, a substantial number of conference registration waivers for first time participants who are outside the discipline.

The call for papers for the 2009 meeting, due October 10, is at

Hope to see you in Chicago.


Jeffrey Segal, President
Midwest Political Science Association

Posted by Andy Eggers at 6:41 AM

September 2, 2008

Study on DTCA creates media attention for causal inference

The British Medical Journal just published a great piece by Michael Law* and co-authors on the (in-)effectiveness of direct-to-consumer advertising (DTCA) for pharmaceuticals. This issue continues to be politically controversial and expensive for companies, and good studies are rare. Mike makes use of the linguistic divide in his home country, Canada, to evaluate the effectiveness of the ads. Canadian TV stations are not allowed to broadcast pharma ads. French speakers have no choice but to comply, but English-speaking Canada gets to watch ads for pharmaceuticals on US TV stations. The results suggest that for the three drugs under study, the effects of DTCA may be very small and short-term.

An interesting fallout of this work is a wave of media attention for causal inference and identifying counterfactuals. For example the WSJ writes

[...] the new study will draw some attention because it is among the first to compare the behavior of people exposed to drug ads with people who weren't.

And the New Scientist says

However, consumer advertising is usually accompanied by other marketing efforts directly to doctors, making it difficult to tease out the effect of the ads alone.

See here for a longer list of articles at Google News.

I think it's great that the study creates so much interest (meaning it's relevant in real life) and that the media gets interested in research design. I'm curious to see the wider repercussions on both issues.

Law, Michael, Majumdar, Sumit and Soumerai, Stephen (2008) "Effect of illicit direct to consumer advertising on use of etanercept, mometasone, and tegaserod in Canada: controlled longitudinal study" BMJ 2008;337:a1055

* Disclosure: Mike is a recent graduate of the PhD in Health Policy, and a classmate and friend of mine.

Posted by Sebastian Bauhoff at 9:32 PM

June 26, 2008

Exxon-tainted research?

A few bloggers at other sites (Concurring Opinions and Election Law Blog) have pointed out an interesting footnote in the Supreme Court's recent decision on punitive damages in the Exxon Valdez case. Justice Souter took note of experimental research on jury decisionmaking done by Cass Sunstein, Daniel Kahneman, and others, but then dismissed it for the purposes of the decision because Exxon had contributed funding for the research:

The Court is aware of a body of literature running parallel to anecdotal reports, examining the predictability of punitive awards by conducting numerous “mock juries,” where different “jurors” are confronted with the same hypothetical case. See, e.g., C. Sunstein, R. Hastie, J. Payne, D. Schkade, W. Viscusi, Punitive Damages: How Juries Decide (2002); Schkade, Sunstein, & Kahneman, Deliberating About Dollars: The Severity Shift, 100 Colum. L. Rev. 1139 (2000); Hastie, Schkade, & Payne, Juror Judgments in Civil Cases: Effects of Plaintiff’s Requests and Plaintiff’s Identity on Punitive Damage Awards, 23 Law & Hum. Behav. 445 (1999); Sunstein, Kahneman, & Schkade, Assessing Punitive Damages (with Notes on Cognition and Valuation in Law), 107 Yale L. J. 2071 (1998). Because this research was funded in part by Exxon, we decline to rely on it.

It will be interesting to see whether this position is taken up by the lower courts; if so, we might see less incentive for private actors to fund social science research. That could be good or bad, I suppose, depending on one's views of the likelihood that researchers will be unduly influenced by their funding sources.

Posted by Mike Kellermann at 1:13 PM

June 13, 2008

Awards for IQSS faculty

Two awards given by the Society for Political Methodology were announced today, and both of them went to IQSS faculty members (and co-authors).

The Gosnell Prize is given to the "best paper on political methodology given at a conference", and this year's prize was awarded to Kevin Quinn for his paper "What Can be Learned from a Simple Table? Bayesian Inference and Sensitivity Analysis for Causal Effects from 2x2 and 2x2xK Tables in the Presence of Unmeasured Confounding." From the announcement:

Quinn's paper offers a set of steps to improve inference with binary independent and dependent variables and unmeasured confounds. He derives large sample, non-parametric bounds on the average treatment effect and shows how these bounds do not rely on auxiliary assumptions. He then provides a graphical way to depict the robustness of inferences as one changes assumptions about the confounds. Finally, he shows how one can use a Bayesian framework relying on substantive knowledge to restrict the set of assumptions on the confounds to improve inference.
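To get a feel for what nonparametric bounds on an average treatment effect look like in a 2x2 table, here is a minimal sketch. It computes the classical no-assumption bounds for a binary treatment X and binary outcome Y; treating these as a stand-in for the bounds in Quinn's paper is an assumption, and the counts in the table are hypothetical.

```r
# Hypothetical 2x2 table of counts: rows = treatment X, columns = outcome Y.
tab <- matrix(c(30, 20,    # X = 0: counts for Y = 0, Y = 1
                10, 40),   # X = 1: counts for Y = 0, Y = 1
              nrow = 2, byrow = TRUE,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))
p <- tab / sum(tab)        # joint proportions P(X = x, Y = y)

p_x1 <- sum(p["1", ])      # P(X = 1)
p_x0 <- sum(p["0", ])      # P(X = 0)

# E[Y(1)] is observed only where X = 1; the unobserved part lies in [0, 1].
# The same logic applies to E[Y(0)] where X = 0.
ate_lower <- p["1", "1"] - (p["0", "1"] + p_x1)
ate_upper <- (p["1", "1"] + p_x0) - p["0", "1"]

c(lower = ate_lower, upper = ate_upper)
```

Without further assumptions the interval always has width one, which is why restricting the assumptions about the confounders, as the announcement describes, is what makes the inference informative.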

The Warren Miller prize is given annually to the best paper appearing in Political Analysis. This year's prize has been awarded to Daniel E. Ho, Kosuke Imai, Gary King, and Elizabeth A. Stuart for their article, "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference." The abstract of their paper follows:

Although published works rarely include causal estimates from more than a few model specifications, authors usually choose the presented estimates from numerous trial runs readers never see. Given the often large variation in estimates across choices of control variables, functional forms, and other modeling assumptions, how can researchers ensure that the few estimates presented are accurate or representative? How do readers know that publications are not merely demonstrations that it is possible to find a specification that fits the author's favorite hypothesis? And how do we evaluate or even define statistical properties like unbiasedness or mean squared error when no unique model or estimator even exists? Matching methods, which offer the promise of causal inference with fewer assumptions, constitute one possible way forward, but crucial results in this fast-growing methodological literature are often grossly misinterpreted. We explain how to avoid these misinterpretations and propose a unified approach that makes it possible for researchers to preprocess data with matching (such as with the easy-to-use software we offer) and then to apply the best parametric techniques they would have used anyway. This procedure makes parametric models produce more accurate and considerably less model-dependent causal inferences.
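The two-step idea in the abstract (preprocess with matching, then fit the parametric model you would have used anyway) can be sketched in base R. This is only an illustration on simulated data with a single covariate and a hand-rolled nearest-neighbor match; it does not use the authors' software, and every variable name here is made up for the example.

```r
set.seed(42)  # simulated data; none of this comes from the paper

n     <- 400
x     <- rnorm(n)                             # a single confounder
treat <- rbinom(n, 1, plogis(x))              # treatment more likely when x is high
y     <- 2 * treat + x + 0.5 * x^2 + rnorm(n) # true effect of treat is 2
dat   <- data.frame(y, treat, x)

# Step 1: preprocess. For each treated unit, keep the nearest control on x
# (1-to-1 nearest-neighbor matching, with replacement).
treated  <- dat[dat$treat == 1, ]
controls <- dat[dat$treat == 0, ]
nearest  <- sapply(treated$x, function(v) which.min(abs(controls$x - v)))
matched  <- rbind(treated, controls[nearest, ])

# Step 2: fit the same parametric model on the raw and the matched data.
fit_raw     <- lm(y ~ treat + x, data = dat)
fit_matched <- lm(y ~ treat + x, data = matched)

coef(fit_raw)["treat"]
coef(fit_matched)["treat"]
```

The linear model is deliberately misspecified (it omits the squared term), so comparing the two coefficients gives a rough sense of how much the estimate depends on the specification before and after matching. The sketch ignores the weights that matching with replacement would call for when computing standard errors.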

Posted by Mike Kellermann at 2:22 PM

May 31, 2008

The Tree-Friendly Academic, Part II: The Editing Process, and Getting Off the Monitor

I'm grateful for the strong response to my original query for quality, free PDF annotation for Linux. In general, there seem to be a few categories.

-Windows-based editors, adaptable through emulators: PDF X-change, Foxit (free version), primopdf
-Linux editors with non-portable annotations: Okular, which has hidden XML files for its annotations (skim, for OS X, has the same scheme)
-early, incomplete solutions that will eventually be good: GNU's PDF project, Xournal
-early, incomplete solutions that aren't user-friendly: pdfedit, Cabaret Stage
-early solutions that are still in progress: evince

Of all of these options, I like Okular the best, mainly because integrating its XML-saved annotations into the PDF is but one plugin away (which might already exist, for all I know), and it's theoretically portable to Windows by installing qt4 binaries. Using an emulator like wine is a hassle big enough that I've avoided it, for the same reason I don't use cygwin on Windows systems.

So we're close to a (more) universal free editing environment. But I'm still not a fan of doing all my work on a screen, and also not willing to print. So I'm trying a middle road.

I bought an iLiad e-paper reader this past week, and so far I'm impressed with how it handles (though its price tag, $600 for the model I bought, definitely isn't for everyone, and was almost not for me). The screen is easily readable, the battery lasts, and I can zoom in and rotate documents to get a half-page display with larger text. More importantly, the device runs Linux and iRex has made a point to try and use open source software as much as possible, in contrast to Amazon and the Kindle (which is half the size, can't read PDFs and can't edit books.)

However, as the project is still in its relative infancy, there are a few functions it has yet to incorporate that I really would like, and they're the same ones I want in a computer-based annotator: highlighting multiple-column text, for example, so that I can extract passages I want later at the push of a button. And like Okular, the annotations made on the iLiad are saved in a companion XML file rather than the original PDF, but the company offers a free program to do the merging.

I'm going to continue to explore what the iLiad can do as far as editing, but it's definitely reassuring that everyone who's seen me use it has oohed and aahed at it.

To sum up, I've now got a free platform for reading, editing and annotating PDFs on a Linux machine, and an auxiliary paper-free method for reading them later which is admittedly not free. And I have more needs as well, but I can at least see them being met soon. What else do people want in paperless work we haven't covered yet?

P.S. If the people from iRex are reading this and want me to shill for them for real, they can let me know directly.

Posted by Andrew C. Thomas at 11:05 PM

May 26, 2008

The Tree-Friendly Academic: Whither A Useful Free PDF Editor?

I'm a Linux user in need of a quality PDF reader with basic annotation tools, and I need it to be available for free. Think I'm asking for too much?

We're at a point where the level of content available online dwarfs our ability to print it all onto paper for examination and notation. As academics, we're expected to sort through volumes of other people's work in order to verify that our own is original, as well as comment, annotate, and on occasion make corrections or forward-references to later works.

But despite a boom in computational power and information bandwidth, the software to do this without resorting to printed or copied matter isn't accessible to most students without paying through the nose. Full software suites like Adobe Acrobat aren't necessary for the kind of work academics need to do. There are a few functions that are essential to the task, currently available in commercial software:

-Adding and reading notes, whether free-floating or attached to highlighted text
-The ability to select and copy multi-column text (none of the free ones seem to be able to get this one right)
-Pop-up previews for links: when LaTeX creates a link to a footnote or citation, hovering over the displayed link should bring up a box displaying the information.

I'm a man with big ideas but no time, and more importantly, no budget, to motivate and drive the development and use of a free PDF reader with mild annotation capabilities. I can't resort to the for-pay software available from the school website because I'm running Linux, and I shouldn't have to go to a virtual machine or another computer to do this kind of annotation. Likewise, others shouldn't have to spend hundreds for software where they only need a few simple functions.

I suppose the issue is that everyone has their own toys they want included in a PDF editor, which is why the commercial package makes sense. But as academics, wouldn't we be happy with "the basics plus"?

Posted by Andrew C. Thomas at 6:34 PM

May 22, 2008

Nicholas and James are Featured in the NYT again

Professor Nicholas Christakis and Professor James Fowler's study on social networks and smoking cessation, which will appear in the New England Journal of Medicine this Thursday, is featured in the New York Times. Congratulations to them!

Their basic findings are that smokers are likely to quit in groups (As Nicholas said, "Whole constellations are blinking off at once.") and that the remaining smokers tend to be socially marginalized.

One question I have about their study: if friends tend to quit smoking together, could this partly explain the simultaneous weight gain among friends, a result Nicholas and James found last year using the same dataset? In other words, I fully accept that social ties have important effects on individuals' wellbeing, but if you study one wellbeing outcome without controlling for the "contaminating" effects of other outcomes, the estimated social network effect on the former could be biased. For example, the weight gain among friends could, from this point of view, partly result from their quitting smoking at the same time. Of course, if smokers make up only a very small fraction of the sample and their weight changes are not too extreme, the bias should not pose a serious problem.

See the following link for a glimpse of their study.

Study Finds Big Social Factor in Quitting Smoking

Sorry for the duplicate if you have noticed this news.

Posted by Weihua An at 12:01 PM

May 19, 2008

Harvard Program on Survey Research (on Youtube)

Mark Blumenthal has been posting interviews with scholars at the 2008 AAPOR conference, including two with our very own Sunshine Hillygus and Chase Harrison from the Program on Survey Research:

Posted by Mike Kellermann at 10:50 AM

May 15, 2008

Placebo effects and the probability of assignment to active treatment

I just finished reading an interesting paper on placebo effects in drug trials by Anup Malani. Malani noticed that participants in high probability trials know that they are more likely to get active treatment (because of informed consent prior to the trial). They have higher expectations and hence should have higher placebo effects than patients in low probability trials. Malani compares outcomes across trials with different assignment probabilities and finds evidence for placebo effects. A related finding is that the control group in high probability trials reports more side effects.

The paper discusses some potential implications of placebo effects, e.g. that patients who are optimistic about the outcome might change their behavior and hence get better even without the active drug. It makes me wonder how this might translate into non-medical settings and whether there are studies of placebo effects in the social sciences. Also, if placebo drugs can improve health outcomes, maybe ineffective social programs would still work as long as participants don’t know whether the program works or doesn’t? Maybe this is the role of politics. But what about the side-effects?

Malani, A (2006) “Identifying Placebo Effects with Data from Clinical Trials” Journal of Political Economy, Vol. 114, pp. 236-256.

A medical treatment is said to have placebo effects if patients who are optimistic about the treatment respond better to the treatment. This paper proposes a simple test for placebo effects. Instead of comparing the treatment and control arms of a single trial, one should compare the treatment arms of two trials with different probabilities of assignment to treatment. If there are placebo effects, patients in the higher-probability trial will experience better outcomes simply because they believe that there is a greater chance of receiving treatment. This paper finds evidence of placebo effects in trials of antiulcer and cholesterol-lowering drugs.
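The logic of the test is easy to try on simulated data. The sketch below is not Malani's analysis: the outcome model, the effect sizes, and the function name simulate_treated_arm are all assumptions made for illustration. It generates two trials with different assignment probabilities and compares their treatment arms.

```r
set.seed(1)  # simulated data only; no real trial results

simulate_treated_arm <- function(n, p_treat, drug_effect = 1, placebo_effect = 0.5) {
  treated <- rbinom(n, 1, p_treat) == 1
  # Assumed outcome model: a drug effect for treated patients, plus a
  # placebo component that grows with the known assignment probability.
  outcome <- drug_effect * treated + placebo_effect * p_treat + rnorm(n)
  outcome[treated]   # keep only the treatment arm
}

low  <- simulate_treated_arm(5000, p_treat = 0.5)
high <- simulate_treated_arm(5000, p_treat = 0.9)

# Compare the treatment arms of the two trials rather than treatment vs.
# control within one trial.
t.test(high, low)
mean(high) - mean(low)
```

Both arms receive the same drug, so under this assumed model any systematic difference between them comes only from the placebo component, which is the comparison the abstract proposes.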

Posted by Sebastian Bauhoff at 12:00 PM

May 13, 2008

Data sets and data interfaces at datamob

I recently came across datamob, a site featuring public datasets and the interfaces that have been built to help the public explore them.

From datamob's about page:

Our listings emphasize the connection between data posted by governments and public institutions and the interfaces people are building to explore that data.

It's for anyone who's ever looked at a site like and wondered, "Where did they get their data?" And for anyone who ever looked at THOMAS and thought, "There's got to be a better way to organize this!"

I continue to wonder how the types of interfaces featured on datamob will affect the dissemination of information in society. The dream of a lot of these interface builders is to disintermediate information provision -- i.e., to make it possible for citizens to do their own research, produce their own insights, and publish their findings on blogs and via data-laden widgets. (We welcomed Fernanda and Martin from Many Eyes, two prominent participants in this movement, earlier this year at our applied stats workshop.) At the same time, the new interfaces make it cheaper for professional analysts -- academics, journalists, consultants -- to access the data and, as they have always done, package it for public consumption. It makes me wonder to what extent the source of our data-backed insights will really change, i.e., how much more common will "I was playing around with data on this website and found out that . . . " become relative to "I heard about this study where they found that . . ."?

My hunch is that, just as blogging and internet news have democratized political commentary, the new data resources will make it possible for a new group of relatively uncertified people to become intermediaries for data analysis. (I think FiveThirtyEight is a good example in political polling, although since the site's editor is anonymous I can't be sure.) People will overwhelmingly continue to get data insights as packaged by intermediaries rather than through new interfaces to raw data, but the intermediaries (who will use these new services) will be quicker to use data in making their points, will become much larger in number, and will on average become less credentialed.

Posted by Andy Eggers at 9:48 AM

May 9, 2008

Adventures in Identification III: The Indiana Jones of Economics

A fabulous three-part series on further adventures in identification is up on the Freakonomics blog here, here, and here. The story features Kennedy School Professor Robert Jensen in his five-year-long quest to achieve rigorous identification of Giffen effects. After finding correlational evidence for Giffen goods in survey data, he and his co-author followed up by running an experiment in China and, guess what, they do find evidence of Giffen behavior. Impressive empirics and a funny read, enjoy!

Posted by Jens Hainmueller at 2:16 PM

May 8, 2008

Some Random Notes about the International Network Meeting

Last week we had an International Meeting on Methodology for Empirical Research on Social Interactions, Social Networks, and Health here at IQSS, thanks to the organizing efforts of Professor Charles Manski and Professor Nicholas Christakis. Some people told me that the second day of the meeting was much more "dynamic and interactive" than the first day and, based on what I saw, I believe it. I saw at least three cliques of speakers form spontaneously along disciplinary lines: statisticians, economists, and sociologists and political scientists. There were even sub-cliques and backfires! Fortunately, nobody was severely wounded. But anyway, it was a great intellectual exchange between disciplines. Below are some brief notes I took on the second day of the meeting, particularly in the last 20 minutes, when speakers talked about the future directions of network analysis in the social sciences. I am sorry that I forgot to jot down exactly who said what, and that I also squeezed some of my personal thoughts into the notes. I take full responsibility for all errors in the notes.

1. Need to combine game theory with social network analysis, particularly evolutionary game theory (and transaction costs theory).

2. Need to further develop social network analysis based on (random) graph theory, topology and random matrix theory.

3. Network studies tend to focus on network structure and topology as dependent variables, while the social sciences are more concerned with how network positions and features affect node-level problems. To put it simply, network studies tend to start from nodes and end at the network, while the social sciences take more of a top-down approach.

4. In either case, however, it is crucial to understand the data/tie generating mechanism. In particular, the formation of ties can go two ways: influence and selection. For example, two smokers can be friends either because one was influenced by a smoking friend to start smoking or because they were both smokers first and then became friends. As another example, a highly educated person is usually less likely to be nominated by others as a best friend. This could be either because the highly educated person is less trustworthy or incapable of maintaining friendship ties, or because he/she is more independent and less willing to associate with others. Longitudinal data may help solve the influence vs. selection issue.

5. Network analysis assumes that the probability of forming a tie is the same for any pair of nodes. So start with a meaningful set of nodes when building a network, so that each node has roughly the same probability of forming ties with every other.

6. How will severing an existing tie or forming a new tie affect the structure of a social network? How can ties bring more ties and lead to a polarized network? Nonlinear generating processes and dynamics in a network can lead to dramatic differences in network structure from tiny changes at the node level. How does network size affect network structure? (Think about the differences among monopolistic, oligopolistic, and perfectly competitive markets.)

7. How to define homophily between friends? One dimension vs. multiple dimensions? Supposing it is one dimension, there are still two approaches: 1) do a mean test between the tie senders and the tie receivers; 2) use the ratio of the number of ties whose connected nodes are in the same group that you defined (e.g., age +/- 5) to the total number of ties as an alternative measure. What else?

8. Need to think about how to incorporate network analysis into the traditional regression framework. We can either include network properties in regression models to study how networks affect personal/clique-level phenomena, or use regressions to evaluate how network properties are determined by socioeconomic variables.

9. How to deal with the dependence structure among node-level variables, since the errors are not i.i.d.? Is it enough to just use a correlation matrix to weight the standard errors and get robust SEs?

10. Need to combine network software with traditional statistical software. Statnet is getting there. But for Stata users, canned programs are needed to generate network data inside Stata.
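The two one-dimensional homophily measures from note 7 can be sketched in a few lines of base R. The node ages and the tie list below are hypothetical, and reading the "mean test" as a paired t-test is my assumption.

```r
# Hypothetical node attributes and directed ties (sender -> receiver).
nodes <- data.frame(id = 1:6, age = c(20, 22, 24, 35, 37, 50))
ties  <- data.frame(sender   = c(1, 1, 2, 3, 4, 5),
                    receiver = c(2, 4, 3, 6, 5, 6))

# Look up the age at each end of every tie.
age_sender   <- nodes$age[match(ties$sender, nodes$id)]
age_receiver <- nodes$age[match(ties$receiver, nodes$id)]

# Approach 1: mean test between tie senders and tie receivers.
t.test(age_sender, age_receiver, paired = TRUE)

# Approach 2: share of ties whose two ends fall in the same group,
# here defined as ages within 5 years of each other.
same_group      <- abs(age_sender - age_receiver) <= 5
homophily_ratio <- mean(same_group)
homophily_ratio
```

Because ties that share a node are not independent (the issue raised in note 9), the p-value from the t-test is best read as descriptive rather than as a formal test.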

Lastly, for those of you who are interested in causal analysis, read Patrick Doreian (2001), "Causality in Social Network Analysis" (Sociological Methods and Research 30: 81-114) and see if you can improve upon his study.

Posted by Weihua An at 10:46 AM

May 6, 2008

Tuesday: Tips & Tricks

I've been programming in R for four years now, and it seems that no matter how much I learn there are a million tiny ways that I could do it better. We all have our own programming styles and frequently used functions that may prove useful to others. I often find that a casual conversation with an office mate yields new approaches to a programming quandary. I'm speaking not of statistical insights, though those are important too, but rather the "simple" art of data manipulation and programming implementation--those essential tricks that help to improve coding efficiency. So, to that end I'm announcing the beginning of a bi-weekly "Tuesday Tips & Tricks" posting. These tips may include the description of a useful and perhaps obscure function, or the solutions to common coding problems. I'm selfishly hoping that if readers of this blog know of better or alternate approaches, they'll respond in the comment section. So I'm looking forward to reading your responses.

This week's tip: How to quickly summarize contents of an object.

Answer: summary(), str(), dput()

The primary option, of course, is the familiar summary() command. This command works well for viewing model output, but also for getting a quick sense of data frames, matrices and factors. For example, the summary of a data frame or matrix looks like this:

> summary(dat1)
     Hello           test      citynames
 Min.   :1.00   Min.   :-3   Length:2
 1st Qu.:1.25   1st Qu.:-2   Class :character
 Median :1.50   Median :-1   Mode  :character
 Mean   :1.50   Mean   :-1
 3rd Qu.:1.75   3rd Qu.: 0
 Max.   :2.00   Max.   : 1

This is an incredibly useful function for numeric data, but is less useful for string data. For character vectors the summary function only reveals the length, class, and mode of the variable. In this case, to get a quick look at the data, one might want to use str(). Officially str() "compactly displays the structure of an arbitrary R object", and in practice this is incredibly useful. So using the same dataframe as an example:

> str(dat1)
'data.frame': 2 obs. of 3 variables:
$ Hello : num 1 2
$ test : num -3 1
$ citynames: chr "Cambridge" "Rochester"

Here we have just a 2 x 3 data frame, where the first variable is Hello, it's a numeric variable, and its values are 1, 2. The character vector citynames is displayed much more usefully than under summary(). While this is a small example, the function works just as well for much larger data frames and matrices, where it displays only the first few values of each variable.

For smaller objects, the function dput() might also prove useful. This function shows the ASCII text representation of the R object and its characteristics. So for this same example:

> dput(dat1)
structure(list(Hello = c(1, 2), test = c(-3, 1), citynames = c("Cambridge",
"Rochester")), .Names = c("Hello", "test", "citynames"), row.names = c(NA,
-2L), class = "data.frame")

Posted by Eleanor Neff Powell at 4:41 PM

May 1, 2008

New NBER working paper by James Heckman: "Econometric Causality"

James Heckman has a new NBER working paper, "Econometric Causality", which some of you might find interesting. To give you a flavor, Heckman writes

"Unlike the Neyman–Rubin model, these [selection] models do not start with the experiment as an ideal but they start with well-posed, clearly articulated models for outcomes and treatment choice where the unobservables that underlie the selection and evaluation problem are made explicit. The hypothetical manipulations define the causal parameters of the model. Randomization is a metaphor and not an ideal or 'gold standard'." (page 37)

Heckman, J (2008) "Econometric Causality" NBER working paper #13934.

Abstract: This paper presents the econometric approach to causal modeling. It is motivated by policy problems. New causal parameters are defined and identified to address specific policy problems. Economists embrace a scientific approach to causality and model the preferences and choices of agents to infer subjective (agent) evaluations as well as objective outcomes. Anticipated and realized subjective and objective outcomes are distinguished. Models for simultaneous causality are developed. The paper contrasts the Neyman-Rubin model of causality with the econometric approach.

Posted by Sebastian Bauhoff at 10:00 AM

April 24, 2008

FAQs about Statistical Interactions

I am writing a short essay about the connection and distinction between indirect effects and interaction effects for a methodology class, and I found the following website very helpful in clarifying some of the FAQs on that subject. The website is maintained by Professor Regina Branton at the Department of Political Science of Rice University.

Also check out the mediation item at Wikipedia and its great references.

Posted by Weihua An at 11:35 AM

April 16, 2008

JAMA article on ghostwriting medical studies

The Journal of the American Medical Association published a piece today on ghostwriting of medical research. Thanks to the Vioxx lawsuits, the authors say that they found documents "describing Merck employees working either independently or in collaboration with medical publishing companies to prepare manuscripts and subsequently recruiting external, academically affiliated investigators to be authors. Recruited authors were frequently placed in the first and second positions of the authorship list." One of the exhibits uses a placeholder "External author?" for the expert to be named. Obviously the idea that a pharmaceutical company is pre-writing clinical studies is as controversial as doctors possibly signing off on them without really being involved. A NYT article has some comments, and Merck has released a press statement.

Ross, J et al (2008) "Guest Authorship and Ghostwriting in Publications Related to Rofecoxib. A Case Study of Industry Documents From Rofecoxib Litigation" JAMA 299(15):1800-1812.

Posted by Sebastian Bauhoff at 10:54 PM

April 15, 2008

Google Charts from R: Maps

A few weeks ago I wrote a post sharing some code I wrote to generate sharp-looking PNG scatterplots from R using the Google Chart API. I think there are some nice uses of that (for example, as suggested by a commenter, to send a quick plot over IM), but here's something that I think could be much more useful: maps from R using Google Charts.

So, suppose you have data on the proportion of people who say "pop" (as opposed to "soda" or "coke") in each US state. (I got this data from Many-Eyes.) Once you get my code, you enter a command like this in R

googlemap(x = pct_who_say_pop, codes = state_codes, location = "usa", file ="pop.png")

and this image is saved locally as "pop.png":

To use this, first get the code via
which loads in a function named googlemap, to which you pass

  • x: a vector of data

  • codes: a vector of state/country codes (see the list of standard state and country codes),

  • and location a region of the world ("africa", "asia", "europe", "middle_east", "south_america", "usa") or the whole world ("world")

and you get back a url that you can embed in html as I did above, send over IM, etc. If you pass a file argument, as I did above, you can save the PNG locally.

For optional parameters to affect the scale of the figure and its colors, see the source.

Another quick example:

Suppose you wanted to make a little plot of Germany's colonial possessions in Africa. This code

googlemap(x = c(1,1,1,1), location = "africa", codes = c("CM", "TZ", "NA", "TG"),file = "germans_in_africa.png")

returns this url

" . . . etc.

and saves this PNG on your hard drive:

The scatterplot thing before was something of a novelty, but I think this mapping functionality could actually be useful for generating quick maps in R, since the existing approaches are pretty annoying in my (limited) experience. The Google Charts API is not very flexible about labels and whatnot, so you probably won't be publishing any of these figures. But I expect this will serve very well for quick exploratory stuff, and I hope others do too.

I'd love it if someone wanted to help roll this into a proper R package . . . .

Posted by Andy Eggers at 3:01 PM

April 10, 2008

How Are Network Graphs Generated?

When Professor Nicholas Christakis came by to give a talk on social networks and health two weeks ago, a commentator expressed concern about the sparseness of the information contained in network graphs (not specifically regarding Nicholas’ research, which I believe was well done). I share that concern. So afterwards I did a preliminary search of the literature on the visualization of network data and found several interesting pieces that may help clarify (or even exacerbate) part of the concern some of us have about network graphs.

The first is the lecture notes Professor Peter V. Marsden wrote about visualization of network graphs in soc275. Here I just want to highlight a few points in his notes. (Words in quotes are taken from Professor Marsden’s lecture notes.)

1) Network graphs can be “referenced to known geographical/spatial/social locations of points”.

2) Aesthetic criteria are used to generate network graphs, for example, to minimize crossing lines, to make lines shorter, … and “[to] construct plot such that close vertices are connected, positively connected, strongly connected, or connected via short geodesics”.

3) “Location of points reflects ‘social distances’”. … “Spatial configuration differs depending on what 'distance-generating mechanism' is assumed and built in to one’s data.”

4) Some often-used network graph generating algorithms include factor analysis, multidimensional scaling (MDS) and spring embedders, etc.

So the configuration of a network graph seems to depend to a large degree on the researcher’s theoretical interests, and it can change according to the network measures (whether the number of clusters within the network, overall network connectedness, etc.) that the researcher is most interested in. In other words, before generating any network graphs, researchers have to be clear about what theoretical themes they aim to present through the graphs, and then select the corresponding network measures and generating algorithms. For those of you who want to follow up on this topic, there are several pieces recommended by Professor Marsden in his lecture notes that I think are good starting references. See below for more details.

1. Bartholomew, David J., Fiona Steele, Irini Moustaki, and Jane I. Galbraith. 2002. The Analysis and Interpretation of Multivariate Data for Social Scientists. London: Chapman and Hall/CRC. Chapters 3 and 4.

2. Freeman, Linton C. 2005. “Graphic Techniques for Exploring Social Network Data.” Chapter 12 in Carrington, Peter J., John Scott, and Stanley Wasserman. 2005. Models and Methods in Social Network Analysis. New York: Cambridge University Press.

3. Freeman, Linton C. 2000. “Visualizing Social Networks.” Journal of Social Structure 1. (Electronically available at

Posted by Weihua An at 11:51 AM

April 7, 2008

A Case Against Evidence Based Medicine?


Seb just sent this very amusing paper (which he found in a comment to a post on Andrew Gelman's blog):

Objectives: To determine whether parachutes are effective in preventing major trauma related to gravitational challenge. Design: Systematic review of randomised controlled trials. Data sources: Medline, Web of Science, Embase, and the Cochrane Library databases; appropriate internet sites and citation lists. Study selection: Studies showing the effects of using a parachute during free fall. Main outcome measure: Death or major trauma, defined as an injury severity score > 15. Results: We were unable to identify any randomised controlled trials of parachute intervention. Conclusions: As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute.

Funny how such a lampoon can trigger a flame war on the BMJ website. Makes me understand why Gary writes about Misunderstandings between experimentalists and observationalists about causal inference.

Posted by Jens Hainmueller at 7:16 PM

April 5, 2008

Political Economy Students Conference

Dear students and colleagues,

We would like to invite you to attend the Political Economy Student Conference, to be held on April 17th in the NBER premises, in Cambridge, MA. The conference is an opportunity for students interested in political economy and other related fields to get together and discuss the open issues in the field, know what other people are working on, and share ideas. The program of the conference can be found at:

This year, some members of the NBER Political Economy Group will be joining us for the conference. We are sure that we will greatly benefit from their comments and suggestions during the discussions.

We hope that those of you interested will attend the conference. The success of the conference largely depends on students' attendance and participation. Given that we have limited seats for the conference, please e-mail leopoldo (at) mit (dot) edu as soon as possible if you are interested in attending so that we can secure a spot for you.

Best regards,

Leopoldo Fergusson
Marcello Miccoli
Pablo Querubin

Posted by Jens Hainmueller at 5:04 PM

April 4, 2008

Predicting Pennsylvania

Here are the results of the Pennsylvania Democratic primary, with Obama counties in purple and Clinton counties in orange.


What, you say? The Pennsylvania primary hasn't happened yet? You're right. Enter statistics!

Consider this scatterplot of Kerry's 2004 vote share versus Obama's 2008 vote share in Ohio counties. The result is something I call the Kerry-Obama smile: Obama does well in Kerry's best counties, where staunchly Democratic urban blacks are concentrated; and in Kerry's worst regions, presumably due to Obama's appeal to crossover Republicans. Clinton does best in the wide middle swath.


This motivates a very simple modeling idea: fit a curve to the scatterplot. A quadratic in Kerry's share looks like a decent fit, and it gives us the best-fit curve shown on the plot. The R-squared is 0.16, representing an okay fit.

The next step is utterly useless, but utterly fun. We can use Ohio to predict Pennsylvania. In other words, given that we know how Kerry did in Pennsylvania counties in 2004, we can predict how well Obama will do in 2008 in every Pennsylvania county. Note that I first tweaked the model's intercept slightly in Obama's favor, so that the aggregate prediction matches the current polling average (showing Clinton up by 6.6%).
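The two modeling steps described above, fitting a quadratic and then nudging its intercept so the aggregate prediction hits a polling target, can be sketched as follows. This is a rough illustration rather than the code behind the post: the helper names (`fit_quadratic`, `predict`, `shift_intercept`), the county numbers and the turnout weights are all invented, and the target of 0.467 simply assumes that a 6.6-point deficit in a two-way race with no undecideds means an Obama share of about 46.7%.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the normal equations."""
    # Power sums of x, and cross sums of y with powers of x.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system [X'X | X'y].
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def predict(coefs, x):
    a, b, c = coefs
    return a + b * x + c * x * x


def shift_intercept(coefs, xs, weights, target):
    """Shift the intercept so the weighted mean prediction equals `target`."""
    total = sum(weights)
    current = sum(w * predict(coefs, x) for x, w in zip(xs, weights)) / total
    a, b, c = coefs
    return [a + (target - current), b, c]


# Made-up "Ohio" counties: Kerry 2004 share and Obama 2008 primary share.
ohio_kerry = [0.25, 0.35, 0.45, 0.55, 0.65, 0.80]
ohio_obama = [0.48, 0.40, 0.36, 0.37, 0.45, 0.66]
coefs = fit_quadratic(ohio_kerry, ohio_obama)

# Made-up "Pennsylvania" counties: Kerry 2004 share and assumed turnout.
pa_kerry = [0.30, 0.45, 0.50, 0.80]
pa_turnout = [20000, 50000, 40000, 90000]
target = 0.467  # hypothetical statewide Obama share
adjusted = shift_intercept(coefs, pa_kerry, pa_turnout, target)

for share in pa_kerry:
    print(f"Kerry {share:.2f} -> predicted Obama {predict(adjusted, share):.3f}")
```

Shifting only the intercept keeps the shape of the smile learned from Ohio and moves the whole curve up or down, which is why counties in the middle of the smile stay the weakest for Obama no matter what the polling average is.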

The bad news for Obama is that nearly all of Pennsylvania's counties fall in the middle of the smile. The image below compares Kerry in 2004 to the model's predictions for Obama in 2008. Obama is predicted to carry Philadelphia overwhelmingly, and to do well in some of the curvy, heavily Republican counties in the south-center of the state. Everywhere else, though, is Clinton country.


Posted by Kevin Bartz at 1:15 PM

April 3, 2008

A born-again frequentist?

It's a day or so past April 1, but if you haven't seen this post [Edit: link fixed] over at Andrew Gelman's blog, it is worth a look. It's about as good an apologia from a "born-again frequentist" as you are likely to find. An excerpt:

I like unbiased estimates and I like confidence intervals that really have their advertised confidence coverage. I know that these aren't always going to be possible, but I think the right way forward is to get as close to these goals as possible and to develop robust methods that work with minimal assumptions. The Bayesian approach--to give up even trying to approximate unbiasedness and to instead rely on stronger and stronger assumptions--that seems like the wrong way to go.

Fortunately, Gelman's conversion experience appears to have ended after about a day...

Posted by Mike Kellermann at 12:09 AM

March 28, 2008

Visualizing Data with Processing

A friend just referred me to Processing, a powerful language for visualizing data:

Processing is an open source programming language and environment for people who want to program images, animation, and interactions. It is used by students, artists, designers, researchers, and hobbyists for learning, prototyping, and production. It is created to teach fundamentals of computer programming within a visual context and to serve as a software sketchbook and professional production tool. Processing is developed by artists and designers as an alternative to proprietary software tools in the same domain.

Their exhibition shows some very impressive results. For example, I liked the visualization of the London Tube map by travel time. I lived in Russell Square once, so this evoked pleasant memories:
If you can spare a minute also take a look at the other exhibited pieces. Most are art rather than statistics. For chess friends I especially recommend the piece called "Thinking Machine 4" by Martin Wattenberg, who gave a talk at the IQSS applied stats workshop in the fall. Enjoy!


Posted by Jens Hainmueller at 7:43 AM

March 27, 2008

How Did 0.05 Come to Rule?

Recently I read an article by Erin Leahey on how statistical significance testing, the 0.05 cut-off value and the three-star system became legitimized and dominant in mainstream sociology. According to Erin, one star stands for p<=.05, two stars for p<=.01 and three stars for p<=.001, though in my experience the cut-offs are often something like .10, .05 and .01 instead. Anyway, Erin attributes the first use of the .05 significance level to R. A. Fisher’s book, Design of Experiments, in 1935. She notes that forms of significance testing other than the .05 test were already very popular in the 1930s, when close to 40 percent of articles published in ASR and AJS applied one form or another of significance testing. Based on the articles she sampled from ASR and AJS, Erin shows that the popularity of statistical significance testing and of the 0.05 cut-off value roughly took an “S” shape: usage first rose from the 1930s to 1950, declined until 1970, and has revived since then. Currently, around 80 percent of articles published in ASR and AJS employ both practices. The three-star system emerged in the 1950s but became popular only after 1970. Now slightly more than 40 percent of articles published in these two top sociology journals use it.

So what accounts for the diffusion of such practices? Erin offers several arguments. For example, she argues that institutional factors such as investment in research and computing, graduate training, an institution’s academic status, and journal editors’ individual preferences could be among the most important factors in the diffusion of these practices. Interestingly, she found that graduating from Harvard had a significant negative “effect” on adopting these statistical practices. :-)

Of course, as with almost all research, Erin’s study has some minor drawbacks. For example, her sample is drawn only from the top two sociology journals, so the generalizability of her findings could be limited. But overall, it is a fun read. And if you are interested in a more historical account of how statistical practices were introduced to and became legitimized in the social sciences in general, Camic and Xie (1994) is a very good start.

Leahey, Erin. 2005. Alphas and Asterisks: the Development of Statistical Significance Testing Standards in Sociology. Social Forces 84: 1-24.
Camic, Charles, and Yu Xie. 1994. “The Statistical Turn in American Social Science: Columbia University, 1890-1915.” American Sociological Review 59:773-805.

Posted by Weihua An at 11:57 AM

March 26, 2008

The Guardian features Andy and Jens' research on returns to office

A joint project by Andy Eggers and Jens Hainmueller, two long-time contributors to this blog, is the basis of a piece in The Guardian this Monday. Check out the article "How election paid off for postwar Tory MPs" and the paper "MPs For Sale? Estimating Returns to Office in Post-War British Politics". Congrats to Andy and Jens!

Posted by Sebastian Bauhoff at 4:44 PM

March 20, 2008

Correlation of Ratios or Difference Scores Having Common Terms

Yesterday I went to Professor Stanley Lieberson’s class, Issue in the Interpretation of Empirical Evidence. We discussed a paper, written by Stan and Glenn Fuguitt, titled Correlation of Ratios or Difference Scores Having Common Terms. The basic argument of the paper is that although ratios and difference scores are often used as dependent variables in traditional regression analysis, if some independent variables share a common term with those dependent variables, the estimated coefficients can be severely biased by the spurious correlation that the common term induces (whether it is in the denominator or the numerator). For example, if the dependent variable has the form X/Z while the independent variables are something like Y/Z, Z, or Z/X, the estimated coefficients can become statistically significant purely as an artifact of the shared term, even when X, Y and Z are unrelated.

For some concrete examples, criminologists often use the crime rate (adjusted by city population size) as the dependent variable while at the same time using city population size as an independent variable; organizational researchers are interested in the relationship between the relative size of an organization’s administration and the absolute size of the organization; and economists often regress GDP per capita on such variables as the population growth rate, and/or even population size. According to Stan and Fuguitt’s research, all of the above examples can produce spurious coefficients, since the dependent variable and the independent variable include common terms. In their paper, they trace this finding back to a paper written by Karl Pearson in 1897, in which Pearson showed rigorously where the spurious correlation comes from and gave an approximate formula for computing correlations of ratios.

We were asked to do an experiment to demonstrate this spurious correlation. We generated three sets of random integers (X, Y and Z) ranging from 1 to 99, computed the pairwise correlation matrix among them, and found no significant correlation between any pair of variables. But we found a significant correlation between Y/X and X, and when we regressed Y/X on X, the coefficient was significant too. So through manipulations like division or subtraction, we artificially built a significant correlation between two random variables that were originally uncorrelated.

Why not try the following in Stata to see if the above claims are overstated or not?

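* Draw 50 observations of three independent random integers between 1 and 99
set obs 50
gen x=int(99*uniform()+1)
gen y=int(99*uniform()+1)
gen z=int(99*uniform()+1)

* The raw variables should show no significant pairwise correlations
pwcorr x y z, sig

* A ratio against its own denominator: Y/X vs. X
gen ydx = y/x
pwcorr x ydx, sig
reg ydx x

* Two ratios sharing a denominator: X/Z vs. Y/Z
gen xdz = x/z
gen ydz = y/z
pwcorr xdz ydz, sig
reg xdz ydz

* Common term in one denominator and the other numerator: X/Z vs. Z/Y
gen zdy = z/y
pwcorr xdz zdy, sig
reg xdz zdy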
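For readers without Stata, here is a rough Python analogue of the same experiment, using only the standard library. It is a sketch under a few assumptions of mine: the helper names `pearson` and `simulate` are invented, a fixed seed is used so runs are repeatable, and it reports plain Pearson correlations rather than the p-values and regression tables Stata prints. As a rule of thumb, with 50 observations a correlation beyond roughly +/- 0.28 is significant at the 5% level.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(n=50, seed=0):
    """Draw independent integers in 1..99 and correlate ratios built from them."""
    rng = random.Random(seed)
    x = [rng.randint(1, 99) for _ in range(n)]
    y = [rng.randint(1, 99) for _ in range(n)]
    z = [rng.randint(1, 99) for _ in range(n)]
    ydx = [b / a for a, b in zip(x, y)]  # Y/X
    xdz = [a / c for a, c in zip(x, z)]  # X/Z
    ydz = [b / c for b, c in zip(y, z)]  # Y/Z
    zdy = [c / b for b, c in zip(y, z)]  # Z/Y
    return {
        "x_y": pearson(x, y),
        "x_z": pearson(x, z),
        "y_z": pearson(y, z),
        "ydx_x": pearson(ydx, x),
        "xdz_ydz": pearson(xdz, ydz),
        "xdz_zdy": pearson(xdz, zdy),
    }


for name, value in simulate().items():
    print(f"{name:8s} {value:+.3f}")
```

Raising `n` makes the pattern easier to see: the correlations among the raw variables shrink toward zero, while those between Y/X and X and between X/Z and Y/Z do not.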

Are you convinced by now? If not, please go read the source paper below (or just write back and say what is wrong with Stan and Fuguitt’s argument). If yes, the question now becomes what should we do with the spurious correlation. Shall we just use the original forms of variables? Shall we re-specify the Solow model? But what if our research interest is about ratio or difference? … …

Stanley Lieberson and Glenn Fuguitt, 1974. Correlation of Ratios or Difference Scores Having Common Terms, in Sociological Methodology (1973-1974), edited by Herbert Costner, San Francisco: Jossey-Bass Publishers.

Posted by Weihua An at 11:17 AM

March 18, 2008

Games That Produce Data

In a conversation with Kevin Quinn this week I was reminded of a fascinating lecture given at Google in 2006 by Luis von Ahn, an assistant professor in computer science at Carnegie Mellon. Von Ahn gives a very entertaining and thought-provoking talk on ingenious ways to apply human intelligence and judgment on a large scale to fairly small problems that computers still struggle with.

(Or watch video on Google video.)

Von Ahn devises games that produce data, the best-known example being the ESP Game, which Google acquired and developed as Google Image Labeler. In the game, you are paired with another (anonymous) player and shown an image. Each of you feverishly types in words describing the image (eg, "Spitzer", "politician", "scandal", "prostitution"); you get points and move to the next image when you and your partner agree on a label. The game is fun, even addictive, and of course Google gets a big, free payoff -- a set of validated keywords for each image.

I'm curious about how these approaches can be applied to coding problems in social science. A lot of recent interesting work has involved developing machine learning techniques to teach computers to label text, but there are clearly cases where language is just too subtle and complex to accurately extract meaning, and we need real people to read the text and make judgments. Mostly we hire RAs or do it ourselves; could we devise games instead?

Posted by Andy Eggers at 9:37 AM

March 11, 2008

What is P(Obama beats McCain)?

While the Democratic nomination contest drags on (and on and on...; Tom Hanks declared himself bored with the race last week), attention is turning to hypothetical general election matchups between Hillary Clinton or Barack Obama and John McCain. Mystery Pollster has a post up reporting on state-by-state hypothetical matchup numbers obtained from surveys of 600 registered voters in each state conducted by Survey USA. There is some debate about the quality of the data (Survey USA uses Interactive Voice Response to conduct its surveys, there is no likely voter screen, etc.). But we have what we have.

At this point, the results are primarily of interest to the extent that they speak to the "electability" question on the Democratic side; who is more likely to beat McCain? MP goes through the results state by state, classifying each state into Strong McCain, Lean McCain, Toss-up, etc. From this you can calculate the number of electoral votes in each category, which provides some information but isn't exactly what we're interested in.

This problem is a natural one for the application of some simple, naive Bayesian ideas. If we throw on some flat priors, make all sorts of unreasonably strong independence assumptions, and assume that the results were derived from simple random sampling, we can quickly get posterior distributions for the support for each candidate in each state and can calculate estimates of the probability of victory. From there, it is easy to calculate the posterior distribution of the number of electoral votes for each candidate and find posterior probabilities that Obama beats McCain, Clinton beats McCain, or the probability that Obama would receive more electoral votes than Clinton.

While I was sitting around at lunch yesterday, I ran a very quick analysis using the reported SurveyUSA marginals. Essentially, I took samples from 50 independent Dirichlet posteriors for both hypothetical matchups, assuming a flat prior and multinomial sampling density (to allow for undecideds); to avoid dealing with the posterior predictive distributions, I'm just going to assume that all registered voters will vote so I can just compare posterior proportions. When you run this, you obtain estimates (conditional on the data and, most importantly, the model) that the probability of an Obama victory over McCain is about 88% and the probability of a Clinton victory is about 72%. There is a roughly 70% posterior probability that Obama would win more electoral votes than Clinton.
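The procedure described above can be sketched in a few lines of standard-library Python. This is an illustration of the idea only, not the analysis behind the numbers in this post: the function names `dirichlet_draw` and `win_probability` are invented, the three "states", their electoral votes and their poll counts are made up, and a win is defined here as taking more than half of the electoral votes in the toy example. Each state gets an independent Dirichlet posterior (flat prior plus multinomial counts for candidate A, candidate B and undecided), and the probability of victory is the share of posterior draws in which candidate A wins.

```python
import random


def dirichlet_draw(rng, alphas):
    """One draw from a Dirichlet distribution, via normalized gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def win_probability(polls, draws=2000, seed=0):
    """Posterior probability that candidate A wins a majority of electoral votes.

    `polls` maps state -> (electoral_votes, (count_A, count_B, count_undecided)).
    """
    rng = random.Random(seed)
    total_ev = sum(ev for ev, _ in polls.values())
    wins = 0
    for _ in range(draws):
        ev_a = 0
        for ev, counts in polls.values():
            # Flat Dirichlet(1, 1, 1) prior + multinomial counts -> Dirichlet(counts + 1).
            share_a, share_b, _undecided = dirichlet_draw(rng, [c + 1 for c in counts])
            if share_a > share_b:
                ev_a += ev
        if ev_a > total_ev / 2:
            wins += 1
    return wins / draws


# Made-up polls of 600 respondents in three hypothetical states.
toy_polls = {
    "State 1": (10, (290, 270, 40)),
    "State 2": (7, (265, 295, 40)),
    "State 3": (4, (280, 278, 42)),
}
p = win_probability(toy_polls)
print(f"P(candidate A wins) = {p:.2f}")
```

Because every state is drawn independently, this sketch shares the unreasonably strong independence assumption acknowledged above; correlated polling errors across states would presumably widen the distribution of electoral votes.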

As I mentioned, this is an extremely naive Bayesian approach. There are a lot of ways that one could make the model better: adding additional sources of uncertainty, allowing for correlations between the states, using historical information to inform priors, and imposing a hierarchical structure to shrink outlying estimates toward the grand mean. One place to start would be by modeling the pairs of responses to the two hypothetical matchup questions. Any of these things, however, is going to be much easier to do in a Bayesian framework, since calculating posterior distributions of functions of the model parameters is extremely easy.

Posted by Mike Kellermann at 11:17 AM

March 5, 2008

"Early Thoughts on the Autism Epidemic"

The dramatic increase in cases of autism in children over the past few years has been in the news again in recent days. Most notably, presumptive Republican presidential nominee John McCain said at a recent stop, "there’s strong evidence that indicates that it’s got to do with a preservative in vaccines." Which would be fine if such strong evidence existed; unfortunately, that is a mischaracterization of the current state of the literature to say the least. McCain has since backed away from his initial comments (see this article in yesterday's New York Times), but the debate prompted by his comments will undoubtedly continue.

By coincidence, the Robert Wood Johnson program at Harvard is sponsoring a talk tomorrow on this topic. Professor Peter Bearman (chair of the Statistics Department at Columbia) will be speaking on "Early Thoughts on the Autism Epidemic." Professor Bearman is currently leading a project on the social determinants of autism. The talk is in N262 on the second floor of the Knafel Building at CGIS from 11:00 to 12:30.

Posted by Mike Kellermann at 2:56 PM

February 23, 2008

Publication Bias in Drug Trials

A study published in the New England Journal of Medicine last month showed that widely-prescribed antidepressants may not be as effective as the published research indicates. After reading about the study in the NYT, I recently read the article and was struck by how well the authors were able to document the somewhat elusive phenomenon of publication bias.

Researchers in most fields can document publication bias only by pointing out patterns in published results. A jump in the density of t-stats around 2 is one strong sign that null reports are not being published; an inverse relationship between average reported effect size and sample size in studies of the same phenomenon is another strong sign (because the only small studies that could be published are the ones with large estimated effects). These meta-analysis procedures are clever because they infer something about unpublished studies from what we see in published studies.

As the NEJM article makes clear, publication bias is more directly observable in drug trials because we have very good information about unpublished trials. When a pharmaceutical company initiates clinical trials for a new drug, the studies are registered with the FDA; in order to get FDA approval to bring the drug to market, the company must submit the results of all of those trials (including the raw data) for FDA review. All trials conducted on a particular drug are therefore reviewed by the FDA, but a subset of those trials are published in medical journals.

The NEJM article uses this information to determine which antidepressant trials made it into the journals:

Among 74 FDA-registered studies, 31%, accounting for 3449 study participants, were not published. Whether and how the studies were published were associated with the study outcome. A total of 37 studies viewed by the FDA as having positive results were published; 1 study viewed as positive was not published. Studies viewed by the FDA as having negative or questionable results were, with 3 exceptions, either not published (22 studies) or published in a way that, in our opinion, conveyed a positive outcome (11 studies). According to the published literature, it appeared that 94% of the trials conducted were positive. By contrast, the FDA analysis showed that 51% were positive. Separate meta-analyses of the FDA and journal data sets showed that the increase in effect size ranged from 11 to 69% for individual drugs and was 32% overall.
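The percentages in the abstract can be recovered from the study counts it reports. Here is a quick check, using only the numbers quoted above (the variable names are mine):

```python
# Study counts as quoted in the NEJM abstract
positive_published = 37
positive_unpublished = 1
negative_published_as_negative = 3   # the "3 exceptions"
negative_unpublished = 22
negative_published_as_positive = 11

total = (positive_published + positive_unpublished + negative_published_as_negative
         + negative_unpublished + negative_published_as_positive)
unpublished = positive_unpublished + negative_unpublished
published = total - unpublished
looks_positive_in_print = positive_published + negative_published_as_positive
fda_positive = positive_published + positive_unpublished

print("total studies:", total)
print("share unpublished: %.0f%%" % (100 * unpublished / total))
print("share positive in journals: %.0f%%" % (100 * looks_positive_in_print / published))
print("share positive per FDA: %.0f%%" % (100 * fda_positive / total))
```

The counts add up to the 74 registered studies, and the three shares come out at 31%, 94% and 51%, matching the figures in the abstract.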

One complaint -- I thought it was too bad that the authors did not determine whether the 22 studies that were "negative or questionable" and went unpublished were not submitted ("the file drawer problem") or rejected by the journals. But otherwise very thorough and interesting.

Posted by Andy Eggers at 2:05 AM

February 22, 2008

Bus Accidents as Random Health Shocks

A major item of interest in applied health economics is to understand the impact of health shocks on household income, investments and consumption. This relation is particularly important in developing countries that don’t have programs like universal health insurance or social insurance like Medicaid. Alas, it’s also a major challenge to establish causal effects and mechanisms through which the shocks might operate. A main culprit is endogeneity, since health affects wealth and vice versa. As a result, there is a huge and truly inter-disciplinary literature on the topic, much of it with suspicious identification strategies.

The main struggle is to find a plausibly exogenous exposure to health shocks that has real-life relevance. A new paper by Manoj Mohanan takes this challenge seriously and looks at the effect of health shocks from bus accidents on household consumption, and examines what mechanisms households rely on to smooth consumption. (Full disclosure: Manoj is a classmate of mine, and I really like his work!)

To address the endogeneity problem, the paper focuses on people who have been in bus accidents as recorded by the state-run bus company in Karnataka, India. Clearly, finding a good control group is critical: people who travel on public buses may be different from those who don’t. For starters, they actually took the risk of getting on a bus – if you have ever been on the road in a developing country you’ll know what this means. Manoj’s approach is to select unexposed individuals among travelers on the same bus route, after matching on age, sex and geographic area of residence. Hence, conditional on these factors, the bus accident can be treated as exogenous.
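To make the matching idea concrete, here is a minimal sketch of exact matching on route, age, sex and area. It is not the paper's actual procedure: the field names, the age bands, and the one-to-one "first available" rule are all assumptions made for illustration, and the records are invented.

```python
from collections import defaultdict

MATCH_KEYS = ("route", "age_band", "sex", "area")  # hypothetical field names

def match_controls(exposed, candidates, keys=MATCH_KEYS):
    """Pair each exposed traveler with one unexposed traveler sharing all keys."""
    pool = defaultdict(list)
    for person in candidates:
        pool[tuple(person[k] for k in keys)].append(person)
    matches = {}
    for person in exposed:
        bucket = pool.get(tuple(person[k] for k in keys), [])
        # Take the first unused candidate in the same cell, if there is one
        matches[person["id"]] = bucket.pop(0)["id"] if bucket else None
    return matches

# Invented records, for illustration only
exposed = [
    {"id": "E1", "route": 12, "age_band": "30-39", "sex": "F", "area": "A"},
    {"id": "E2", "route": 7, "age_band": "50-59", "sex": "M", "area": "B"},
]
candidates = [
    {"id": "C1", "route": 12, "age_band": "30-39", "sex": "F", "area": "A"},
    {"id": "C2", "route": 12, "age_band": "30-39", "sex": "M", "area": "A"},
]

matches = match_controls(exposed, candidates)
print(matches)
```

Exposed travelers with no candidate in the same cell (E2 here) stay unmatched, which is the usual price of exact matching: the comparison is cleaner, but only for the people who have a counterpart.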

He then compares the two groups on various dimensions. He finds that households reduce educational and festival spending by a large amount, but appear to be able to smooth food and housing consumption. He is unable to find effects on assets or labor supply. The principal coping mechanism is debt accumulation. Overall this suggests that not all is well: debt traps aside, reducing investments in education could be very costly in the long run (on this point see also Chetty and Looney, 2006).

* Chetty, R. and Looney, A. (2006) "Consumption smoothing and the welfare consequences of social insurance in developing economies," Journal of Public Economics, 90: 2351-2356.

Posted by Sebastian Bauhoff at 10:00 AM

February 2, 2008

Conference on "New Technologies and Survey Research"

This year's Spring Conference of the Harvard Program on Survey Research is on "New Technologies and Survey Research." It will be held on May 9, 2008, from 9:00 am to 5:00 pm at IQSS, and is open to the public.

See here for details.

Posted by Sebastian Bauhoff at 9:54 AM

February 1, 2008

useR! 2008 in Dortmund

Abstracts are now being accepted for the 2008 useR! conference in Dortmund, Germany. This conference is designed to bring R users and developers together to trade ideas and find out what is new in the sprawling world of R. Several of us went to the Vienna conference a few years ago, and found it very useful. Previous editions have had a good mix of academic and private sector participants, and I learned more than I have at some of the more traditional academic conferences. The announcement from the useR webpage is below; the website is at

useR! 2008, the R user conference, takes place at the Fakultät Statistik, Technische Universität Dortmund, Germany from 2008-08-12 to 2008-08-14. Pre-conference tutorials will take place on August 11.

The conference is organized by the Fakultät Statistik, Technische Universität Dortmund and the Austrian Association for Statistical Computing (AASC). It is funded by the R Foundation for Statistical Computing.

Following the successful useR! 2004, useR! 2006, and useR! 2007 conferences, the conference is focused on

  1. R as the 'lingua franca' of data analysis and statistical computing,
  2. providing a platform for R users to discuss and exchange ideas how R can be used to do statistical computations, data analysis, visualization and exciting applications in various fields,
  3. giving an overview of the new features of the rapidly evolving R project.

As for the predecessor conference, the program consists of two parts:

  1. invited lectures discussing new R developments and exciting applications of R,
  2. user-contributed presentations reflecting the wide range of fields in which R is used to analyze data.

A major goal of the useR! conference is to bring users from various fields together and provide a platform for discussion and exchange of ideas: both in the formal framework of presentations as well as in the informal part of the conference in Dortmund's famous beer pubs and restaurants.

Prior to the conference, on 2008-08-11, there are tutorials offered at the conference site. Each tutorial has a length of 3 hours and takes place either in the morning or afternoon.

Call for Papers
We invite all R users to submit abstracts presenting innovations or exciting applications of R on topics such as:

Applied Statistics & Biostatistics
Bayesian Statistics
Chemometrics and Computational Physics
Data Mining
Econometrics & Finance
Environmetrics & Ecological Modeling
High Performance Computing
Machine Learning
Marketing & Business Analytics
Robust Statistics
Spatial Statistics
Statistics in the Social and Political Sciences
Visualization & Graphics
and many more.

We recommend a length of about one page in pdf format. The program committee will decide on the presentation format. There will be no proceedings volume, but the abstracts will be available in an online collection linked from the conference program and in a single pdf file.

Deadline for submission of abstracts: 2008-03-31.

Posted by Mike Kellermann at 11:55 AM

January 4, 2008

Call for Papers: Conference at Harvard on Networks in Political Science

James Fowler sent the following message to the Polmeth list, regarding a conference that we will apparently be hosting in June that may be of interest:

The study of networks has exploded over the last decade, both in the social and hard sciences. From sociology to biology, there has been a paradigm shift from a focus on the units of the system to the relationships among those units. Despite a tradition incorporating network ideas dating back at least 70 years, political science has been largely left out of this recent creative surge. This has begun to change, as witnessed, for example, by an exponential increase in network-related research presented at the major disciplinary conferences.

We therefore announce an open call for paper proposals for presentation at a conference on "Networks in Political Science" (NIPS), aimed at _all_ of the subdisciplines of political science. NIPS is supported by the National Science Foundation, and sponsored by the Program on Networked Governance at Harvard University.

The conference will take place June 13-14. Preceding the conference will be a series of workshops introducing existing substantive areas of research, statistical methods (and software packages) for dealing with the distinctive dependencies of network data, and network visualization. There will be a $50 conference fee. Limited funding will be available to defray the costs of attendance for doctoral students and recent (post 2005) PhDs. Funding may be available for graduate students not presenting papers, but preference will be given to students using network analysis in their dissertations. Women and minorities are especially encouraged to apply.

The deadline for submitting a paper proposal is March 1, 2008. Proposals should include a title and a one-paragraph abstract. Graduate students and recent Ph.D.'s applying for funding should also include their CV, a letter of support from their advisor, and a brief statement about their intended use of network analysis. Send them to The final program will be available at

Posted by Mike Kellermann at 5:18 PM

December 11, 2007

Coding Analysis Toolkit looking for beta testers

A recent message to the Polmeth mailing list announced that a research group at the University of Pittsburgh is looking for beta testers for some new coding reliability software that they have developed:

The Coding Analysis Toolkit (or “CAT”) was developed in the summer of 2007. The system consists of a web-based suite of tools custom built from the ground-up to facilitate efficient and effective analysis of text datasets that have been coded using the commercial-off-the-shelf package ATLAS.ti. We have recently posted a narrated slide show about CAT and a tutorial online. The Coding Analysis Toolkit was designed to use keystrokes and automation to clarify and speed-up the validation or consensus adjudication process. Special attention was paid during the design process to the need to eliminate the role of the computer mouse, thereby streamlining the physical and mental tasks in the coding analysis process. We anticipate that CAT will open new avenues for researchers interested in measuring and accurately reporting coder validity and reliability, as well as for those practicing consensus-based adjudication. The availability of CAT can improve the practice of qualitative data analysis at the University of Pittsburgh and beyond.

More information is available at this website: This is far from my area of expertise, but it looks like it might be useful for some projects...

Posted by Mike Kellermann at 6:00 PM

December 5, 2007

Holiday Gifts for the Data-Addicted

The infosthetics blog offers its "shopping guide for the data-addicted." I was intrigued by the chumby and nabaztag, two devices that offer the charms of the internet divorced from the keyboard/mouse/monitor setup. For the urban planner on your list, don't miss the fly swatter whose mesh is a street map of Milan. For the social science stats crowd, though, the best gift on the list has to be the Death and Taxes poster, depicting the US federal discretionary budget in remarkable detail and clarity. Click on the image below to get a close-up look at the poster.

Posted by Andy Eggers at 8:52 AM

November 30, 2007

Conference on Computational Social Science

IQSS is sponsoring a conference next Friday on the emerging area of computational social science. Below is the announcement:

The Conference on Computational Social Science (part of the Eric M. Mindich Conference series)

Friday, December 7, 2007
Center for Government and International Studies South, Tsai Auditorium (Room S010)
1730 Cambridge Street, Cambridge, MA

The development of enormous computational power and the capacity to collect enormous amounts of data has proven transformational in a number of scientific fields. The emergence of a computational social science has been slower than in the sciences. However, the combination of the still exponentially increasing computational power with a massive increase in the capturing of data about human behavior makes the emergence of a field of computational social science desirable, but not inevitable. The creation of a field of computational social science poses enormous challenges, but offers enormous promise to achieve the public good. The hope is that we can produce an understanding of the global network on which many global problems exist: SARS and infectious disease, global warming, strife due to cultural collisions, and the livability of our cities. That is, can sensing our society lead to a sensible society?

To solve these problems will require trading off privacy versus convenience, individual freedom versus societal benefit, and our sense of individuality versus group identity. How will we decide what the sensible society will look like? This conference brings together the wide array of individuals who are working in this emerging research area to discuss how we might address these global challenges, and to evaluate the potential emergence of a field of "computational social science."

Registration is required; more information is available here.

Posted by Mike Kellermann at 9:42 AM

November 15, 2007

Artsy Statistics

From Andrew Gelman, I saw a link to an interesting "art exhibit" that's actually all about statistics and language. In some ways it reminded me of this other art exhibit that's actually all about statistics -- in this case, the meaning of some of the very large numbers we read about all the time, but find difficult to grasp on an intuitive level.

Both are worth checking out online. And if you live somewhere that you can visit either, lucky you!

Posted by Amy Perfors at 9:47 AM

October 31, 2007

The statistics of race

Amy Perfors

There's an interesting article at Salon today about racial perception. As is normally the case for scientific articles reported in the mainstream media, I have mixed feelings about it.

1) First, a pet peeve: just because something can be localized in the brain using fMRI or similar techniques does not mean it's innate. This drives me craaazy. Everything that we conceptualize or do is represented in the brain somehow (unless you're a dualist, and that has its own major logical flaws). For instance, trained musicians devote more of their auditory processing regions to listening to piano music, and have a larger auditory cortex and larger areas devoted toward motor control of the fingers used to play their instrument. [cite]. This is (naturally, reasonably) not interpreted as meaning that playing the violin is innate, but that the brain can "tune itself" as it learns. [These differences are linked to amount of musical training, and are larger the younger the training began, which all supports such an interpretation]. The point is, localization in the brain != innateness. Aarrgh.

2) The article talks about what agent-based modeling has shown us, which is interesting:

Using this technique, University of Michigan political scientist Robert Axelrod and his colleague Ross Hammond of the Brookings Institution in Washington, D.C., have studied how ethnocentric behavior may have evolved even in the absence of any initial bias or prejudice. To make the model as simple as possible, they made each agent one of four possible colors. None of the colors was given any positive or negative ranking with respect to the other colors; in the beginning, all colors were created equal. The agents were then provided with instructions (simple algorithms) as to possible ways to respond when encountering another agent. One algorithm specified whether or not the agent cooperated when meeting someone of its own color. The other algorithm specified whether or not the agent cooperated with agents of a different color.

The scientists defined an ethnocentric strategy as one in which an agent cooperated only with other agents of its own color, and not with agents of other colors. The other strategies were to cooperate with everyone, cooperate with no one and cooperate only with agents of a different color. Since only one of the four possible strategies is ethnocentric and all were equally likely, random interactions would result in a 25 percent rate of ethnocentric behavior. Yet their studies consistently demonstrated that greater than three-fourths of the agents eventually adopted an ethnocentric strategy. In short, although the agents weren't programmed to have any initial bias for or against any color, they gradually evolved an ethnocentric preference for one's own color at the expense of those of another color.

Axelrod and Hammond don't claim that their studies duplicate the real-world complexities of prejudice and discrimination. But it is hard to ignore that an initially meaningless trait morphed into a trigger for group bias. Contrary to how most of us see bigotry and prejudice as arising out of faulty education and early-childhood indoctrination, Axelrod's model doesn't begin with preconceived notions about the relative values of different colors, nor is it associated with any underlying negative emotional state such as envy, frustration or animosity. Detection of a difference, no matter how innocent, is enough to result in ethnocentric strategies.

As I understand it, the general reason these experiments work the way they do is that the other strategies do worse given the dynamics of the game (single-interaction Prisoner's Dilemma): (a) cooperating with everyone leaves one open to being "suckered" by more people; (b) cooperating with nobody leaves one open to being hurt disproportionately by never getting the benefits of cooperation; and (c) cooperating with different colors is less likely to lead to a stable state.

Why is this last observation -- the critical one -- true? Let's say we have a red, orange, and yellow agent sitting next to each other, and all of them decide to cooperate with a different color. This is good, and leads to an increased probability of all of them being able to reproduce, and the next generation has two red, two yellow, and two orange agents. Now the problem is apparent: each of the agents is now next to an agent (i.e., the other one of its own color) that it is not going to cooperate with, which will hurt its chances of being able to survive and reproduce. By contrast, subsequent generations of agents that favor their own color won't have this problem. And in fact, if you remove "local reproduction" -- if an agent's children aren't likely to end up next to it -- then you don't get the rise of ethnocentrism... but you don't get much cooperation, either. (Again, this is sensible: the key is for agents to be able to essentially adapt to local conditions in such a way that they can rely on the other agents close to them, and they can't do that if reproduction isn't local). I would imagine that if one's cooperation strategy didn't tend to resemble the cooperation strategy of one's parents, you wouldn't see ethnocentrism (or much cooperation) either.
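The red/orange/yellow story can be traced in a few lines of code. This is only a toy sketch of the argument in this paragraph, not Axelrod and Hammond's model: I assume the agents sit on a line, everyone plays the same strategy, and each agent leaves two children right next to where it stood.

```python
def cooperates(strategy, my_color, other_color):
    """Does an agent with this strategy cooperate with a given neighbor?"""
    same = my_color == other_color
    if strategy == "ethnocentric":
        return same
    if strategy == "other_only":
        return not same
    if strategy == "everyone":
        return True
    return False  # "no_one"

def mutual_cooperation_share(colors, strategy):
    """Share of adjacent pairs in which both neighbors cooperate."""
    pairs = list(zip(colors, colors[1:]))
    mutual = sum(cooperates(strategy, a, b) and cooperates(strategy, b, a)
                 for a, b in pairs)
    return mutual / len(pairs)

def reproduce_locally(colors):
    """Each agent leaves two same-colored children in its own spot."""
    return [c for c in colors for _ in range(2)]

line = ["red", "orange", "yellow"]
history = []
for generation in range(3):
    row = (mutual_cooperation_share(line, "other_only"),
           mutual_cooperation_share(line, "ethnocentric"))
    history.append(row)
    print(generation, "other_only: %.2f" % row[0], "ethnocentric: %.2f" % row[1])
    line = reproduce_locally(line)
```

In the first generation, cooperating only with other colors is perfect: both neighboring pairs cooperate. After one round of local reproduction only 2 of 5 pairs do, and after two rounds only 2 of 11, while the ethnocentric share climbs from 0 to 3/5 to 9/11. Local reproduction keeps surrounding agents with their own color, which is exactly what undermines the "different color" strategy.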

3) One thing the article didn't talk about, but I think is very important, is how much racial perception may have to do with our strategies of categorization in general. There's a rich literature studying categorization, and one of the basic findings is of boundary sharpening and within-category blurring. (Rob Goldstone has been doing lots of interesting work in this area, for instance). Boundary sharpening refers to the tendency, once you've categorized X and Y as different things, to exaggerate their differences: if the categories containing X and Y are defined by differences in size, you would perceive the size difference between X and Y to be greater than it actually is. Within-category blurring refers to the opposite effect: the tendency to minimize the differences of objects within the same category -- so you might see two X's as being closer in size than they really are. This is a sensible strategy, since the more you do it, the better you'll be able to correctly categorize the boundary cases. However, it results in something that looks very much like stereotyping.

Research along these lines is just beginning, and it's too early to go from this observation to conclude that part of the reason for stereotyping is that it emerges from the way we categorize things, but I think it's a possibility. (There also might be an interaction with the cognitive capacity of the learning agent, or its preference for a "simpler" explanation -- the more the agent can't remember subtle distinctions, and the more the agent favors an underlying categorization with few groups or few subtleties between or within groups, the more these effects occur).

All of which doesn't mean, of course, that stereotyping or different in-group/out-group responses are justified or rational in today's situations and contexts. But figuring out why we think this way is a good way to start to understand how not to when we need to.

[*] Axelrod and Hammond's paper can be found here.

Posted by Amy Perfors at 2:32 PM

October 30, 2007

Clay Public Lecture: "Technology-driven statistics"

The Clay Mathematics Institute and the Harvard Mathematics Department are sponsoring a lecture by Terry Speed from the Department of Statistics at Berkeley on "Technology-driven statistics," with a focus on the challenges posed to statistical theory and practice by the massive amounts of data that are generated by modern scientific instruments (microarrays, mass spectrometers, etc.). These issues have not yet been as salient in the social sciences, but they are clearly on the horizon. The talk is at 7PM tonight (Oct. 30) in Science Center B at Harvard. The abstract for the talk is after the jump:

Technology-driven Statistics

Terry Speed, UC Berkeley and WEHI in Melbourne, Australia

Tuesday, October 30, 2007, at 7:00 PM

Harvard University Science Center -- Hall B

Forty years ago, biologists collected data in their notebooks. If they needed help from a statistician in analyzing and interpreting it, they would pass over a piece of paper with numbers on it. The theory on which statistical analyses were built a couple of decades earlier seemed entirely adequate for the task. When computers became widely available, analyses became easier and a little different, with the term "computer intensive" entering the lexicon. Now, in contemporary biology and many other areas, new technologies generate data whose quantity and complexity stretch both our hardware and our theory. Genome sequencing, genechips, mass spectrometers and a host of other technologies are now pushing statistics very hard, especially its theory. Terry Speed will talk about this revolution in data availability, and the revolution we need in the way we theorize about it.

Terry Speed splits his time between the Department of Statistics at the University of California, Berkeley and the Walter & Eliza Hall Institute of Medical Research (WEHI) in Melbourne, Australia. Originally trained in mathematics and statistics, he has had a life-long interest in genetics. After teaching mathematics and statistics in universities in Australia and the United Kingdom, and a spell in Australia's Commonwealth Scientific and Industrial Research Organization, he went to Berkeley 20 years ago. Since that time, his research and teaching interests have concerned the application of statistics to genetics and molecular biology. Within that subfield, eventually to be named bioinformatics, his interests are broad, including biomolecular sequence analysis, the mapping of genes in experimental animals and humans, and functional genomics. He has been particularly involved in the low level analysis of microarray data. Ten years ago he took the WEHI job, and now spends half of his time there, half in Berkeley, and the remaining half in the air somewhere in between.

Posted by Mike Kellermann at 12:08 AM

October 29, 2007

Visualizing Electoral Data

Andy Eggers and I are currently working on a project on UK elections. We have collected a new dataset that covers detailed information on races for the House of Commons between 1950 and 1970; seven general elections overall. We have spent some time thinking about new ways to visualize electoral data and Andy has blogged about this here and here. Today, I'd like to present a new set of plots that we came up with to summarize the closeness of constituency races over time. This is important for our project because we exploit close district races as a source of identification.

Conventional wisdom holds that in Britain, about one-quarter of all seats are 'marginal', i.e. decided by majorities of less than 10 percentage points. To visualize this fact, Andy and I came up with the following plot. Constituencies are on the x axis and the elections are on the y axis. Colors indicate the closeness of the district race (i.e. vote majority / vote sum), categorized into different bins as indicated in the color key on top. Color scales are from Colorbrewer. We have ranked the constituencies from close to safe from left to right. Please take a look:


The same plot is available as a pdf here. The conventional wisdom seems to hold. About 30 percent of the races are close. Also some elections are closer than others.
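For readers who want to build a similar plot, the closeness measure (vote majority / vote sum) and the binning step might be sketched as follows. The bin edges, labels, and vote counts here are made up for illustration; they are not the cut points used in the plot.

```python
from bisect import bisect_right

BIN_EDGES = [0.05, 0.10, 0.20]  # hypothetical cut points
BIN_LABELS = ["under 5%", "5-10%", "10-20%", "20% and over"]

def closeness(votes):
    """Winner's majority over the runner-up, as a share of all votes cast."""
    ranked = sorted(votes.values(), reverse=True)
    return (ranked[0] - ranked[1]) / sum(ranked)

def closeness_bin(margin):
    """Map a closeness value to the label of the bin it falls in."""
    return BIN_LABELS[bisect_right(BIN_EDGES, margin)]

# Invented constituency results, for illustration only
results = {
    "Seat A": {"Conservative": 30000, "Labour": 15000, "Liberal": 5000},
    "Seat B": {"Conservative": 21000, "Labour": 19000, "Liberal": 10000},
}

# Rank constituencies from close to safe, as on the x axis of the plot
ranking = sorted(results, key=lambda seat: closeness(results[seat]))
for seat in ranking:
    margin = closeness(results[seat])
    print(seat, round(margin, 3), closeness_bin(margin))
```

Seat B (a 4-point majority) sorts ahead of Seat A (a 30-point majority), so the closest races end up on the left.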

A long format of the plot is available here. It allows one to identify individual districts, but requires some scrolling. We are considering developing an interactive version using javascript so that additional info pops up as one mouses over the plot. Notice that both plots exclude the 50 or so districts that changed names as a result of the 1951 redistricting wave.

Finally, Andy and I care about districts that swing between the two major parties. To visualize this we have produced similar plots where the color now indicates the vote share margins as seen by the Conservative party: ((Conservative vote - Labour vote)/vote sum). So negative values indicate a Labour victory and positive values a victory of the Conservative party. We only look at districts where Labour or the Conservative party took first and second place. Here it is:


The partisan swings from election to election are really clear. Finally, the long format is here. The latter plot makes it easy to identify the party strongholds during this time period. Comments and suggestions are highly welcome. We wonder whether anybody has done such plots before or whether we can legitimately coin them as Eggmueller plots (lol).
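The signed margin used for the second set of plots, (Conservative vote - Labour vote) / vote sum, can be computed along these lines. The vote counts below are invented, and the helper that counts changes of party is my own addition for illustration, not part of the plots.

```python
def conservative_margin(conservative, labour, total):
    """Signed margin: positive for a Conservative win, negative for a Labour win."""
    return (conservative - labour) / total

def count_party_changes(margins):
    """Number of times the sign of the margin flips between consecutive elections."""
    return sum((a > 0) != (b > 0) for a, b in zip(margins, margins[1:]))

# One invented constituency over three elections: (Conservative, Labour, total votes)
elections = [(20000, 18000, 40000), (17000, 19000, 40000), (21000, 18000, 40000)]
margins = [conservative_margin(c, l, t) for c, l, t in elections]

print([round(m, 3) for m in margins])
print("changes of party:", count_party_changes(margins))
```

A seat whose margin keeps changing sign is a swing district; a seat whose margin stays large and of one sign across all seven elections is a stronghold.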

Posted by Jens Hainmueller at 8:13 PM

October 19, 2007

Tim McCarver is a Bayesian with very strong priors....

The Red Sox beat the Indians last night in Game 5 of the ALCS, sending the series back to Fenway and enabling the majority of us at Harvard who are (at least fair-weather) Sox fans to, as Kevin Youkilis said last night, come down off the bridge for a few more days. Why do I bring this up? Well, after Boston's loss in Game 4, a commenter on this blog asked the following question:

In the disastrous inning of the Red Sox game tonight, the announcer (maybe Tim McCarver?) said “One would think that a lead-off walk would lead to more runs than a lead-off home-run, but it’s not true. We’ve researched it and this year a lead-off home-run has led to more multi-run innings than have lead-off walks.”

I must not be "one", b/c I think a lead-off home-run is much more likely to lead to multiple-run innings, b/c after the home-run, you have a run and need only 1 more to have multiple, and the actions after the first batter are mostly independent of the results of the first batter. So, I think he has it totally backwards. I was a fair stats student, so I need confirmation. He was backwards, right?

The short answer is that it was Tim McCarver, and as an empirical matter he was wrong to be surprised. I don't have access to full inning-by-inning statistics over a long period of time, but the most convincing analysis I found in a quick search (here) suggests that between 1974 and 2002, the probability of a multi-run inning conditional on a leadoff walk is .242 and the probability of a multirun inning after a leadoff home run is .276.

The blogosphere has had a lot of fun at McCarver's expense (not that it takes much to provoke such a reaction, granted): It's Math!, Zero > One, Tim McCarver Does Research, etc. His observation, though, is a good example of Bayesian updating at work: while I doubt that most baseball observers "would think that a lead-off walk would lead to more runs than a lead-off home-run," it is very clear that Tim McCarver thought that at some point. As evidence, in a 2006 game he made the following comment:

"There is nothing that opens up big innings any more than a leadoff walk. Leadoff home runs don't do it. Leadoff singles, maybe. But a leadoff walk. It changes the mindset of a pitcher. Since he walked the first hitter, now all of a sudden he wants to find the fatter part of the plate with the succeeding hitters. And that could make for a big inning."

In 2004, he said during the Yankees-Red Sox ALCS that "a walk is as good as a home run." And back in 2002, he made a similar comment during the playoffs; in fact, it was that comment that prompted the analysis that I linked to above! Clearly, he had a strong prior belief (from where, I don't know) that leadoff walks somehow get in the pitcher's head and produce more big innings. Now that he's been confronted by data, those beliefs are updating, but since his posterior has shifted so much from his prior it's not surprising that he thinks this is some great discovery. In a couple of years, he'll probably think that he always knew a leadoff home run was better.

As for the intuition, it looks like the commenter is also correct. Using the data cited above, the probability of scoring zero runs in an inning is approx. .723, while the probability of scoring no additional runs after a leadoff homer is approx. .724; the rest of the distribution is similar as well.
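The commenter's argument boils down to one line of arithmetic: after a leadoff homer, the inning is a multi-run inning as soon as one more run scores. Using only the probabilities quoted above (the variable names are mine):

```python
p_zero_runs_in_inning = 0.723       # any inning, from the analysis cited above
p_no_more_runs_after_homer = 0.724  # after a leadoff home run
p_multi_after_walk = 0.242          # multi-run inning given a leadoff walk

# After a leadoff homer, one more run makes it a multi-run inning
p_multi_after_homer = 1 - p_no_more_runs_after_homer

# For comparison: the chance of scoring at least once in an ordinary inning
p_score_in_inning = 1 - p_zero_runs_in_inning

print("multi-run after leadoff homer: %.3f" % p_multi_after_homer)
print("at least one run, any inning:  %.3f" % p_score_in_inning)
print("multi-run after leadoff walk:  %.3f" % p_multi_after_walk)
```

The first number reproduces the .276 reported above, and it is nearly identical to the chance of scoring at all in an ordinary inning (.277). In other words, what happens after the homer looks just like a fresh inning, which is the independence the commenter assumed.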

Posted by Mike Kellermann at 1:02 PM

October 18, 2007

R Quiz Anybody?

Perl has the Perl quiz, Python has the Python challenges, Ruby has the Ruby quiz, but what about our good old friend R?? Does such a thing exist anywhere? Would be a nice idea I think...

Posted by Jens Hainmueller at 8:52 PM

October 17, 2007

How tall are you? No, really...

Continuing on the topic of self-reported health data, and how to correct for reporting (and other) biases, here is an interesting paper on height and weight in the US. Those two measures have received a lot of interest in the past years, not least as components of the body-mass index (BMI), which is used to estimate the prevalence of obesity. BMI itself is not a great measure (more on that another day) but at least it’s relatively easy to collect via telephone and in-person interviews. Of course some people make mistakes while reporting their own vital measures, and some might do so systematically: a height of 6 foot sounds like a good height to have even to me, and I tend to think in the metric system!

Anyway, the paper by Ezzati et al. examines the issue of systematic misreporting. They note that existing smaller-scale studies on this issue might in fact under-estimate the bias because of their design. People might limit their misreporting if they are measured before or after reporting their vitals, which is a challenge for validation studies. And participation might differ systematically with the interview modes of the validation studies and of general health surveys (e.g. in-person versus telephone interviews), so that the studies are not directly comparable to population-level surveys.

The idea of the paper is to employ two nationally representative surveys to compare three different kinds of measurement for height and weight, by age group and gender. The first survey is the National Health and Nutrition Examination Survey NHANES which collects self-reported information through in-person interviews, and also through medical examination. The second survey is the Behavior and Risk Factor Surveillance Survey BRFFS, an annual cross-sectional telephone survey that is state-level representative and features widely in policy discussions.

The comparisons between the surveys might confirm your priors on misreporting. On average, women under-report their weight and men under 65 tend to over-report their height. The authors find that state-level obesity measures based on the BRFSS are too low – they re-calculate that a number of states in fact had obesity prevalences above 30% in 2000. Of course this is not a perfectly clean assessment, because the NHANES participants might have anticipated the clinical examination a few weeks after the in-person interview. But at the least this study is a good reminder that people do systematically misreport for some reason, and that analysts should treat self-reported BMI carefully.
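To see why small reporting errors matter for obesity estimates, here is a minimal Python sketch of the BMI arithmetic. The person and the size of the misreporting are hypothetical illustrations, not figures from the Ezzati et al paper; the only fixed ingredients are the standard BMI formula (kilograms over metres squared) and the conventional obesity cut-off of 30.

```python
def bmi(weight_kg, height_m):
    """Body-mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

OBESE = 30.0  # conventional BMI cut-off for obesity

# Hypothetical person as measured in a clinic (illustrative numbers only).
measured = bmi(weight_kg=92.0, height_m=1.75)
# The same person reporting 3 kg less weight and 2 cm more height.
reported = bmi(weight_kg=89.0, height_m=1.77)

print(f"measured BMI: {measured:.1f}, obese: {measured >= OBESE}")
print(f"reported BMI: {reported:.1f}, obese: {reported >= OBESE}")
```

Because height enters squared and the cut-off is a hard threshold, even modest shading of both numbers can move someone near the threshold out of the obese category, which is the direction of bias the paper describes for survey-based prevalence.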

Posted by Sebastian Bauhoff at 10:23 PM

October 10, 2007

Visualizing the evolution of open-edited text

Today's applied stats talk by Fernanda Viegas and Martin Wattenberg covered a wide array of interesting data visualization tools that they and their colleagues have been developing over at IBM Research. One of the early efforts that they described is an applet called History Flow, which allows users to visualize the evolution of a text document that was edited by a number of people, such as Wikipedia entries or computer source code. You can track which authors contributed over time, how long certain parts of the text have remained in place, and how text moves from one part of the document to another. To give you a flavor of what is possible, here is a visualization of the history of the Wikipedia page for Gary King (who is the only blog contributor who has one at the moment):


This shows how the page became longer over time and that it was primarily written by one author. The applet also allows you to connect textual passages from earlier versions to their authors. We noticed this one from Gary's entry:


"Ratherclumsy"'s contribution to the article only survived for 24 minutes, and was deleted by another user with best wishes for becoming "un-screwed". All kidding aside, this is a really interesting tool for text-based projects. Leaving aside the possibility for analysis, this would be useful for people working on coding projects. I can think of more than one R function that I've worked on where it would be nice to know who wrote a particular section of code....

Posted by Mike Kellermann at 5:52 PM

October 8, 2007

Fernanda Viegas and Martin Wattenberg on Data Visualization

Dear Applied Statistics Community,

Please join us for this week's installment of the Applied Statistics workshop, where Fernanda Viegas and Martin Wattenberg will be presenting their talk entitled "From Wikipedia to Visualization and Back". The authors provided the following abstract for their talk:

This talk will be a tour of our recent visualization work, starting with a case study of how a new data visualization technique uncovered dramatic dynamics in Wikipedia. The technique sheds light on the mix of dedication, vandalism, and obsession that underlies the online encyclopedia. We discuss the reaction of the Wikipedia community to this visualization, and how it led to a recent ambitious project to make data visualization technology available to everyone. This project, Many Eyes, is a web site where people may upload their own data, create interactive visualizations, and carry on conversations. The goal is to foster a social style of data analysis in which visualizations serve not only as a discovery tool for individuals but also as a means to spur discussion and collaboration.

Martin and Fernanda have also provided the following set of links as background for the presentation:

And to a website based upon recent work in data visualization

Link to Many Eyes site:

As always, the workshop meets at 12 noon on Wednesday, in room N-354 CGIS-Knafel. A light lunch will be provided.

Posted by Justin Grimmer at 12:02 PM

October 4, 2007

Another way of thinking about probability?

Amy Perfors

On Tuesday I went to a talk by Terrence Fine from Cornell University. It was one of those talks that's worth going to, if nothing else because it makes you re-visit and re-question the sort of basic assumptions that are so easy to not even notice that you're making. In this case, that basic assumption was that the mathematics of probability theory, which views probability as a real number between 0 and 1, is equally applicable to any domain where we want to reason about statistics.

Is this a sensible assumption?

As I understand it, Fine made the point that in many applied fields, what you do is start from the phenomenon to be modeled and then use the mathematical/modeling framework that is appropriate to it. In other words, you go from the applied "meaning" to the framework: e.g., if you're modeling dynamical systems, then you decide to use differential equations. What's odd in applications of probability theory, he said, is that you basically go from the mathematical theory to the meaning: we interpret the same underlying math as having different potential meanings, depending on the application and the domain.

He discussed four different applications, which are typically interpreted in different ways: physically-determined probability (e.g., statistical mechanics or quantum mechanics); frequentist probability (i.e., more data driven); subjective probability (in which probability is interpreted as degree of belief); and epistemic/logical (in which probability is used to characterize inductive reasoning in a formal language). Though I broadly agree with these distinctions, I confess I'm not getting the exact subtleties he must be making: for instance, it seems to me the interpretation of probability in statistical mechanics is arguably very different from that in quantum mechanics, and they should therefore not be lumped together: in statistical mechanics, the statistics of flow arise from some underlying variables (i.e., the movements of individual particles), and in quantum mechanics, as I understand it, there aren't any "hidden variables" determining the probabilities at all.

But that technicality aside, the main point he made is that depending on the interpretation of probability and the application we are using it for, our standard mathematical framework -- in which we reason about probabilities using real numbers between 0 and 1 -- may be inappropriate because it is either more or less expressive than necessary. For instance, in the domain of (say) IQ, numerical probability is probably too expressive -- it is not sensible or meaningful to divide IQs by each other; all we really want is an ordering (and maybe even a partial ordering, if, as seems likely, the precision of an IQ test is low enough that small distinctions aren't meaningful[1]). So a mathematics of probability which views it in that way, Fine argues, would be more appropriate than the standard "numerical" view.

Another example would be in quantum mechanics, where we actually observe a violation of some axioms of probability. For instance, the distributivity of union and intersection fails: P(A or B) != P(A)+P(B)-P(A and B). This is an obvious place where one would want to use a different mathematical framework, but since (as far as I know) people in quantum mechanics actually do use such a framework, I'm not sure what his point was. Other than it's a good example of the overall moral, I guess?
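For reference, the identity in question is one that ordinary (Kolmogorov) probability always satisfies. The following Python sketch just checks it on a fair die; the events A and B are hypothetical choices for illustration, and nothing here models the quantum case, where the failure Fine described is said to arise.

```python
from fractions import Fraction

# Classical probability on a fair six-sided die.
omega = set(range(1, 7))

def prob(event):
    """Probability of an event under the uniform measure on omega."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is at least 4

lhs = prob(A | B)                      # P(A or B)
rhs = prob(A) + prob(B) - prob(A & B)  # P(A) + P(B) - P(A and B)
print(lhs, rhs, lhs == rhs)
```

Any pair of events defined as subsets of a single sample space will pass this check, which is why a violation signals that the "events" cannot all live in one such space.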

Anyway, the talk was interesting and thought-provoking, and I think it's a good idea to keep this point in the back of one's mind. That said, although I can see why he's arguing that different underlying mathematics might be more appropriate in some cases, I'm not convinced yet that we can conclude that using a different underlying mathematics (in the case of IQ, say) would therefore lead to new insight or help us avoid misconceptions. One of the reasons numerical probability is used so widely -- in addition to whatever historical entrenchment there is -- is that it is an indispensable tool for doing inference, reasoning about distributions, etc. It seems like replacing it with a different sort of underlying math might result in losing some of these tools (or, at the very least, require us to spend decades re-inventing new ones).

Of course, other mathematical approaches might be worth it, but at this point I don't know how well-worked out they are, and -- speaking as someone interested in the applications -- I don't know if they'd be worth the work in order to see. (They might be; I just don't know... and, of course, a pure mathematician wouldn't care about this concern, which is all to the good). Fine gave a quick sketch of some of these alternative approaches, and I got the sense that he was working on developing them but they weren't that well developed yet -- but I could be totally wrong. If anyone knows any better, or knows of good references on this sort of thing, please let us know in comments. I couldn't find anything obvious on his web page.

[1] I really really do not want to get into a debate about whether and to what extent IQ in general is meaningful - that question is really tangential to the point of this post, and I use IQ as illustration only. (I use it rather than something perhaps less inflammatory because it's the example Fine used).

Posted by Amy Perfors at 12:40 PM

June 20, 2007

SPM Career Achievement Award

The Society for Political Methodology has announced the winner of its inaugural Career Achievement Award. The first recipient will be Chris Achen, currently the Roger Williams Straus Professor of Social Sciences at Princeton University. The award will be presented at the APSA meeting this summer at the society's business meeting. Chris was chosen to receive the award by a committee consisting of Simon Jackman, Mike Alvarez, Liz Gerber and Marco Steenbergen, and their citation does a fine job of summarizing his many accomplishments over the years.

On a personal note, Chris was my senior thesis advisor back in 00-01 when he was at Michigan. That came about through a bit of luck; I had never taken a class from him, and one of the other professors at Michigan asked him to meet with me as a favor. Despite this, he was unfailingly generous with both support and constructive criticism. At least at the time, Chris had the habit of working rather late in the evenings. When I was working on my thesis, I'd often send him an e-mail asking a few questions when I left the computer lab at night, and by the time I got home there would be an answer in my inbox pointing out what I had missed or suggesting some new approach to try. If Chris hadn't taken me on as an advisee back then, I probably would not be in graduate school today.

The citation follows on the jump:

Christopher H. Achen is the inaugural recipient of the Career Achievement Award of the Society for Political Methodology. Achen is the Roger Williams Straus Professor of Social Sciences in the Woodrow Wilson School of Public and International Affairs, and Professor of Politics in the Department of Politics, at Princeton University. He was a founding member and first president of the Society for Political Methodology, and has held faculty appointments at the University of Michigan, the University of California, Berkeley, the University of Chicago, the University of Rochester, and Yale University. He has a Ph.D. from Yale, and was an undergraduate at Berkeley.

In the words of one of the many colleagues writing to nominate Achen for this award, "Chris more or less made the field of political methodology". In a series of articles and books now spanning some thirty years, Achen has consistently reminded us of the intimate connection between methodological rigor and substantive insights in political science. To summarize (and again, borrowing from another colleague's letter of nomination), Achen's methodological contributions are "invariably practical, invariably forceful, and invariably presented with clarity and liveliness". In a series of papers in the 1970s, Chris basically showed us how to do political methodology, elegantly demonstrating how methodological insights are indispensable to understanding a phenomenon as central to political science as representation. Achen's "little green Sage book", Interpreting and Using Regression (1982), has remained in print for 25 years, has provided generations of social scientists with a compact yet rigorous introduction to the linear regression model (the workhorse of quantitative social science), and is probably the most widely read methodological book authored by a political methodologist. Achen's 1983 review essay "Towards Theories of Data: The State of Political Methodology" set an agenda for the field that still powerfully shapes both the practice of political methodology and the field's self-conception. Achen's 1986 book The Statistical Analysis of Quasi-Experiments provides a brilliant exposition of the statistical problems stemming from non-random assignment to "treatment", a topic very much in vogue again today. Achen's 1995 book with Phil Shively, Cross-Level Inference, provides a similarly clear and wise exposition of the issues arising when aggregated data are used to make inferences about individual behavior ("ecological inference").
A series of papers on party identification --- an influential 1989 conference paper, "Social Psychology, Demographic Variables, and Linear Regression: Breaking the Iron Triangle in Voting Research" (Political Behavior, 1992) and "Parental Socialization and Rational Party Identification" (Political Behavior, 2002) --- have helped formalize the "revisionist" theory of party identification outlined by Fiorina in his 1981 Retrospective Voting book, and now the subject of a lively debate among scholars of American politics.

In addition to being a productive and extremely influential scholar, Achen has an especially distinguished record in training graduate students in methodology, American politics, comparative politics, and international relations. His students at Berkeley in the late 1970s and early 1980s included Larry Bartels (now at Princeton), Barbara Geddes (UCLA), Steven Rosenstone (Minnesota), and John Zaller (UCLA), among many others. His students at Michigan in the 1990s include Bear Braumoeller (now at Harvard), Ken Goldstein (Wisconsin), Simon Hug (Texas-Austin), Anne Sartori (Princeton), and Karen Long Jusko (Stanford). In addition to being the founding president of the Society for Political Methodology, Chris has been a fellow at the Center for Advanced Study in the Behavioral Sciences, has served as a member of the APSA Council, has won campus-wide awards for both research and teaching, and is a member of the American Academy of Arts and Sciences.

Posted by Mike Kellermann at 11:23 PM

June 13, 2007

Statistics and the Death Penalty

A few days ago, the AP moved a story reporting on academic studies of the deterrent effect of the death penalty on potential murderers. Many media outlets picked up the story under headlines such as "Studies say death penalty deters crime", "Death penalty works: studies", and my favorite, "Do more executions mean fewer murders?" Presumably the answer to the last question is yes, at least in the limit; if the state were to execute everyone (except the executioner, of course), clearly there would be fewer murderers.

I was surprised when I read the article on Monday morning, since my sense of the state of play in this area is that it is probably impossible to tell one way or the other. Those are the findings of a recent study by Donohue and Wolfers, which finds most existing studies to be flawed and, more importantly, points out a variety of reasons why estimating the correct deterrent effect is difficult in principle. Here is some of what Andrew Gelman had to say about their study last year:

My first comment is that death-penalty deterrence is a difficult topic to study. The treatment is observational, the data and the effect itself are aggregate, and changes in death-penalty policies are associated with other policy changes.... Much of the discussion of the deterrence studies reminds me of a little-known statistical principle, which is that statisticians (or, more generally, data analysts) look best when they are studying large, clear effects. This is a messy problem, and nobody is going to come out of it looking so great.

My second comment is that a quick analysis of the data, at least since 1960, will find that homicide rates went up when the death penalty went away, and then homicide rates declined when the death penalty was re-instituted (see Figure 1 of the Donohue and Wolfers paper), and similar patterns have happened within states. So it's not a surprise that regression analyses have found a deterrent effect. But, as noted, the difficulties arise because of the observational nature of the treatment, and the fact that other policies are changed along with the death penalty. There are also various technical issues that arise, which Donohue and Wolfers discussed.

Given the tone of the article (and certainly the headlines), you would have thought that the Donohue and Wolfers paper had been overlooked by the reporter, but no: he cites it in the article, and he interviewed Justin Wolfers! He seems to have missed the point, however; the issue is not that some studies say that "there is a deterrent effect" and some say "we're just not sure yet". The problem is that we aren't sure, and we probably never will be unless someone gets to randomly assign death penalty policy to states or countries. This raises a problem that we often face in social science: there are questions that are interesting, and there are questions that we can answer, and the intersection of those two categories is probably a lot smaller than any of us would like. This doesn't seem to be a realization that has crept into the media as of yet, so it is no surprise that studies that purport to give answers to interesting questions will get more coverage than those pointing out why those answers probably don't mean very much.

Posted by Mike Kellermann at 4:19 PM

June 7, 2007

Gosnell Prize Winner

Congratulations to the 2007 Gosnell Prize winners - Harvard's very own Alberto Abadie, Alexis Diamond, and Jens Hainmueller! They won for their paper "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program", which was presented at this year's MPSA conference in Chicago. We saw an earlier version of the paper this past semester at the Applied Stats workshop, and I have to say, the award is well deserved. The Gosnell Prize is awarded to the best paper presented at any political science conference in the preceding year. Alexis is a two-time recipient, having shared the award with Jas Sekhon in 2005 for their paper on genetic matching.

Posted by Mike Kellermann at 3:51 PM

June 5, 2007

Stata 10 announced

Yesterday, StataCorp announced that Stata 10 will be available from June 25. Apart from a bunch of new routines, a main attraction will be their new graph editor, which might well resolve major nightmares for users. It also appears that there is now a way to copy & paste results to other applications without losing the formatting. Overall the new version looks great, if you're so inclined.

Here is the announcement sent out on Statalist yesterday, and a longer description on the StataCorp website.

Posted by Sebastian Bauhoff at 8:46 AM

May 23, 2007

Disclosing clinical trials

The New York Times has an article today ("For Drug Makers, a Downside to Full Disclosure") discussing the recent creation of archives for pharmaceutical clinical trial data, including data from trials that did not result in publications. This effort is an attempt to deal with the age-old problem of publication bias, a problem supposedly identified by the ancient Greeks, as described in a letter to the editor of Lancet by Mark Petticrew:

The writings of Francis Bacon (1561-1626) are a good starting point. In his 1605 book, The Advancement of Learning, he alludes to this particular bias by pointing out that it is human nature for "the affirmative or active to effect more than the negative or privative. So that a few times hitting, or presence, countervails oft-times failing or absence". This is a clear description of the human tendency to ignore negative results, and Bacon would be an acceptable father figure. Bacon, however, goes further and supports his claim with a story about Diagoras the Atheist of Melos, the fifth century Greek poet.

Diagoras was the original atheist and free thinker. He mocked the Eleusinian mysteries, an autumnal fertility festival which involved psychogenic drug-taking, and was outlawed from Athens for hurling the wooden statue of a god into a fire and sarcastically urging it to perform a miracle to save itself. In the context of publication bias, his contribution is shown in a story of his visit to a votive temple on the Aegean island of Samothrace. Those who escaped from shipwrecks or were saved from drowning at sea would display portraits of themselves here in thanks to the great sea god Neptune. "Surely", Diagoras was challenged by a believer, "these portraits are proof that the gods really do intervene in human affairs?" Diagoras' reply cements his claim to be the "father of publication bias": "yea, but . . . where are they painted that are drowned?"

While dealing with publication bias would seem to be a good thing, the Times article suggests (perhaps in an attempt to avoid publication bias itself) that some people are worried about this practice:

Some experts also believe that releasing the results of hundreds of studies involving drugs or medical devices might create confusion and anxiety for patients who are typically not well prepared to understand the studies or to put them in context.

“I would be very concerned about wholesale posting of thousands of clinical trials leading to mass confusion,” said Dr. Steven Galson, the director for the Center for Drug Evaluation and Research at the F.D.A.

It is a little hard for me to believe that this confusion would be worse than the litany of possible side effects given at the end of every pharmaceutical commercial, but that is a different issue. From a purely statistical point of view, it seems like this is a no-brainer, a natural extension of efforts to ensure that published results can be replicated. Whether you are a frequentist or a Bayesian, inferences should be better when conditioned on all of the data that has been collected, not just the data that researchers decided to use in their publications. There could be a reasonable argument about what to do with (and how to define) corrupted data - data from trials that blew up in one way or another - but this seems like a second-order consideration.
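The statistical point can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the true effect, the number of trials, the sample size per trial, and the rule that a trial is "published" only when it is significantly positive are all hypothetical choices, not a description of any real archive or drug.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 0.1  # hypothetical true treatment effect
N_TRIALS = 2000    # hypothetical number of trials that were run
N_PER_TRIAL = 50   # observations per trial, with unit-variance noise

def run_trial():
    """Return (estimated effect, z statistic) for one simulated trial."""
    draws = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N_PER_TRIAL)]
    estimate = statistics.fmean(draws)
    std_error = statistics.stdev(draws) / N_PER_TRIAL ** 0.5
    return estimate, estimate / std_error

trials = [run_trial() for _ in range(N_TRIALS)]
all_estimates = [est for est, _ in trials]
# "Published" here means significantly positive at roughly the 5% level.
published = [est for est, z in trials if z > 1.96]

print(f"mean of all trials:       {statistics.fmean(all_estimates):.3f}")
print(f"mean of published trials: {statistics.fmean(published):.3f}")
print(f"share published:          {len(published) / N_TRIALS:.2f}")
```

Averaging every trial recovers something close to the true effect, while averaging only the "published" ones overstates it, because the selection rule keeps exactly the trials whose noise happened to push the estimate up.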

It would be great if we could extend this effort into the social sciences. It would be easier to do this for experimental work since the data collection process is generally well defined. On the other hand, I suspect that there is less of a need for archives of experimental data in the social sciences, for two reasons. First, experimental work is still rare enough (at least in political science) that I think you have a decent chance of getting published even with "non-results". Second, my sense is that, with the possible exception of researchers closely associated with particular policy interventions, the incentives facing social scientists are not the same as those facing pharmaceutical researchers. Social scientists may have a preference for "significant" results, but in most cases they don't care as much about the direction.

The kind of data archive described above would be more useful for observational research, but much harder to define. Most social scientists have invested significant time and energy collecting observational data only to find that there are no results that reviewers would think were worth publishing. On the other hand, how do we define a trial for observational data? Should there be an obligation to make one's data available any time that it is collected, or should it be restricted to data that has been analyzed and found uninteresting? Or should we think of data and models together, and ask researchers to share both their data and their analysis? I'm not sure what the answer is, but it is something that we need to think about as a discipline.

Posted by Mike Kellermann at 7:18 PM

May 22, 2007

Statistics and the law

Over at the Volokh Conspiracy, Professor Einer Elhauge from Harvard Law School has a post about the future of empirical legal studies, comparing the law today to baseball before the rise of sabermetrics. From the post:

In short, in law, we are currently still largely in the position of the baseball scouts lampooned so effectively in Moneyball for their reliance on traditional beliefs that had no empirical foundation. But all this is changing. At Harvard Law School, as traditional a place as you can get, we now have by my count 10 professors who have done significant statistical analysis of legal issues. We just hired our first JD with a PhD in statistics. The movement is not at all limited to Harvard, and seems to be growing at all law schools.

So we are hardly devoid of empirical analysis of law. We are just, rather, in our early Bill James era, and can expect the analysis to get more sophisticated and systematic as things progress. I expect within a couple of decades we will have our own book distilling the highlights of things we will know then that conflict with what is now conventional legal wisdom.

We are all pretty pleased that Harvard Law now has a stats Ph.D. on faculty. But one of the commenters raises an interesting question; if empirical legal studies are like sabermetrics, who is the legal equivalent of Joe Morgan?

Posted by Mike Kellermann at 8:49 AM

May 10, 2007

Surveying Multiethnic America

The Program on Survey Research at Harvard is hosting an afternoon conference tomorrow on the challenges of surveying multiethnic populations:

Surveying Multiethnic America

May 11, 2007
12:30 – 5:00

Institute for Quantitative Social Science
CGIS N-050
1737 Cambridge St.
Cambridge, MA 02138

Across a variety of different academic disciplines, scholars are interested in topics related to multiethnic populations, and sample surveys are one of the primary means of studying these populations. Surveys of multiethnic populations face a number of distinctive methodological challenges, including issues related to defining and measuring ethnic identity, and locating, sampling, and communicating with the groups of interest.

This afternoon panel sponsored by the Program on Survey Research at Harvard University will look at recent survey research projects on multiethnic populations in the US. Researchers will discuss how they confronted the unique methodological challenges in their survey projects and will consider the implications of their approach for their key theoretical and empirical findings.


12:30 - 2:45

Sunshine Hillygus, Harvard University, Introduction

Manuel de la Puente, US Bureau of the Census, Current Issues in Multiethnic Survey Methods

Guillermina Jasso, New York University, New Immigrant Study

Deborah Schildkraut, Tufts University, The 21st Century Americanism Study

Yoshiko Herrera, Harvard University, Discussant

3:00 - 5:00

Tami Buhr, Harvard University, Harvard Multi-Ethnic Health Survey

Ronald Brown, Wayne State University, National Ethnic Pluralism Survey

Valerie Martinez-Ebers, Texas Christian University, National Latino Politics Survey

Kim Williams, Harvard University, Discussant

Posted by Mike Kellermann at 12:05 PM

May 9, 2007

What's your optimal GPA?

Amy Perfors

This may not be new to anybody but me, but recent news at UNC brought the so-called "Achievement Index" to my attention. The Achievement Index is a way of calculating GPA that takes into account not only how well one performs in a class, but also how hard the class is relative to others in the institution. It was first suggested by Valen Johnson, a professor of statistics at Duke University, in a paper in Statistical Science titled "An Alternative to Traditional GPA for Evaluating Student Performance." (The paper is available on his website; you can also find a more accessible pdf description here).

This seems like a great idea to me. The model, which is Bayesian, calculates "achievement index" scores for each student as latent variables that best explain the grade cutoffs for each class in the university. As a result, it captures several phenomena: (a) if a class is hard and full of very good students, then a high grade is more indicative of ability (and a low grade less indicative of lack of ability); (b) if a class is easy and full of poor students, then a high grade doesn't mean much; (c) if a certain instructor always gives As then the grade isn't that meaningful -- though it's more meaningful if the only people who take the class in the first place are the extremely bright, hard-working students. Your "achievement index" score thus reflects your actual grades as well as the difficulty level of the classes you have chosen.
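To make the intuition concrete, here is a toy Python sketch. It is emphatically not Johnson's model, which is a Bayesian latent-variable model of grade cutoffs; as a much cruder stand-in it fits grade = ability[student] + leniency[course] by alternating averages. The students, courses, and grades are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical transcript: (student, course) -> grade points.
grades = {
    ("ana", "easy"): 4.0, ("ana", "hard"): 3.0,
    ("ben", "easy"): 4.0, ("ben", "hard"): 2.0,
    ("cat", "easy"): 4.0,
    ("dan", "hard"): 3.0,
}

def group_means(pairs):
    """Average an iterable of (key, value) pairs by key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return {key: fmean(vals) for key, vals in buckets.items()}

def adjusted_scores(grades, n_iter=200):
    """Fit grade ~ ability[student] + leniency[course] by alternating means."""
    ability = group_means((s, g) for (s, _), g in grades.items())  # start at raw GPA
    leniency = {}
    for _ in range(n_iter):
        # A course is lenient if its grades exceed its students' abilities.
        leniency = group_means((c, g - ability[s]) for (s, c), g in grades.items())
        centre = fmean(leniency.values())  # pin the mean leniency at zero
        leniency = {c: v - centre for c, v in leniency.items()}
        # A student is able if their grades exceed their courses' leniency.
        ability = group_means((s, g - leniency[c]) for (s, c), g in grades.items())
    return ability, leniency

raw_gpa = group_means((s, g) for (s, _), g in grades.items())
ability, leniency = adjusted_scores(grades)
for student in sorted(ability, key=ability.get, reverse=True):
    print(f"{student}: raw GPA {raw_gpa[student]:.2f}, adjusted {ability[student]:.2f}")
```

In this made-up transcript, cat has the highest raw GPA from a single A in the easy course, but after the adjustment dan, who earned a B in the hard course, ranks above her. That reordering is the kind of effect the Achievement Index is meant to capture, though the real model gets there by a different and more careful route.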

Why isn't this a standard measure of student performance? 10 years ago it was proposed at Duke but failed to pass, and at UNC they are currently debating it -- but what about other universities? The Achievement Index addresses multiple problems. There would be less pressure toward grade inflation, for one thing. For another, it would address the unfortunate tendency of students to avoid "hard" classes for fear of hurting their GPA. Students in hard majors or taking hard classes also wouldn't be penalized in university-wide, GPA-based awards.

One might argue that students shouldn't avoid hard classes simply because of their potential grade, and I tend to agree that they shouldn't -- it was a glorious moment in my own college career when I finally decided "to heck with it" and decided to take the classes that interested me, even if they seemed really hard. But it's not necessarily irrational for a student to care about GPA, especially if important things -- many of which I didn't have to worry about -- hinge on it: things like scholarships or admission to medical school. Similarly, instructors shouldn't inflate grades and create easy classes, but it is often strictly "rational" to do so: giving higher grades can often mean better evaluations and less stress due to students whinging for a higher grade, and easier classes are also easier to teach. Why not try to create a system where the rational thing to do within that system is also the one that's beneficial for the university and the student in the long run? It seems like the only ones who benefit from the current system are the teachers who inflate their grades and teach "gimme" courses and the students who take those easy courses. The ones who pay are the teachers who really seek to challenge and teach their students, and the students who want to learn, who are intellectually curious and daring enough to take courses that challenge them. Shouldn't the incentive structure be the opposite?

I found a petition against the Achievement Index online, and I'm not very persuaded by their arguments. One problem they have is that it's not transparent how it works, which I could possibly see being a concern... but there are two kinds of transparency, and I think only one really matters. If it's not transparent because it's biased or subjective, then that's bad; but if it's not transparent simply because it's complicated (as this is), but is in fact totally objective and how it works is published - then, well, it's much less problematic. Sometimes complicated is better: other things that matter a great deal for our academic success -- such as SATs and GREs -- aren't all that transparent either, and they are still very valuable. The petition also argues that using the AI system will make students more competitive with each other, but I confess I don't understand this argument at all: how will it increase competition above and beyond the standard GPA?

Anyway, it might seem like I'm being fairly dogmatic about the greatness of the Achievement Index, but I don't intend to be. I have no particular bone to pick, and I got interested in this issue originally mainly just because I wanted to understand the model. It's simply that I don't really see any true disadvantages and I wonder what I'm missing. Why don't more universities try to implement it? Can anyone enlighten me?

Posted by Amy Perfors at 10:20 AM

May 8, 2007

Data for Replications

We have blogged a fair bit about reproducibility standards and data-sharing for replication (see here and here). Some journals have required authors to make datasets and code available for a while already, and these policies are now starting to show effects. For example, the American Economic Review has required authors to submit their data since 2004, and this information is now available on their website. The AER provides a basic readme document and files with the variables used for an increasing number of articles since late 2002; some authors also provide their program code. There's a list of articles with available data here.

The 2006 Report of the Editor suggests that most authors now comply with the data posting requirements and that only a few exceptions are made. At this point the AER is pretty much alone among the top economics journals in offering this information. I wonder if authors substitute between the AER and other journals. Since the AER is still a very desirable place to publish, maybe this improves the quality of AER submissions if only confident authors submit? At least for now the submission statistics in the editor’s report don't suggest that they are losing authors. Meanwhile hundreds of grad students can rejoice in a wealth of interesting papers to replicate.

Posted by Sebastian Bauhoff at 11:33 AM

May 7, 2007

No Applied Stats Workshop until September

Just as a reminder, the Applied Statistics Workshop has wrapped up for this academic year. Thanks to all who came to the talks, and we look forward to seeing you again in September.

Posted by Mike Kellermann at 1:38 PM

May 2, 2007

Is There a Statistics/Economics Divide?

OK, so now that I have a job, I feel like I can stick my foot in something smelly to see what happens. When I was on the market this past year, I was often asked about the difference (lawyers are always careful to ask about "the difference, if any") between a degree in statistics and a degree in something more "traditional" for a law scholar, such as economics or political science or sociology. Because of the prevalence and power of the Law & Economics movement in legal scholarship, there was particular interest in the difference between statistics and economics/econometrics. I had a certain amount of trouble answering the question. It was easy to point out that the best quantitative empiricists move within all fields and are able to read all literatures. As an aspiring statistician, it was also easy to give the statistical version of things, which is that statisticians invent data analysis techniques and methods that, after ten to twenty-five to forty years, filter into or are reinvented by other fields (whenever I said this, I clarified that this story was a caricature).

So what is the difference between an empirical, data-centered economist and an applied statistician? The stereotypes I've internalized from hanging out in an East Coast statistics department are that economists tend to focus more on parameter estimation, asymptotics, unbiasedness, and paper-and-pencil solutions to problems (which can then be implemented via canned software like Stata), whereas applied statisticians lean more towards imputation and predictive inference, Bayesian thinking, and computational solutions to problems (which require programming in packages such as R). Anyone care to disabuse me of these notions?

Posted by James Greiner at 12:07 PM

May 1, 2007

Racial bias in basketball?

The New York Times has an article discussing a working paper by Justin Wolfers and Joseph Price, looking at the rate at which white referees call fouls on black players (and black referees call fouls on white players). The paper can be found here. I haven't had a chance to read it yet, but if it uses "multivariable regression analysis" as it says in the Times article, then I'm sure it must be good.

Posted by Mike Kellermann at 11:21 PM

April 18, 2007

Appellate Cases and SUTVA Violations

Around a month ago, I blogged about the dangers of using appellate case outcomes as datapoints. The basic idea is that most models or inference structures assume some kind of independence among the units, perhaps independence given covariates (in which case the residuals are assumed to be i.i.d.), or perhaps the "Stable Unit Treatment Value Assumption" in the causal inference context. When applied to appellate cases in the United States legal system, these analyses assume away precedent. The instincts I developed as a practicing litigator tell me not to believe a study that assumes away precedent.

One solution to this problem previously proposed in the causal inference literature is to match "treated" and "control" appellate cases that are very close in time to each other (whatever "treated" and "control" are here). After a conversation I had with Mike Kellermann a week or so ago, I think this cure may be worse than the disease. The idea behind comparing cases very close in time to one another is that the general state of the law (in part defined by precedent) for the two cases will be similar. That's right, but recent developments in the law are more on the minds of judges.

Suppose Case A got treatment, and Case B got control. If the matching algorithm has worked, Case A and Case B will be similar in all ways except the treatment. If Case A and Case B are also close in time to one another, how plausible is it that the judges who decide both will decide them without regard to each other?

Posted by James Greiner at 4:48 PM

April 11, 2007

Why I wish TV news was really boring

Amy Perfors

I've posted before about the various ways that the mass media of today interacts badly with cognitive heuristics people use, in such a way as to create apparently irrational behavior. Spending a fair amount of time recently standing in long security lines at airports crystallized another one for me.

The availability heuristic describes people's tendency to judge that events that are really emotionally salient or memorable are more probable than events that aren't, even if the ones that aren't are actually statistically more likely. One classic place you see this is in estimates of risk of dying in a terrorist attack: even though the odds are exceedingly low of dying this way (if you live in most countries, at least), we tend to spend far more resources, proportionally, fighting terror than in dealing with more prosaic dangers like automobile accidents or poverty. There might be other valid reasons to spend disproportionately -- e.g., terrorism is part of a web of other foreign-policy issues that we need to focus on for more long-term benefits; or people don't want to sacrifice the freedoms that would be necessary (like more restrictive speed limits) to make cars safer; or it's not very clear how to solve some problems (like poverty) -- and I really don't want to get into those debates -- the point is just that I think most everyone would agree that in all of those cases, at least part of the reason for the disproportionate attention is because dying in a terrorist attack is much more vivid and sensational than dying an early death because of the accumulated woes of living in poverty. And there's plenty of actual research showing that the availability heuristic plays a role in many aspects of prediction.

There's been a lot of debate about whether this heuristic is necessarily irrational. Evolutionarily speaking, it might make a lot of sense to pay more attention to the more salient information. To steal an example from Gerd Gigerenzer, if you live on the banks of a river and for 1000 days there have been no crocodile sightings there, but yesterday there was, you'd be well-advised to disregard the "overall statistics" and keep your kids from playing near the river today. It's a bit of a just-so story, but a sensible one, from which we might infer two possible morals: (a) as Steven Pinker pointed out, since events have causal structure, it might make sense to pay more attention to more recent ones (which tend to be more salient); and (b) it also might make sense to pay more attention to emotionally vivid ones, which give a good indication of the "costs" of being wrong.

However, I think the problem is that when we're talking about information that comes from mass media, neither of these reasons applies as well. Why? Well, if your information doesn't come from mass media, to a good approximation you can assume that the events are statistically representative of the events that you might be likely to encounter. If you get your information from mass media, you cannot assume this. Mass media reports on events from all over the world in such a way that they can have the same vividness and impact as if they were in the next town over. And while it might be rational to worry a lot about crime if you consistently have shootings in your neighborhood, it doesn't make as much sense to worry about it if there are multiple shootings in cities hundreds of miles away. Similarly, because mass media reports on news - i.e., statistically rare occurrences - it is easy to get the dual impression that (a) rare events are less rare than they actually are; and (b) there is a "recent trend" that needs to be paid attention to.

In other words, while it might be rational to keep your kids in if there were crocodile attacks at the nearby river yesterday, it's pretty irrational to keep them in if there were attacks at the river a hundred miles away. Our "thinking" brains know this, but if we see those attacks as rapidly and as vividly as if they were right here -- i.e., if we watch them on the nightly news -- then it's very hard to listen to the thinking brain... even if you know about the dangers. And cable TV news, with its constant repetition, makes this even harder.

The source of the problem is the sampling structure of mass media, but it's of course far worse if the medium makes the message more emotional and vivid. So there's probably much less of a problem if you get most of your news from written sources -- especially multiple different ones -- than from TV news. That's what I would guess, at least, though I don't know if anyone has actually done the research.

Posted by Amy Perfors at 3:11 PM

April 10, 2007

What determines which statistical software you use?

I was recently involved in a discussion among fellow grad students about what determines which statistical software package people use to analyze their data. For example, this recent market survey lists 44 products selected from 31 vendors, and it does not even include packages like R that many people around Harvard seem to use. Another survey conducted by Alan Zaslavsky lists 15 packages while ‘just’ looking at the available software for the analysis of surveys with complex sample designs. So how do people pick their packages given the plethora of options? Obviously, many factors will go into this decision (departmental teaching, ease of use, type of methods used, etc.). One particularly interesting factor in our discussion concerned the importance of academic discipline. It seems to be the case that different packages are popular in different disciplines. But how exactly usage patterns vary across fields remains unclear. We wondered whether any systematic data exists on this issue. For example, how many political scientists use R compared to other programs? What about statisticians, economists, sociologists, etc.? Any information would be highly appreciated.

Posted by Jens Hainmueller at 10:12 PM

April 4, 2007

Trial-Level Criminal Outcomes

With a coauthor, I am involved in a project which in part attempts to assess the effect of assigning judge A versus judge B on outcomes at the trial level in criminal cases. I've begun a literature search on this, and it seems like most attention thus far has focused on the sentencing stage (particularly relating to the controversy over the federal sentencing guidelines), and that few authors have used what one might call modern or cutting edge causal inference thinking. Can anyone out there help here? Am I missing important studies?

(Feel free to email me off-blog if you'd prefer.)

Posted by James Greiner at 3:24 PM

CCCSN - Devon Brewer

The Cambridge Colloquium on Complexity and Social Networks is sponsoring a talk tomorrow that may be of some interest to readers of this blog. Details below:

"Taking Person, Place, and Time Seriously in Infectious Disease Epidemiology and
Diffusion Research"

Devon D. Brewer, University of Washington

Thursday, April 5, 2007
12:00 - 1:30 p.m.
CGIS North, 1737 Cambridge Street, Room N262

Abstract: Social scientists and field epidemiologists have long appreciated the role of social networks in diffusion processes. The cardinal goal of descriptive epidemiology is to examine "person, place, and time" in relation to the occurrence of disease or other health events. In the last 20 years, most infectious disease epidemiologists have moved away from the field epidemiologist's understanding of transmission as embedded in contact structures and shaped by temporal and locational factors. Instead, infectious disease epidemiologists have employed research designs that are best suited to studying non-infectious chronic diseases but unable to provide meaningful insight on transmission processes. A comprehensive and contextualized infectious disease epidemiology requires assessment of person (contact structure and individual characteristics), place, and time, together with measurement of specific behaviors, physical settings/fomites, and the molecular biology of pathogens, infected persons, and susceptible persons. In this presentation, I highlight examples of research that include multiple elements of this standard. From this overview, I show in particular how the main routes of HIV transmission in poor countries remain unknown as a consequence of inappropriate design in epidemiologic research. In addition, these examples highlight how diffusion research in the social sciences might be improved with greater attention to temporal and locational factors.

Devon D. Brewer, Ph.D., Director, has broad training and experience in the social and health sciences. Much of his past research has focused on social networks, research methods and design, memory and cognition, drug abuse, violence, crime, sexual behavior, and infectious disease (including sexually transmitted diseases, HIV, and hepatitis C). He earned his bachelor's degree in anthropology from the University of Washington and his doctorate in social science from the University of California, Irvine. Prior to founding Interdisciplinary Scientific Research, Dr. Brewer held research positions at the University of Washington, an administrative position with Public Health-Seattle and King County, and teaching positions at the University of Washington, Pacific Lutheran University, and Tulane University. He has been a principal investigator on federal research grants and authored/co-authored more than 60 scientific publications.

Posted by Mike Kellermann at 11:31 AM

April 2, 2007

Applied Statistics - Richard Berk

This week, the Applied Statistics Workshop will present a talk by Richard Berk, professor of criminology and statistics at the University of Pennsylvania. Professor Berk received his Ph.D. from Johns Hopkins University and served on the faculties of Northwestern, UC-Santa Barbara and UCLA before moving to Penn in 2006. He has published widely in journals in statistics and criminology. His research focuses on the application of statistical methods to questions arising in the criminal justice system. One of his current projects is the development and application of statistical learning procedures to anticipate failures on probation or parole and to forecast crime “hot spots” a week in advance.

Professor Berk will present a talk entitled "Counting the Homeless in Los Angeles County," which is based on joint work with Brian Kriegler and Donald Ylvisaker. Their paper is available through the workshop website. The presentation will be at noon on Wednesday, April 4 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided. An abstract of the paper follows after the jump:

Counting the Homeless in Los Angeles County

Richard Berk
Department of Statistics
Department of Criminology
University of Pennsylvania


Over the past two decades, a variety of methods have been used to count the homeless in large metropolitan areas. In this paper, we report on a recent effort to count the homeless in Los Angeles County. A number of complications are discussed including the need to impute homeless counts to areas of the County not sampled and to take the relative costs of underestimates and overestimates of the number of homeless individuals into account. We conclude that despite their imperfections, the estimated counts provided useful and credible information to the stakeholders involved. Of course, not all stakeholders agreed.

Joint work with Brian Kriegler and Donald Ylvisaker.

Posted by Mike Kellermann at 8:23 AM

March 30, 2007

"That looks cool!" versus "What does it mean?"

Every Sunday, I flip open the New York Times Magazine to the weekly social commentary, "The Way We Live Now," and I check out the accompanying data presentation graphic. First, I think, "That looks cool." Then, for the next several minutes, I wonder, "What does it mean?" I'm usually looking at an illustration like this:

I sat down to write this entry ready to argue that clarity is always more important than aesthetics when communicating with data and that the media needs to be more educated when it comes to data presentation. I still think those things. However, after a little googling, I discovered that Catalogtree (as in "Chart by Catalogtree" in the graphic above) is a Dutch design firm, not a research organization, and I started to wonder whether the Times knowingly prioritizes art over data for these graphics. Maybe communication is not the primary goal. This is, after all, a magazine, including fashion and a serial comic strip along with coverage of political and social issues.

How should a publication balance illustration and information? If I belong to a statistics department, am I allowed to say, "That looks cool!" and not point out that a chart is indecipherable? My gut reaction is that information should always win, but maybe I'm wrong - and I do like the designs. You can see some of Catalogtree's other creations for the Times here and their other work here.

Posted by Cassandra Wolos at 1:49 PM

March 28, 2007

The singular of data is anecdote

Amy Perfors

This post started off as little more than some amusing wordplay brought on by the truism that "the plural of anecdote is not data". It's a sensible admonition -- you can't just exchange anecdotes and feel like that's the equivalent of actual scientific data -- but, like many truisms, it's not necessarily true. After all, the singular of data is anecdote: every individual datapoint in a scientific study constitutes an anecdote (though admittedly probably a quite boring one, depending on the nature of your study). A better truism would therefore be more like "the plural of anecdote is probably not data", which of course isn't nearly as catchy.

The post started that way, but then I got to thinking about it more and I realized that the attitude embodied by "the plural of anecdote is not data" -- while a necessary corrective in our culture, where people far more often go too far in the other direction -- isn't very useful, either.

A very important caveat first: I think it's an admirable goal -- definitely for scientists in their professional lives, but also for everyone in our personal lives -- to as far as possible try to make choices and draw conclusions informed not by personal anecdote(s) but rather by what "the data" shows. Anecdote is notoriously unreliable; it's distorted by context and memory; because it's emotionally fraught it's all too easy to weight anecdotes that resound with our experience more highly and discount those that don't; and, of course, the process of anecdote collection is hardly systematic or representative. For all of those reasons, it's my natural temptation to distrust "reasoning by anecdote", and I think that's a very good suspicion to hone.

But... but. It would be too easy to conclude that anecdotes should be discounted entirely, or that there is no difference between anecdotes of different sorts, and that's not the case. The main thing that turns an anecdote into data is the sampling process: if attention is paid to ensuring not only that the source of the data is representative, but also that the process of data collection hasn't greatly skewed the results in some way, then it is more like data than anecdote. (There are other criteria, of course, but I think that's a main one).

That means, though, that some anecdotes are better than others. One person's anecdote about an incredibly rare situation should properly be discounted more than 1000 anecdotes from people drawn from an array of backgrounds (unless, of course, one wants to learn about that very rare situation); likewise, a collection of stories taken from the comments of a highly partisan blog where disagreement is immediately deleted -- even if there are 1000 of them -- should be discounted more than, say, a focus group of 100 people carefully chosen to be representative, led by a trained moderator.
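To make the comparison concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than real data: a hypothetical population in which half the people hold some opinion, a blog whose moderator deletes every dissenting comment, and a focus group stood in for by simple random draws.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
TRUE_SHARE = 0.5        # hypothetical share of the population holding the opinion

def draw_opinion():
    """Return True if a randomly drawn person holds the opinion."""
    return rng.random() < TRUE_SHARE

# 1000 stories from a blog where every disagreeing comment is deleted
blog_stories = []
while len(blog_stories) < 1000:
    opinion = draw_opinion()
    if opinion:  # disagreement never survives moderation
        blog_stories.append(opinion)

# 100 people chosen to be representative (assumed here: simple random draws)
focus_group = [draw_opinion() for _ in range(100)]

blog_estimate = sum(blog_stories) / len(blog_stories)
focus_estimate = sum(focus_group) / len(focus_group)
print(f"blog estimate:        {blog_estimate:.2f}")
print(f"focus-group estimate: {focus_estimate:.2f}")
```

By construction the blog estimate is always 1.00, no matter what the true share is, while the focus-group estimate lands near the true share with ordinary sampling noise. Ten times as many anecdotes does not help if the collection process only lets one kind through.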

I feel like I'm sort of belaboring the obvious, but I think it's also easy for "the obvious" to be forgotten (or ignored, or discounted) if its opposite is repeated enough.

Also, I think the tension between the "focus on data only" philosophy on one hand, and "be informed by anecdote" philosophy on the other, is a deep and interesting one: in my opinion, it is one of the main meta-issues in cognitive science, and of course comes up all the time in other areas (politics and policy, personal decision-making, stereotyping, etc). The main reason it's an issue, of course, is that we don't have data about most things -- either because the question simply hasn't been studied scientifically, or because it has but in an effort to "be scientific" the sample has been restricted enough that it's hard to know how well one can generalize beyond it. For a long time most studies in medicine used white men only as subjects; what then should one infer regarding women, or other genders? One is caught between the Scylla of using possibly inappropriate data, and the Charybdis of not using any data at all. Of course in the long term one should go out and get more data, but life can't wait for "the long term." Furthermore, if one is going to be absolutely insistent on a rigid reliance on appropriate data, there is the reductive problem that, strictly speaking, a dataset never allows you to logically draw a conclusion about anything other than itself. Unless it is the entire population, it will always be different than the population; the real question comes in deciding whether it is too different -- and as far as I can tell, aside from a few simple metrics, that decision is at least as much art as science (and is itself made partly on the basis of anecdote).

Another example, one I'm intimately familiar with, is the constant tension in psychology between ecological and external validity on the one hand, and proper scientific methodology on the other. Too often, increasing one means sacrificing the other: if you're interested in categorization, for instance, you can try to control for every possible factor by limiting your subjects to undergrad students in the same major, testing everyone in the same blank room at the same time of day, creating stimuli consisting of geometric figures with a clear number of equally-salient features, randomizing the order of presentation, etc. You can't be completely sure you've removed all possible confounds, but you've done a pretty good job. The problem is that what you're studying is now so unlike the categorization we do every day -- which is flexible, context-sensitive, influenced by many factors of the situation and ourselves, and about things that are not anything like abstract geometric pictures (unless you work in a modern art museum, I suppose) -- that it's hard to know how it applies. Every cognitive scientist I know is aware of this tension, and in my opinion the best science occurs right on the tightrope - not at the extremes.

That's why I think it's worth pointing out why the extreme -- even the extreme I tend to err on -- is best avoided, even if it seems obvious.

Posted by Amy Perfors at 10:06 AM

March 26, 2007

Applied Statistics - Spring Break

As many of you know, Harvard is on spring break this week, so the Applied Statistics Workshop will not meet. Please join us next Wednesday, April 4, for a presentation by Professor Richard Berk of the University of Pennsylvania. And for those of you at Harvard, enjoy some time off (or at least some time without students!).

Posted by Mike Kellermann at 8:19 AM

March 21, 2007

Efficient Vacationing, Summer 2007

With the ice melting and the birds chirping, it’s time again to plan the summer. Here are a few worthwhile reasons not to be stuck behind your desk all summer. Maybe these are not the most exotic events and locations, but at least they are ‘productive’ and you won’t feel guilty for being away.

The Michigan Summer Institute in Survey Research Techniques runs several sessions over a total of eight weeks from June 4 to July 27. The courses are mainly about designing, writing and testing surveys, and analyzing survey data. The level of the courses differs but they have some advanced courses on sampling and analysis. Because of a modular setup, it's possible to pick and choose broadly. I've heard good things about this institute, particularly from people who want to collect their own data.

Also in Michigan is the Summer Program in Quantitative Methods of Social Research which runs two sessions from June 25 to August 17. This program focuses on analytics and also caters for different levels of sophistication. I only know a few people who attended this program, with mixed reviews. Much seems to depend on what courses you actually take, some are great and others so-so.

The University of Chicago hosts this year’s Institute on Computational Economics from July 30 to August 9. The topics are quite advanced and focus on programming approaches to economic problems. This seems to be quite worthwhile, if it's your interest.

Further afield is the Mannheim Empirical Research Summer School from July 8 – 20. This event focuses on analysis of household data but also features sessions on experiment design and behavioral economics. I haven't heard anything about previous schools but would be curious to find out.

There are other summer schools that don’t have a strong methods focus. Harvard, LSE and a host of other universities offer a number of courses that might provide a quick dip into some of the substantive topics.

Posted by Sebastian Bauhoff at 6:19 PM

March 20, 2007

Judicial Decisions as Data Points

Empirical, particularly quantitative empirical, scholarship is all the rage these days in law schools. (By the way, as a quantitative legal empiricist, that makes me really nervous. If there's one constant in legal academia, it's that things go in and out of style as fast in law schools as they do in Milan fashion shows.)

One thing that has been bothering me lately about this next phase, new wave, dance craze aspect of legal scholarship is the use of appellate cases as datapoints. It's tempting to think that one can code appellate decisions or judicial opinions pursuant to some neutral criteria, then look for trends, tease out inferences of causation, etc. Here's a note of caution: they're not i.i.d. They're probably not i.i.d. given X (whatever X is). Precedent matters. In our legal system, the fact that a previous appellate case (with a published opinion) was decided a certain way is a reason to decide a subsequent, facially similar appellate case the same way, even if the first decision might have been (arguably) wrong. Folks will argue over how much precedent matters; all I can say is that as a law clerk to an appellate judge, I participated in numerous conversations that resulted in the sentiment, "I might/would have decided the present case differently had Smith v. Jones not been on the books, but I see no grounds for departing from the reasoning of Smith v. Jones here." I.i.d. models, or analyses that assume non-interference among units, should be viewed with great caution in this setting.

Posted by James Greiner at 4:40 PM

March 18, 2007

Three-way ties and Jeopardy: Or, Drew questions the odds

It's been in the news that a three-way tie happened on Jeopardy on Friday night. From the AP article:

The show contacted a mathematician who calculated the odds of such a three-way tie happening — one in 25 million.

I have to believe that the mathematician contacted didn't have all the facts (and the AP rushed to meet deadline), because once you're in Final Jeopardy there's little randomness about it. It's all down to game theory.

Suppose we first estimate the odds that all three players are tied at the end of Double Jeopardy. The total dollar value shared by all three is around $30000, or about $10000 each. Since questions have dollar values which are multiples of $200, we could reasonably assume that there are 100 dollar values, between 0 and 20000, where each player can end up. So the odds of a tie at this stage should be no more than one in a million - and this is a very conservative guess, since I assume that the probabilities are all equal (whereas they would likely have a central mode around 10000.)

Breaking a three-way tie with a Final Jeopardy question would then require that all three players bet the same amount, and I think the odds are considerably less than 1 in 20 that they'd all bet the farm no matter the category.

But it shouldn't even get that far. The scenario on Friday night had two players tied behind the leader who didn't have a runaway. So we have somewhere around 1 in 20,000 odds that this would happen (the factor of two because the third player could be ahead or behind the tied players.)

The runners-up would both be highly likely to bet everything in order to get past the leader. And the leader, in this case, placed a tying bet for great strategic reasons - getting one more day against known opposition rather than taking the chance of a new superstar appearing the next day - as well as a true demonstration of giving away someone else's money to appear magnanimous.

Even if the leader only had a 10% chance of making that call, and given that the other two players were pressured to bet high, that's still 1 in 200,000 - over 100 times more likely with a fairly conservative estimation process.
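For anyone who wants to check that last step, here is a minimal sketch of the arithmetic. It uses only the figures quoted above and, as an assumption carried over from the argument, treats the two runners-up betting everything as certain; the variable names are just illustrative.

```python
from fractions import Fraction

# Figures taken from the post; nothing here is new data.
p_two_tied_behind_leader = Fraction(1, 20_000)  # estimated odds of Friday's setup
p_leader_bets_to_tie = Fraction(1, 10)          # the "10% chance" of the tying bet
p_quoted = Fraction(1, 25_000_000)              # the mathematician's figure from the AP

# Assumption: both runners-up bet everything with probability 1.
p_estimate = p_two_tied_behind_leader * p_leader_bets_to_tie
ratio = p_estimate / p_quoted

print(f"estimated odds: 1 in {p_estimate.denominator:,}")
print(f"times more likely than quoted: {ratio}")
```

Exact fractions avoid any floating-point rounding: the product is 1 in 200,000, which is 125 times the quoted 1 in 25 million.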

Posted by Andrew C. Thomas at 11:14 PM

March 14, 2007

Who makes a good peer reviewer?

Amy Perfors

One of the interesting things about accruing more experience in a field is that as you do so, you find yourself called upon to be a peer reviewer more and more often (as I'm discovering). But because I've never been an editor, I've often wondered what this process looks like from that perspective: how do you pick reviewers? And what kind of people tend to be the best reviewers?

A recent article in the (open-access) journal PLoS Medicine speaks to these questions. Even though it's in medicine, I found the results somewhat interesting for what they might imply or predict about other fields as well.

In a nutshell, this study looked at 306 reviewers from the journal Annals of Emergency Medicine. Each of the 2,856 reviews (of 1,484 separate manuscripts) had been rated by the editors of the journal on a five-point scale (1=worst, 5=best). The study simply tried to identify what characteristics of the reviewers could be used to predict the effectiveness of the review. The basic finding?

Multivariable analysis revealed that most variables, including academic rank, formal training in critical appraisal or statistics, or status as principal investigator of a grant, failed to predict performance of higher-quality reviews. The only significant predictors of quality were working in a university-operated hospital versus other teaching environment and relative youth (under ten years of experience after finishing training). Being on an editorial board and doing formal grant (study section) review were each predictors for only one of our two comparisons. However, the predictive power of all variables was weak.

The details of the study are somewhat helpful for interpreting these results. When I first read that younger was better, I wondered to what extent this might simply be because younger people have more time. After looking at the details, I think this interpretation, while possible, is doubtful: the youngest cohort were defined as those that had less than ten years of experience after finishing training, not those who were largely still in grad school. I'd guess that most of those were on the tenure-track, or at least still in the beginnings of their career. This is when it's probably most important to do many many things and be extremely busy: so I doubt those people have more time. Arguably, they might just be more motivated to do well precisely because they are still young and trying to make a name for themselves -- though I don't know how big of a factor it would be given the anonymity of the process: the only people you're impressing with a good review are the editors of the journals.

All in all, I'm not actually that surprised that "goodness of review" isn't correlated with things such as academic rank, training in statistics, or being a good PI: not that those things don't matter, but my guess would be that nearly everyone who's a potential reviewer (for what is, I gather, a fairly prestigious journal) would have sufficient intelligence and training to be able to do a good review. If that's the case, then the best predictors of reviewing quality would come down to more ineffable traits like general conscientiousness and motivation to do a good review... This interpretation, if true, implies that a good way to generate better reviews is not to just choose big names, but rather to make sure people are motivated to put the time and effort into those reviews. Unfortunately, given that peer review is largely uncredited and gloryless, it's difficult to see how best to motivate them.

What do you all think about the idea of making these sorts of rankings public? If people could put them on their CV, I bet there would suddenly be a lot more interest in writing good reviews... at least for the people for whom the CV still mattered.

Posted by Amy Perfors at 6:45 PM

March 13, 2007

Which Color for your Figure?

Ever wondered what the best color for your graphs would be? While color is common in graphs in the sciences, it may be fair to say that its use is still under-appreciated in many social science fields. Color can be a very effective tool to visualize data in many forms, because color is essentially a 3-d concept:

- hue (red, green, blue)
- value/lightness: (light vs. dark)
- saturation/chroma (dull vs. vivid)

From my limited understanding of this topic, not much scientific knowledge exists about how color is best used. However, a few general principles have emerged from the literature. For example, sequential information (ordering) is often best indicated through distinction in lightness. The tricky part here is that indicating sequence with colors requires the viewer to remember the color ordering. A small number of colors should be used. One principle that is sometimes advocated is the use of a neutral color midpoint, which makes sense when there is a "natural" midpoint in the data. If so, you may want to distinguish above and below the midpoint, and use dark color1 -> light color1 -> white -> light color2 -> dark color2 (e.g., dark blue to dark red). If no natural midpoint exists, one option is to use a single hue and just vary lightness (e.g., white/pink to dark red). Another idea is that categorical distinctions are best indicated through hue (e.g., red=higher than average, blue=lower than average). Read Edward Tufte and the cites therein for more ideas on the use of color. In addition, a nice online tool that helps you choose color in a principled way is ColorBrewer, a website definitely worth a visit. Many of the color schemes advocated there are also available in R through the RColorBrewer package. Good luck!
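To make the two schemes concrete, here is a minimal Python sketch that builds a single-hue sequential palette and a diverging palette with a white midpoint. The function names and the particular RGB values are illustrative assumptions, not anything taken from ColorBrewer, and plain linear interpolation in RGB is only a rough stand-in for the perceptually tuned schemes that ColorBrewer publishes.

```python
# Sketch only: linear RGB interpolation, not ColorBrewer's tuned palettes.
# All names and color values below are hypothetical choices for illustration.

WHITE = (255, 255, 255)
DARK_RED = (139, 0, 0)
DARK_BLUE = (0, 0, 139)


def lerp(a, b, t):
    """Blend RGB triple a toward b; t=0 gives a, t=1 gives b."""
    return tuple(round(x + (y - x) * t) for x, y in zip(a, b))


def to_hex(rgb):
    """Format an RGB triple as a #rrggbb string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def sequential(n, light, dark):
    """n colors of one hue, ordered from light to dark (no natural midpoint)."""
    if n < 2:
        raise ValueError("need at least two colors")
    return [to_hex(lerp(light, dark, i / (n - 1))) for i in range(n)]


def diverging(n, low, mid, high):
    """n colors (n odd) running low -> mid -> high around a neutral midpoint."""
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be odd and at least 3")
    half = n // 2
    left = [lerp(low, mid, i / half) for i in range(half)]
    right = [lerp(mid, high, i / half) for i in range(1, half + 1)]
    return [to_hex(c) for c in left + [mid] + right]


if __name__ == "__main__":
    print(sequential(5, WHITE, DARK_RED))
    print(diverging(5, DARK_BLUE, WHITE, DARK_RED))
```

The diverging palette puts the neutral color exactly at the middle position, so it only makes sense when the data's natural midpoint maps to that position.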

Posted by Jens Hainmueller at 11:14 PM

March 7, 2007

More on Cheating

In my last post, I solicited comments on ways to cheat when using a design-before-analysis framework for analyzing observational studies. My claim was that if one does the hard work of distinguishing intermediate outcomes from covariates (followed usually by discarding the former) and of balancing the covariates (often done by discarding non-comparable observations) without access to the outcome variable, it should be hard(er) to cheat. Felix suggested one way that should work but that should also be fairly easy to spot: temporarily substitute in a "good" (meaning highly predictive of the outcome variable) covariate as the outcome and find a design that achieves the desired result, then use this design with the "real" outcome. In a comment, Mike suggested another way: do honest observational studies, but don't tell anyone about those that don't come to desired results.

Here's my thought: in many observational settings, we have a strong prior that there is either an effect in a particular direction or no effect at all. In an anti-discrimination lawsuit, for example, the issue is whether the plaintiff class is suffering from discrimination. There is usually little chance (or worry) that the plaintiff class is in fact benefiting from discrimination. Thus, the key issue is whether the estimated causal effect is statistically (and practically/legally) significant. With that in mind, it seems like a researcher might be able to manipulate the distance metric essential to any balancing process. When balancing, we have to define (a) a usually one-dimensional distance metric to decide how close observations are to one another, and (b) a cutoff point beyond which we say observations are too far from one another to risk inference, in which case we discard the offending observations. If one side of a debate (e.g., the defendant) has an interest in results that are not statistically significant, that side can insist on distance metrics and cutoff points that result in discarding (as too far away from their peers) a great many observations. A smaller number of observations generally means less precision and a lower likelihood of a significant result. The other side can, of course, do the opposite.
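A small Python sketch of the cutoff mechanism, under assumptions that are purely illustrative: one covariate x, absolute difference in x as the distance metric, nearest-neighbor matching with replacement, simulated data, and a naive standard error that ignores the reuse of controls. The function names are hypothetical. The only point is that the cutoff (caliper) controls how many treated observations survive.

```python
import math
import random
import statistics


def make_data(n=300, seed=0):
    """Simulated (x, y) pairs; treated units have higher x and a 0.3 effect on y."""
    rng = random.Random(seed)
    treated, controls = [], []
    for _ in range(n):
        x = rng.gauss(0.5, 1.0)
        treated.append((x, x + 0.3 + rng.gauss(0, 1)))
        x = rng.gauss(0.0, 1.0)
        controls.append((x, x + rng.gauss(0, 1)))
    return treated, controls


def caliper_match(treated, controls, caliper):
    """Match each treated unit to its nearest control on x (with replacement).

    Treated units whose nearest control is farther away than `caliper`
    are discarded. Returns the list of treated-minus-control outcome gaps.
    """
    gaps = []
    for x_t, y_t in treated:
        x_c, y_c = min(controls, key=lambda c: abs(c[0] - x_t))
        if abs(x_c - x_t) <= caliper:
            gaps.append(y_t - y_c)
    return gaps


def summarize(gaps):
    """Number of matched pairs, mean gap, and a naive standard error."""
    n = len(gaps)
    if n < 2:
        return n, None, None
    return n, statistics.mean(gaps), statistics.stdev(gaps) / math.sqrt(n)


if __name__ == "__main__":
    treated, controls = make_data()
    for caliper in (0.5, 0.05, 0.005):
        print(caliper, summarize(caliper_match(treated, controls, caliper)))
```

Because matching here is with replacement, shrinking the caliper can only drop treated units, never add them, which is exactly the lever a party hoping for an insignificant result would pull.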

I still think we're way better off in this world than in the model-snooping of regression. What do people think?

Posted by James Greiner at 4:53 PM

March 6, 2007

More Tools for Research

It’s been a while since Jens and I summarized some useful tools for research. Since then more productivity tools have appeared that make life easy for researchers. Some of the following might only work for Harvard affiliates but maybe your outfit offers something similar.

First, Harvard offers a table of contents service. After signing up you can request to receive the table of contents of most journals that Harvard Libraries carries. The handy part is a “Find it @ Harvard” button next to each article; clicking it takes you to the article through the library's account so that you have full access. This service also allows you to manage all journal subscriptions through only one account. (It is best to have the service email you the TOC as an attachment, as in-text tables occasionally get cut off. Also, your spam filter might intercept those emails, so check there if you don’t receive anything.)

Second, Harvard provides a new toolbar for the Firefox browser called LibX (see here). This provides quick links to Harvard’s e-tools (citation index, e-resources, etc.), lets you search in the Hollis catalog and provides a drag-and-drop field for Google Scholar. If you’re on a journal website without having gone through Harvard libraries, LibX allows you to reload the restricted webpage via Harvard to access the full-text sources. Another nice feature is that LibX embeds cues in webpages. For example, if you have installed the tool and are looking at a book on Amazon, you will notice a little Harvard shield on the page. Clicking it takes you straight to the book’s entry in Hollis. LibX also provides automatic links to print and e-resources for ISBNs, DOIs and other identifiers.

There are other useful tools for Firefox. I recently discovered the ScrapBook add-on which essentially works like bookmarks, but allows you to store only the part of a web page you’re interested in. Simply select the part and store it in your scrapbook. You can then access it offline and also comment or highlight. You can sort and import/export items too. A further useful built-in function uses search keywords in Firefox. This allows you to access a search box on any website through a user-defined keyword. For example you can define ``gs'' as keyword for the search box on the Google Scholar website. Then entering ``gs'' and a search term in the location bar in Firefox takes you straight to the search results for that term. If you use Google Scholar through your library you'll even get full access to the articles straight away.
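For the curious, the keyword trick boils down to substituting the search term into a URL template that contains a `%s` placeholder. The Python sketch below mimics that idea; it is not Firefox's actual implementation, and the `KEYWORDS` table, the `expand` function and the exact URL encoding are assumptions made for illustration.

```python
from urllib.parse import quote_plus

# Hypothetical keyword table: keyword -> URL template with a %s placeholder.
KEYWORDS = {
    "gs": "https://scholar.google.com/scholar?q=%s",
}


def expand(location_bar_text, keywords=KEYWORDS):
    """Turn 'gs some search term' into a search URL, or None if no keyword matches."""
    keyword, _, query = location_bar_text.partition(" ")
    template = keywords.get(keyword)
    if template is None or not query:
        return None
    # URL-encode the search term and drop it into the placeholder.
    return template.replace("%s", quote_plus(query))


if __name__ == "__main__":
    print(expand("gs natural experiments"))
```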

Posted by Sebastian Bauhoff at 7:07 PM

February 27, 2007

Adventures in Identification II: Exposing Corrupt Politicians

Today we continue our voyage in the treasure quest for identification in observational studies. After our sojourn in Spain two weeks ago, the next stopover is in Brazil, where in a recent paper Claudio Ferraz and Frederico Finan discovered a nice natural experiment that allows one to estimate the effect of transparency on political accountability. Many in the policy world are agog over the beneficial impact of transparency on good governance. Yet, empirical studies of this subject are often bedevilled by selection problems for obvious reasons. Ideally, we would like to find a situation in which changes in transparency are randomly assigned, which (also for obvious reasons) tends to be a low probability event. But it does happen. It turns out that in a recent anti-corruption program in Brazil, the federal government randomly audits 60 municipalities every month and then discloses the findings of the report to the municipality and the media. The authors exploit this variation and find that the dissemination of information on corruption, which is facilitated by media, does indeed have a detrimental impact on the incumbent’s electoral performance.

Here is the abstract of the paper:

Exposing Corrupt Politicians: The Effects of Brazil’s Publicly Released Audits on Electoral

This paper examines whether access to information enhances political accountability. Based upon the results of Brazil’s recent anti-corruption program that randomly audits municipal expenditures of federally-transferred funds, it estimates the effects of the disclosure of local government corruption practices upon the re-election success of incumbent mayors. Comparing municipalities audited before and after the elections, we show that the audit policy reduced the incumbent’s likelihood of re-election by approximately 20 percent, and was more pronounced in municipalities with radio stations. These findings highlight the value of information and the role of the media in reducing informational asymmetries in the political process.

Posted by Jens Hainmueller at 12:48 PM

February 23, 2007

Translating Statistics-Speak

I wish we all talked more about how scientific results are translated by the media. Fully understanding the assumptions and limitations of a study is challenging enough for those performing the research. In some ways, the journalists’ job is harder, finding lay language to summarize outcomes and implications without generalizing or ignoring uncertainty. I do not envy them the task.

Byron Calame, the public editor of the New York Times, recently discussed his paper's presentation of a study about marital status. On January 16, the front page read, “51% of Women are Now Living Without Spouse.” Calame’s response noted that in the study, “women” included females aged 15 and older; the Census set the lower bound at 15 to catch all married women. The original article did not call attention to the fact that teenagers living at home were counted as single women.

Apparently, when other journalists pointed out the misleading lack of clarity, some readers felt that they had been deceived. Is the “true” parameter just over 50% or just under? I would argue that the lower age bound set by the census is as reasonable as any. I also think that it doesn’t make much difference whether the percentage of women who are unmarried is a tiny bit over 50 or a tiny bit under (Sam Roberts, who wrote the original article, eventually made the same argument).

Regardless, Calame reports that an executive Times editor plans to spend more time discussing statistical results with colleagues who have expertise in the relevant fields. This seems like a great plan. I wonder how far this idea could be taken – how can researchers best work with journalists to successfully translate results?

A Crimson article published yesterday went so far as to refer to the “basic statistical measures—such as p-values or R-squared values,” or lack thereof, in a study conducted by Philip Morris. And when covering The New England Journal of Medicine’s discussion of stents for heart patients, The Times focused on the fact that some risks are “tough to assess.” This journalistic direction seems promising.

Posted by Cassandra Wolos at 2:01 PM

February 22, 2007

Cheating for Honest People

Let me follow up on yesterday’s post by Jim Greiner.

Jim’s problem: He’s touring the country touting tools for increased honesty in applied statistical research, only to be asked, effectively, for recommendations about using these tools to cheat more effectively. Yay academic job market.

Jim’s example goes like this: An analyst is asked to model the effect of a treatment, T, on the outcome, Y, while controlling for a bunch of confounders, X. To minimize the potential for data dredging we give the analyst only the treatment and the observed potential confounders to model the treatment assignment process, but we withhold the outcome data. Only after the analyst announces success in balancing the data (by including X, functions of X, f(X), deleting off-support observations, etc.) would we communicate the outcome data, plug the outcome in the equation, run it once, and be done.

So how can we help Jim help his audience cheat? Let’s make two assumptions (which I’d be willing to defend with my life). First, although the analyst is not given the actual outcome data, the analyst does know what the outcome is (wages, say). Second, the analyst is permitted to drop elements of X from the analysis, based on his or her analytic judgment.

Now let’s cheat. First, select the covariate, C, from the pool of potential confounders, X, believed to correlate most strongly with the outcome, Y. Second, treat C as the outcome and build a model through data dredging to maximize (or minimize, if this is your objective) the “effect” of T on C. Specifically, find the subset of functions of X, S(f(X)), that maximizes the effect of T on C while maintaining balance in S(f(X)). Third, upon receiving the outcome data, just plug them into the model but “forget” to mention that you didn’t include C in the treatment assignment model. If C really correlates strongly with Y then this procedure should lead to an upwardly biased estimate of T on Y.
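A toy Python simulation of the three steps, under heavy simplifying assumptions that are not part of the scheme as described: treatment is a coin flip with no true effect on anything, C is a single standard-normal covariate, Y is C plus noise, and the "dredging" is replaced by picking, among random half-samples (hypothetical stand-ins for candidate designs), the one with the largest difference in mean C between treated and controls. Real dredging would search over functions of X subject to balance; this sketch only shows the selection-on-a-proxy mechanism.

```python
import random
import statistics


def effect(rows, key):
    """Difference in the mean of `key` between treated and control rows."""
    treated = [r[key] for r in rows if r["t"] == 1]
    control = [r[key] for r in rows if r["t"] == 0]
    return statistics.mean(treated) - statistics.mean(control)


def simulate(n=400, n_designs=200, seed=0):
    """Pick the design that maximizes the T-on-C 'effect', then reuse it for Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.gauss(0, 1)                      # proxy covariate C
        y = c + rng.gauss(0, 0.5)                # outcome Y, strongly tied to C
        rows.append({"t": rng.randint(0, 1), "c": c, "y": y})
    # Candidate "designs": random half-samples (an assumption of this sketch).
    designs = [rng.sample(rows, n // 2) for _ in range(n_designs)]
    # Steps 1-2: dredge using C as the fake outcome.
    chosen = max(designs, key=lambda d: effect(d, "c"))
    # Step 3: plug the real outcome into the chosen design.
    return effect(rows, "y"), effect(chosen, "c"), effect(chosen, "y"), designs


if __name__ == "__main__":
    full_y, chosen_c, chosen_y, _ = simulate()
    print("T on Y, full data:     ", round(full_y, 3))
    print("T on C, chosen design: ", round(chosen_c, 3))
    print("T on Y, chosen design: ", round(chosen_y, 3))
```

Since the true effect of T is zero by construction, any sizable "effect" on Y in the chosen design comes from having selected the design on C.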

I fear that this would work well in practice (though one could construct a counterexample). Seems to me, however, that it would be more technically demanding to cheat in this way than to cheat in, say, standard regression analysis.

Posted by Felix Elwert at 6:42 PM

February 20, 2007

Borat's Effect on Kazakhstan

If you’ve seen it or paid some attention to what’s going on in the popular media in the past six months, you will not have missed the movie ``Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan’’ by Sacha Baron Cohen. The movie went from huge hype to packed movie theatres, and is due out on March 6 on DVD. Some described the movie as ``brilliant’’, for others it was 15 minutes of mediocre jokes drawn out into 82 minutes of film.

Whatever you may think, the government of Kazakhstan certainly took issue. They felt that their country was portrayed in a particularly unfair light, and started an image campaign with advertisements in the New York Times and other news media (see here for an article on that matter by the NYT). But what actually was the impact on Kazakhstan’s image of that movie? Fifteen minutes on Google Trends are suggestive (or frivolous, as Amy suggested).

Here is the timeline of events from Wikipedia: Borat was first screened at some film festivals from July 2006 onwards. It was officially released at the Toronto Film Festival on September 7, 2006 which started the hype. The movie opened in early November in the US, Canada and most European countries. It was number 1 at the US box office for two weeks and only left the top 10 in mid-December.

Here’s a graph of search terms and their associated search volume from Google Trends until November 2006 (you can get this live here and modify as you please). The blue line is the term ``borat movie’’; the red line is ``kazakhstan’’ and the orange line is ``uzbekistan’’ which will serve as (admittedly imperfect) control country. The news reference volume refers to the number of times each topic appeared in Google News.


As you can see, searches for ``borat movie'' take off in September 2006 which coincides with the official release. It spikes in late October before the movie opens at the box office and goes down afterwards. The event B is the announcement of the movie as picked up by Google News. All as expected even if the blips before July are a little strange.

Interestingly the search volume for ``uzbekistan’’ follows that of ``kazakhstan’’ quite well before the movie appears in the spotlight in September. From September onwards the volume for ``kazakhstan’’ somewhat tracks the volume for the movie instead. If you were to look at monthly data you would see that the relationship is not as clear but there does seem to be a trend. So maybe the movie generated some interest in the country.

Here’s another chart for September 2006 (from here). The blue and red lines are as before, but now the orange line is for ``kazakstan’’. It turns out that you can write the name correctly with or without the ``h’’. Maybe people who spell it for the first time would use this version. This search term appears in the search volume just before the movie hits the theaters.


Google Trends gives another hint. If you look at the cities of origin for the searches, you will notice a mix of US/European countries and cities in the second half of 2006. And ``kazakstan’’ is mostly searched by British users. In the first half of the year however almost all searches come from Almaty, the largest city in Kazakhstan.

Now, obviously nothing is causal and proven but it does look interesting. Not only did the search volume on Google shoot up around the time of the introduction of the movie, but also the geographic composition of the searches shifted to where the movie was very popular and the country not well known before Fall 2006.

What does all this mean for Kazakhstan? Is this good or bad publicity? It seems that people became interested in the country beyond the movie (see a USA Today story here). A poll of users of a UK travel website put Kazakhstan in the Top 3 places to visit (right after Italy and the UK if you believe the results), and the Lonely Planet already has an article on the real Kazakhstan ``beyond Borat''. We'll see if those people are really going in the end, and if the trend persists over time as Google supplies more information. But all in all the movie might have generated some useful publicity for the country. Estimating the impact on tourism and world opinion, anyone?

Posted by Sebastian Bauhoff at 1:24 AM

February 14, 2007

Data sharing and visualization

A friend of mine pointed me to this website, Many Eyes. Basically any random person can upload any sort of dataset, visualize the dataset in any number of ways, and then make the results publicly available so that anyone can see them.

The negative, of course, is much the same as with anything that "just anyone" can contribute to: there is a lot of useless stuff, and (if the source of the dataset is uncited) you don't know for sure how valid the dataset itself is. There may be a lot of positives, though: the volume of data alone is like a fantastic dream for many a social scientist; it's a great tool for getting "ordinary people" interested in doing their own research or analysis of their lives (for instance, I noticed some people graphing changes in their own sports performance over time); many of the interesting datasets have ongoing conversations about them; and only time will tell, but I imagine there is at least a chance this could end up being Wikipedia-like in its usefulness.

It may also serve as a template for data-sharing among scientists. Wouldn't it be nice if, every time you published, you had to make your dataset (or code) publicly available? We might already be trending in that direction, but some centralized location for scientific data-sharing sure would speed it along.

Posted by Amy Perfors at 10:24 AM

February 13, 2007

Adventures in Identification I: Voting After the Bomb

Jens Hainmueller

I've decided to start a little series of entries under the header `Adventures in Identification.' The title is inspired by the increasing trend in the social sciences, in particular economics, public health, also political science, sociology, etc. to look for natural or quasi-experiments to identify causal effects in observational settings. Although there are of course plenty of bad examples of this type of study, I think the general line of research is very promising and the rising interest in issues of identification is commendable. Natural experiments often provide the only credible alternative to answer many of the questions we care about in the social sciences, where real experiments are often unethical or infeasible (or both) and observational data usually has selection bias written all over it. Enough said, let's jump right into the material: `Adventures in Identification I: Voting After the Bomb -- a Macabre Natural Experiment in Electoral Politics.'

A recent question in political science and also economics is how terrorism affects democratic elections. Now clearly this seems a fairly tricky question to get some (identification) handle on. Heretic graduate students riding on their Rubin horses around IQSS will tell you two minutes into your talk that you can't just run a regression and call it `causal.' One setting where an answer may be (partly) possible is the case of the Spanish congressional elections in 2004. The incumbent conservative party led by Prime Minister Jose Maria Aznar had been favored to win by a comfortable margin according to opinion polls. On March 11, however, Islamic terrorists deposited nine backpacks full of explosives in several commuter trains in Madrid. The explosions killed 191 people and wounded 1,500. Three days later Spain's socialists under the lead of Jose-Luis Rodriguez Zapatero scored a stunning victory in the elections. Turnout was high and many have argued that voters seemingly expressed anger with the government, accusing it of provoking the Madrid attacks by supporting the U.S.-led war in Iraq, which most Spaniards opposed.

Now the question is how (if at all) the terrorist attacks affected the election result. As usual, only one potential outcome is observed and the crucial question is what the election results would have been like in the absence of the attacks. One could do a simple before and after study imputing this missing potential outcome based on some extrapolated pre-attacks trend in opinion polls. But then the question remains whether these opinion polls are an accurate representation of how people would have voted on election day. A difference-in-differences design seems better suited, but given that the attacks probably affected all voters a control group is hard to come by.

In a recent paper, Jose G. Montalvo actually found a control group. It turns out that at the time the attacks hit, Spanish residents abroad had already cast their absentee ballots. Thus, they were not affected in their decision by the attacks. The author then sets up a diff-in-diffs exploiting voting trends in the treated group (voters residing in Spain) and the control group (Spanish citizens residing in a foreign country). He finds that the attacks had a large effect on the result to the benefit of the opposition party. Interestingly, this result seems to be different from the findings of other simple before and after studies on the topic (although I can't say because I have not read the other papers cited).
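For readers new to the design, the diff-in-diffs estimate is simply the change in the treated group minus the change in the control group. The Python sketch below shows the arithmetic; the vote shares in the usage example are made-up numbers chosen only for illustration and are not taken from Montalvo's paper.

```python
def diff_in_diffs(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)


if __name__ == "__main__":
    # Hypothetical opposition vote shares in percent (NOT from the paper):
    # voters in Spain: 38 in the previous election, 45 in 2004
    # voters abroad:   36 in the previous election, 38 in 2004
    naive_before_after = 45 - 38
    did = diff_in_diffs(38, 45, 36, 38)
    print("before/after estimate:", naive_before_after)
    print("diff-in-diffs estimate:", did)
```

The control group's change is subtracted out as an estimate of what would have happened to the treated group anyway, which is why the estimate rests on the two groups sharing a common trend.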

Of course, the usual disclaimers about DID estimates apply. Differential trends between the groups may exist if foreign residents perceived terrorism differently than Spanish residents over time. Foreign residents are probably very different from Spanish residents. But in defense of the author, the results seem fairly robust given the checks he presents. And hey, it's a tough question to ask and this provides a more appropriate way to get a handle on identifying the counterfactual outcome than simply comparing before and after.

Posted by Jens Hainmueller at 8:00 AM

February 9, 2007

Corruption in the Classroom

In the fall, I mentioned the debate over teaching kids to read using whole language versus phonics methods. The heavily funded Reading First program, part of No Child Left Behind, is intended to promote phonics and relies on research published by the National Reading Panel (which I don’t completely trust, but today that’s beside the point).

The latest is a report by psychologist Louisa Moats claiming that instead of changing their curricula to focus on phonics, reading programs are sprinkling key phonics catchphrases throughout their marketing materials and selling the same old whole language lessons. The press release for Moats’ report contrasted the situation with the F.D.A.’s oversight of drugs. The government authority approves the treatment; companies marketing the treatment rely on public trust in the authority. The difference is that education companies get away with much more than the drug companies ever could.

Reports like this highlight for me the differences in how natural and social science results become policy. I see that medical dishonesty can kill people while the effects of corruption in education are less direct. But how does it happen that New York City public schools spend anti-whole language funding on thinly disguised whole language curricula? What other social programs are subject to this kind of deceit?

Posted by Cassandra Wolos at 9:37 AM

February 7, 2007

Timing Is Everything

Jim Greiner

Per previous blog posts, I'm giving today's presentation at CGIS on causal inference and immutable characteristics. I've previewed some of the ideas from this research in blog posts. Basically, the idea is that if we shift our thinking from "actual" immutable characteristics (e.g., race), a concept I find poorly defined in some situations, to perceived immutable characteristics, then the potential outcomes framework of causation can sometimes be usefully applied to things like race, gender, and ethnicity.

A key point here is the timing of treatment assignment. If treatment is conceptualized in terms of perceptions, then a natural point at which to consider treatment applied is the moment the decision maker whose conduct is being studied first perceives a unit's race, gender, ethnicity, whatever. This works well only if we're willing to exonerate the decision maker from responsibility for whatever happened before that moment of first perception. In the law, sometimes we're willing to do so. Sometimes, we're not.

Take the employment discrimination context. Typically, we don't hold an employer responsible for the discrimination of someone else, particularly when it occurred (say) prior to a job application, even if that prior discrimination means that some groups (e.g., minorities) have less attractive covariates (e.g., educational achievement levels) than others (e.g., whites). Perhaps potential outcomes could work here; a study of the employer's hiring can safely condition on educational achievement levels (i.e., take them as given, balance on them, etc.) and other covariates. More covariates means that the ignorability assumption required for most causal inference is more plausible.

Contrast the employment discrimination setting with certain standards applying to educational institutions. For example, we may not want to allow a university to justify allocating fewer resources to female sports teams on the grounds that its female students show less interest in sports (even if we believed the university to be telling the truth). Here, we might consider that the preferences of the female students were probably shaped by prior stereotyping, and we might want to force the university to take steps to combat those stereotypes and change the female students' preferences. If so, we are unwilling to take the previous social pressure as "given," so we cannot balance on it. The result is fewer covariates and greater pressure on the ignorability assumption.

My thanks to Professor Roderick Hills of NYU law school, whose insightful question during a job talk I recently gave there helped solidify the above Title IX example.

Posted by James Greiner at 4:00 PM

February 6, 2007

Ask why...why, why, why


Posted by Jens Hainmueller at 10:11 PM

Presentation, Presentation (at conferences, that is)

An article by Jane Miller in the current issue of Health Services Research explains strategies for preparing conference posters. As she writes, posters are a "hybrid of a published paper and an oral presentation" and people often fail to recognize this in preparing a poster. The article reviews existing literature on research communication and provides some guidelines on how to present statistical methods and results appropriately. It's all common-sense stuff, but it might come in handy for first-time presenters looking for guidance.

It also goes nicely with Gary's "Publication, Publication" guide for writing research papers which you can find here.

Jane E. Miller (2007) "Preparing and Presenting Effective Research Posters" Health Services Research 42(1p1): 311–328. doi:10.1111/j.1475-6773.2006.00588.x

Posted by Sebastian Bauhoff at 3:10 PM

February 1, 2007

A Rash Of Senicide?

There have been an awful lot of stories lately about the world's oldest person dying; in fact, it seems to have happened about three times in the last month or so. Then again, being the world's oldest person is a dubious honour to be sure, since the winner isn't likely to hold the title for very long and likely isn't even aware of their status. (Full disclosure: my great-grandmother was a centenarian but likely never knew my name.)

These stories have been bouncing in my mind lately and I'm trying to figure out why. I can think of a few scientifically relevant explanations:

1) The life expectancy of a centenarian is on the order of a year, and three successive deaths in a month is a rare event; conditioned on the first one, assuming independence and exponential life span (a reasonable assumption for the tail end), the probability of the next two events coming within a month is roughly 0.0033. And this happened to be the month for it.

2) The events aren't at all rare, and the centenarian death rate is actually dramatically higher, but it's a slow news month, and the stories themselves are floating to the top of the pile.

3) Online news services like Reuters and CNN have dedicated spaces for more `entertaining' and `bizarre' news stories, meaning that no matter how much news there is, people are seeing these stories.

4) Guinness sales are down, despite the "brilliant!" advertising campaign, and the World Record people are seeking out these changing events for the sake of their own discreet advertising.

5) I read this in The Onion and the satire hit me point blank, meaning I'm selecting and remembering the stories more often when they appear.

I'm thinking it's Number 5, but I'd be curious to know if anyone knew the mean centenarian death rate and whether this was a rare occurrence or not.
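For anyone who wants to check the 0.0033 figure in explanation 1, here is a minimal sketch of the arithmetic. It assumes exactly what that item assumes (independent exponential lifespans with a mean of one year); the function name and its parameters are made up for illustration. The time until the next two deaths is then the sum of two exponentials, whose distribution function has a simple closed form.

```python
import math

def prob_two_more_deaths(window_years, mean_lifespan_years=1.0):
    """P(the next two titleholders both die within the window), assuming
    independent exponential remaining lifespans with the given mean."""
    x = window_years / mean_lifespan_years
    # The sum of two i.i.d. exponentials is Erlang(2): CDF = 1 - e^(-x) * (1 + x)
    return 1.0 - math.exp(-x) * (1.0 + x)

p = prob_two_more_deaths(1.0 / 12.0)  # a one-month window
print(round(p, 4))  # prints 0.0033
```

If the true mean remaining lifespan is shorter than a year, the same function gives a larger probability, which is the point of explanation 2.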

Posted by Andrew C. Thomas at 9:56 AM

January 31, 2007

Making bad choices, again

Amy Perfors

Most of us are aware of various distortions in reasoning that people are vulnerable to, mainly because of heuristics we use to make decisions easier. I recently came across an article in Psychological Science called Choosing an inferior alternative that demonstrates a technique that will cause people to choose an alternative that they themselves have previously acknowledged to be personally inferior. This is interesting for two reasons: first of all, exactly how and why it works tells us something about the process by which our brains update (at least some sorts of) information; and second, because I expect commercials and politicians and master manipulators to start using these techniques any day now, and maybe if we know about it in advance we'll be more resistant. One can hope, anyway.

So what's the idea?

It's been known for a while that decision makers tend to slightly bias their evaluations of new data to support whatever alternative is currently leading. For instance, if I'm trying to choose between alternatives A, B, and C -- let's say they are restaurants and I'm trying to decide where to go eat -- when I learn about one attribute, say price, I'll tentatively rank them and decide that (for now) A is the best option. If I then learn about another attribute, say variety, I'll rerank them, but not in the same way I would have if I'd seen those two attributes at the same time: I'll actually bias it somewhat so that the second attribute favors A more than it otherwise would have. This effect is generally only slight, so if restaurant B is much better on variety and only slightly worse on price, I'll still end up choosing restaurant B: but if A and B were objectively about equal, or B was even slightly better, then I might choose A anyway.

Well, you can see where this is going. These researchers presented subjects with a set of restaurants and attributes to determine their objective "favorite." Then, two weeks later, they brought the same subjects in again and presented them with the same restaurants. This time, though, they had determined -- individually, for each subject -- the proper order of attributes that would most favor choosing the inferior alternative. (It gets a little more complicated than this, because in order to try to ensure that the subjects didn't recognize their choice from before, they combined nine attributes into six, but that's the essential idea). Basically, they picked the attribute that most favored the inferior choice and put it first, hoping to establish that the inferior choice would get installed as the leader. The attribute that second-most favored the inferior choice was last, to take advantage of recency effects. The other attributes were presented in pairs, specifically chosen so that the ones that most favored the superior alternative were paired with neutral or less-favorable ones (thus hopefully "drowning them out.")

The results were that when presented with the information in this order, 61% of people chose the inferior alternative. The good news, I guess, is that it wasn't more than 61% -- some people were not fooled -- but it was robustly different than chance, and definitely more than you'd expect (since, after all, it was the inferior alternative, and one would hope you'd choose that less often). Moreover, people didn't realize they were doing this at all: they were more confident in their choice when they had picked the inferior alternative. Even when told about this effect and asked if they thought they themselves had done it, they tended not to think so (and the participants who did it most were no more likely to think they had done it than the ones who didn't).

I always get kind of depressed at this sort of result, mainly because I become convinced that this sort of knowledge is then used by unscrupulous people to manipulate others. I mean, it's probably always been used somewhat subconsciously that way, but making it explicit makes it potentially more powerful. On the plus side, it really does imply interesting things for how we process and update information -- and raises the question of why we bias the leading alternative, given that it's demonstrably vulnerable to order effects. Just to make ourselves feel better about our current choice? But why would this biasing do that - wouldn't we feel best of all if we knew we were being utterly rational the whole time? It's a puzzle.

Posted by Amy Perfors at 10:29 AM

January 30, 2007

The Role of Sample Size and Unobserved Heterogeneity in Causal Inference

Jens Hainmueller

Here is a question for you: Imagine you are asked to conduct an observational study to estimate the effect of wearing a helmet on the risk of death in motorcycle crashes. You have to choose one of two different data-sets for this study: Either a large, rather heterogeneous sample of crashes (these happened on different roads, at different speeds, etc.) or a smaller, more homogeneous sample of crashes (let's say they all occurred on the same road). Your goal is to unearth a trustworthy estimate of the treatment effect that is as close as possible to the `truth', i.e. the effect estimate obtained from an (unethical) experimental study on the same subject. Which sample do you prefer?

Naturally, most people tend to choose the large sample. Larger sample, smaller standard error, less uncertainty, better inference…we’ve heard it all before. Interestingly, in a recent paper entitled "Heterogeneity and Causality: Unit Heterogeneity and Design Sensitivity in Observational Studies" Paul Rosenbaum comes to the opposite conclusion. He demonstrates that heterogeneity, and not sample size, matters for the sensitivity of your inference to hidden bias (a topic we blogged about previously here and here). He concludes that:

“In observational studies, reducing heterogeneity reduces both sampling variability and sensitivity to unobserved bias—with less heterogeneity, larger biases would need to be present to explain away the same effect. In contrast, increasing the sample size reduces sampling variability, which is, of course useful, but it does little to reduce concerns about unobserved bias.”

This basic insight about the role of unit heterogeneity in causal inference goes back to John Stuart Mill’s 1864 System of Logic. In this regard, Rosenbaum’s paper is a nice comparison to Jas’s view on Mill’s methods. Of course, Sir Ronald Fisher dismissed Mill's plea for unit homogeneity because in experiments, when you have randomization working for you, hidden bias is not a real concern so you may as well go for the larger sample.

Now you may say: well it all depends on the estimand, no? Do I care about the effect of helmets in the US as a whole or only on a single road? This point is well taken, but keep in mind that for causal inference from observational data we often care about internal validity first and not necessarily generalizability (most experiments are also done on highly selective groups). In any case, Rosenbaum’s basic intuition remains and has real implications for the way we gather data and judge inferences. Next time you complain about a small sample size, you may want to think about heterogeneity first.

So finally back to the helmet example. Rosenbaum cites an observational study that deals with the heterogeneity issue in a clever way: “Different crashes occur on different motorcycles, at different speeds, with different forces, on highways or country roads, in dense or light traffic, encountering deer or Hummers. One would like to compare two people, one with a helmet, the other without, on the same type of motorcycle, riding at the same speed, on the same road, in the same traffic, crashing into the same object. Is this possible? It is when two people ride the same motorcycle, a driver and a passenger, one helmeted, the other not. Using data from the Fatality Analysis Reporting System, Norvell and Cummings (2002) performed such a matched pair analysis using a conditional model with numerous pair parameters, estimating approximately a 40% reduction in risk associated with helmet use.”
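To see why the same-motorcycle design sidesteps so much heterogeneity, here is a minimal sketch of the simplest matched-pair estimator, the discordant-pair odds ratio. This is a simplification and not the conditional model Norvell and Cummings fitted, and the pair counts below are invented purely for illustration; they are not their data.

```python
def discordant_pair_odds_ratio(pairs):
    """Conditional odds ratio of death for helmeted vs. unhelmeted riders.

    Each pair is (helmeted_rider_died, unhelmeted_rider_died). Pairs in which
    both riders die or both survive carry no information about the helmet and
    drop out, along with everything the two riders share (bike, speed, road).
    """
    helmeted_only = sum(1 for h, u in pairs if h and not u)
    unhelmeted_only = sum(1 for h, u in pairs if u and not h)
    if unhelmeted_only == 0:
        raise ValueError("no pairs in which only the unhelmeted rider died")
    return helmeted_only / unhelmeted_only

# Hypothetical counts, NOT taken from the study.
pairs = (
    [(False, True)] * 100     # only the unhelmeted rider died
    + [(True, False)] * 60    # only the helmeted rider died
    + [(True, True)] * 40     # both died (uninformative)
    + [(False, False)] * 300  # both survived (uninformative)
)
odds_ratio = discordant_pair_odds_ratio(pairs)
print(odds_ratio)  # 0.6 with these made-up counts
```

Only the 160 discordant pairs enter the estimate, so whatever the two riders on one motorcycle have in common cannot bias the comparison.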

Posted by Jens Hainmueller at 8:30 AM

January 26, 2007

Statistical porridge and other influences on the American public

In this past Sunday’s New York Times Book Review, Scott Stossel covers a book by Sarah E. Igo, a professor in the history department at the University of Pennsylvania. The Averaged American – which I haven’t read but plan to pick up soon – discusses how the development of statistical measurement after World War I impacted not only social science, but also, well, the average American. According to the review, Igo argues that statistical groundbreakers like the Gallup poll and the Kinsey reports created a societal self-awareness that hadn’t existed before.

What struck me, though, was the reviewer’s closing comment. Stossel writes, “Even as we have moved toward ever-finer calibrations of statistical measurement, the knowledge that social science can produce is, in the end, limited. Is the statistical average rendered by pollsters the distillation of America? Or its grinding down into porridge? For all of the hunger Americans have always expressed for cold, hard, data about who we are, literary ways of knowing may be profounder than statistical ones.”

Keep in mind that these words come from a literary person immersed in the literary world (specifically, Stossel is the managing editor of The Atlantic Monthly ) and should be understood in context. However, I hope that Stossel and the average American see the value of cold, hard, data handled well. I also think that we as social scientists and statisticians should accept his challenge to keep the porridge limited, the ideas unlimited, and our impact on the national consciousness profound! And maybe we should be a little offended, too.

Posted by Cassandra Wolos at 9:30 AM

January 24, 2007

The Goal of Causal Inference

Jim Greiner

I’ll be giving the talk at the Gov 3009 seminar in early February, and I’ll be presenting a paper I’m writing with Don Rubin on applying the potential outcomes framework of causation to what lawyers call “immutable characteristics” (race, gender, and national origin, for example). I’ll be previewing some of the ideas from this paper on the blog.

One key point from this paper is the recognition that in law (specifically, in an anti-discrimination setting), the goal of causal inference may be different from that in a more traditional social science setting. A sociologist, for example, might study the effect of tax breaks for married couples on marriage rates; the obvious goal of the study is to see whether a contemplated intervention (tax breaks) has a desired effect. An economist might evaluate a job training program for a similar reason. In anti-discrimination law, however, we study the effect of units’ perceived races (or genders or whatever) on some outcome (e.g., hiring or promotion), but we have no interest in intervening to change these perceptions. Rather, we’re contemplating action that would mitigate the effects we find. The “intervention” we’re considering might be compensating the victim of discrimination, as is true in an employment discrimination suit. Or it might be ceasing a certain type of government action, such as the death penalty. But we’re not interested in implementing a policy promoting or effectuating the treatment that we’re studying.

Posted by James Greiner at 1:14 PM

January 16, 2007

Applied Statistics Workshop

The Applied Statistics Workshop will resume for the spring semester on January 31, 2007. We will continue to meet in the CGIS Knafel Building, Room N354 on the third floor at noon on Wednesdays. The Workshop has a new website that has the tentative schedule posted for the semester. We will be moving the archives of papers from the previous semesters to the new site in the coming weeks, so you can track down your favorite talks from years past. As a preview of what's to come, here are the names and affiliations of some of the speakers presenting in the next month:

January 31st
Holger Lutz Kern
Department of Government
Harvard University

February 7th
Jim Greiner
Department of Statistics
Harvard University

February 14th
Alberto Abadie, Alexis Diamond, and Jens Hainmueller
Kennedy School of Government and Department of Government
Harvard University

February 21st
Dan Hopkins
Department of Government
Harvard University

Posted by Mike Kellermann at 5:51 PM

January 9, 2007

Visualization Guide

Courtesy of Aleks at Columbia, who brought this to my attention:

A very interesting collection of visualizations for projects, proposals and presentations. The periodic table arrangement itself is not at all useful, but the depth and organization sure is.

Posted by Andrew C. Thomas at 2:36 PM

December 13, 2006

Applied Statistics – Harrington

This week the Applied Statistics Workshop will present a talk by David Harrington, Professor of Biostatistics at Harvard’s School of Public Health, and in the Department of Biostatistical Science at the Dana Farber Cancer Institute.

Professor Harrington received his Ph.D. from the University of Maryland and taught at the University of Virginia before coming to Harvard. He has served as Principal Investigator on numerous NIH and NSF grants researching topics including Nonparametric Tests for Censored Cancer Data, and Statistical Problems for Markov Branching Processes. His research has appeared in Journal of the American Statistical Association, Biostatistics, Genetic Epidemiology, Journal of Clinical Oncology, and Biometrics among many others.

Professor Harrington is involved in two different lines of research. The first is research in statistical methods for clinical trials and prospective cohort studies in which the time to an event is a primary outcome. He has worked in efficient nonparametric tests and regression methods for right-censored data, sequential designs for clinical trials, and nonparametric methods for estimating nonlinear covariate effects on survival. Recently, he and co-workers in the Department of Biostatistics have been studying methods for analyzing survival data when some covariates have missing observations. Missing data are common in both prospective and retrospective cohort studies, and simply ignoring cases with missing observations can lead to substantial biases in inference.

Dr. Harrington's second line of research, on which he will be presenting, is collaborative research in cancer. He is the principal investigator of the Statistical Coordinating Center for the Cancer Care Outcomes Research and Surveillance (CanCORS) Consortium. This NCI-funded study is a network of sites around the country that are conducting a population-based study of access to and outcomes from cancer care, with special focus on ethnic subgroups and subgroups defined by age.

Professor Harrington will present a talk entitled "Statistical Issues in the Cancer Care Outcomes Research and Surveillance Consortium (CanCORS)." The presentation will be at noon on Wednesday, December 13 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Eleanor Neff Powell at 9:23 AM

December 12, 2006

Better Way To Make Cumulative Comparisons With Small Samples?

On July 15, 1971 the research vessel Lev Berg set sail from Aralsk (Kazakhstan) to survey the Aral Sea, then the 4th largest freshwater lake in the world. The Soviet Union had been steadily draining the Aral for agricultural purposes since the 1950s and the Lev Berg was to measure the ecological damage. This trip included passing by the island Vozrozhdeniye on the South side.

Lev Berg Image
(Image Source: "The 1971 Smallpox Epidemic in Aralsk, Kazakhstan, and the Soviet Biological Warfare Program." Center for Nonproliferation Studies Occasional Paper No. 9, Jonathan B. Tucker and Raymond A. Zilinskas.)

Vozrozhdeniye was an ideal site for the main Soviet bioweapons field testing because it was in a remote area, easily secured as an island, and had reliable winds from the North to the South allowing "safe" testing and housing on the North end. The site was active from 1936 until 1990 when Yeltsin publicly denounced the program and had it shut down. This is despite the Soviet Union having signed the 1972 Biological and Toxin Weapons Convention outlawing such research. Shortly after the Lev Berg returned to Aralsk, there was an unusual outbreak of smallpox there, starting with a young researcher who had been onboard. The following is the best epidemiological data available:

Table Image
Comparison Case: in 1972 a Muslim man from Kosovo went on a pilgrimage to Mecca, returning through Baghdad where he was infected with smallpox. This was the first reported smallpox case in Kosovo since 1930 and it apparently went undiagnosed for six weeks producing 175 cases and 35 deaths. A good comparison since rates of vaccination were similar as were socio-economic conditions.

Kaplan-Meier graph with time-to-event = onset of illness:

Kaplan-Meier Image(Image Source: Ibid.)

Key difference: all three Aralsk deaths were from hemorrhagic smallpox and only five in Kosovo were. The baseline for naturally occurring smallpox: Rao's study in Madras, India had 10,857 cases with only 240 hemorrhagic. Only two possible explanations seem to remain for the differences:
- host conditions (nutrition, genetic resistance, environment) differ greatly.
- Aralsk strain was an unusual type.
Obviously, it would be nice to claim strong evidence that the Soviet case resulted from escaped smallpox. We know the extent of the bioweapons program from Yeltsin's opening of the files, but not the responsibility of this dissemination with 100% certainty.
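As a baseline for the discussion, here is a minimal sketch of what the usual exact calculation gives for the hemorrhagic share of deaths. It reads the figures above as 3 of 3 Aralsk deaths and 5 of 35 Kosovo deaths being hemorrhagic (that reading is an assumption on my part), and computes a one-sided Fisher-style tail probability from the hypergeometric distribution. The function name is made up for illustration.

```python
from math import comb

def hypergeom_upper_tail(k, n_draws, successes, population):
    """P(X >= k) when n_draws items are drawn without replacement from a
    population that contains `successes` items of the type being counted."""
    total = comb(population, n_draws)
    top = min(n_draws, successes)
    return sum(
        comb(successes, x) * comb(population - successes, n_draws - x)
        for x in range(k, top + 1)
    ) / total

# 38 deaths in all (3 Aralsk + 35 Kosovo), 8 of them hemorrhagic (3 + 5).
# How likely is it that all 3 Aralsk deaths land among the 8 by chance?
p_value = hypergeom_upper_tail(k=3, n_draws=3, successes=8, population=38)
print(round(p_value, 4))  # 56 / 8436, about 0.0066
```

With only three Aralsk deaths this number leans entirely on the fixed-margins assumption, which is part of why exact inference feels unsatisfying in a case like this.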

This is just a motivating (and interesting) example; the real question is about testing really small samples, when exact inference doesn't seem appropriate. So what other approaches would readers suggest for making comparisons with these types of cumulative data besides simple Kaplan-Meier comparisons? Obviously typical correlational analysis won't work (polychoric, multichoric, etc.) and standard tabular approaches are not going to be effective either.

Posted by Jeff Gill at 2:48 PM

December 7, 2006

NIPS highlights

Amy Perfors

I've just spent this week at the annual NIPS conference; though its main focus seems to be machine learning, there are always interesting papers on the intersection of computational/mathematical methods in cognitive science and neuroscience. I thought it might be interesting to mention the highlights of the conference for me - which obviously tends to focus heavily on the cognitive science end of things. (Be aware that links (pdf) are to the paper pre-proceedings, not final versions, which haven't been released yet).

From Daniel Navarro and Tom Griffiths, we have A Nonparametric Bayesian Method for Inferring Features from Similarity Judgments. The problem, in a nutshell, is that if you're given a set of similarity ratings about a group of objects, you'd like to be able to infer the features of the objects from that. Additive clustering assumes that similarity is well-approximated by a weighted linear combination of common features. However, the actual inference problem -- actually finding the features -- has always been difficult. This paper presents a method for inferring the features (as well as figuring out how many features there are) that handles the empirical data well, and might even be useful for figuring out what sorts of information (i.e., what sorts of features) we humans represent and use.
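To make the additive clustering assumption concrete, here is a minimal sketch of the forward direction only: given binary features and feature weights, compute the similarity the model predicts. The feature matrix and weights are hypothetical, and this is not the inference method from the paper, which has to go the other way and recover the features from the similarities.

```python
def additive_similarity(features, weights):
    """Predicted similarity s_ij = sum over k of w_k * f_ik * f_jk:
    two objects are similar to the extent that they share heavily
    weighted features."""
    n = len(features)
    return [
        [
            sum(w * fi * fj for w, fi, fj in zip(weights, features[i], features[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]

# Hypothetical example: three objects, three binary features.
features = [
    [1, 1, 0],  # object 0 has features 0 and 1
    [1, 0, 1],  # object 1 has features 0 and 2
    [0, 1, 1],  # object 2 has features 1 and 2
]
weights = [0.5, 0.3, 0.2]
similarity = additive_similarity(features, weights)
print(similarity[0][1], similarity[0][2], similarity[1][2])
```

Each pair of objects here shares exactly one feature, so each off-diagonal similarity is just that feature's weight. The diagonal entries are computed too, but they are usually not what the model is fit to.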

From Mozer et al. comes Context Effects in Category Learning: An Investigation of Four Probabilistic Models. Some interesting phenomena in human categorization are the so-called push and pull effects: when shown an example from a target category, the prototype gets "pulled" closer to that example, and the prototypes of other related categories get pushed away. It's proven difficult to explain this computationally, and this paper considers four obvious candidate models. The best one uses a distributed representation and a maximum likelihood learning rule (and thus tries to find the prototypes that maximize the probability of being able to identify the category given the example); it's interesting to speculate about what this might imply about humans. The main shortcoming of this paper, to my mind, is that they use very idealized categories; but it's probably a necessary simplification to begin with, and future work can extend it to categories with a richer representation.

The next is work from my own lab (though not me): Kemp et al. present an account of Combining causal and similarity-based reasoning. The central point is that people have developed accounts of reasoning about causal relationships between properties (say, having wings causes one to be able to fly) and accounts of reasoning about objects on the basis of similarity (say, if a monkey has some gene, an ape is more likely to have it than a duck is). But many real-world inferences rely on both: if a duck has gene X, and gene X causes enzyme Y to be expressed, it is likely that a goose has enzyme Y. This paper presents a model that intelligently combines causal- and similarity-based reasoning, and is thus able to predict human judgments more accurately than either of them alone.

Roger Levy and T. Florian Jaeger have a paper called Speakers optimize information density through syntactic reduction. They explore the (intuitively sensible, but hard to study) idea that people -- if they are rational -- should try to communicate in the information-theoretically optimal way: they should try to give more information at highly ambiguous points in a sentence, but not bother doing so at less ambiguous points (since adding information has the undesirable side-effect of making utterances longer). They examine the use of reduced relative clauses (saying, e.g., "How big is the family you cook for" rather than "How big is the family THAT you cook for" - the word "that" is extra information which reduces the ambiguity of the subsequent word "you"). The finding is that speakers choose to reduce the relative clause -- to say the first type of sentence -- when the subsequent word is relatively unambiguous; in other words, their choices are correlated with information density. One of the reasons this is interesting to me is because it motivates the question of why exactly speakers do this: is it a conscious adaptation to try to make things easier for the listener, or a more automatic/unconscious strategy of some sort?

There are a number of other papers that I found interesting -- Chemudugunta et al. on Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model; Roy et al. on Learning Annotated Hierarchies from Relational Data, and Greedy Layer-wise Training of Deep Networks by Bengio et al., to name a few -- so if this sort of thing interests you, I suggest checking out the NIPS proceedings when they come out. And if any of you went to NIPS also, I'd be curious what you really liked and think I should have included on this list!

Posted by Amy Perfors at 4:07 PM

December 6, 2006

Applied Statistics - Imbens and Ridder

This week the Applied Statistics Workshop will present a talk by Guido Imbens, Professor of Economics at Harvard University, and Geert Ridder, Professor of Economics at the University of Southern California.

Professor Imbens has recently rejoined the Department of Economics at Harvard and is one of the faculty sponsors of the Applied Statistics Workshop, so we are delighted that he will be speaking at the Workshop. He received his Ph.D. from Brown University and served on the faculties of Harvard, UCLA, and Berkeley before returning to Harvard. He has published widely, with a particular focus on questions relating to causal inference. Professor Imbens has been the recipient of numerous National Science Foundation grants and teaching awards. His work has appeared in Econometrica, Journal of Econometrics, Journal of the Royal Statistical Society, and Biostatistics among many others.

Geert Ridder is Professor of Economics at the University of Southern California. Before coming to the United States he was Professor of Econometrics at the Rijksuniversiteit Groningen and the Vrije Universiteit in Amsterdam in The Netherlands. In the United States he was Professor of Economics at the Johns Hopkins University and visiting professor at Cornell University, the University of Iowa, and Brown University. He received his Ph.D. from the University of Amsterdam. Professor Ridder’s research area is econometrics, in particular microeconometrics, and its applications in labor economics, public finance, economic development, economic demography, transportation research, and the economics of sports. His methodological interests are the (nonparametric) identification of statistical and economic structures from observed distributions (mainly in duration data and discrete choice data), models and estimation methods for duration data and panel data, (selectively) missing data, causal inference, and errors-in-variables. His work has appeared in Econometrica, Economics of Education Review, Journal of the European Economic Association, and Journal of Econometrics among others.

Professors Imbens and Ridder will present a talk entitled "Complementarity and Aggregate Implications of Assortative Matching: A Nonparametric Analysis." The presentation will be at noon on Wednesday, December 6, in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Eleanor Neff Powell at 10:34 AM

December 5, 2006

Causality in the Social Sciences Anybody?

Funny how there is no section on causal inference in the social sciences here? It says that to meet Wikipedia's quality standards, this article may require cleanup. Hopefully, somebody will find the time to contribute a social science section. Why not you? My guess is that readers of this blog know plenty about this topic...and the current entry is lacking a lot of what statistics has to say about causality.

Posted by Jens Hainmueller at 10:00 AM

November 29, 2006

Applied Statistics - Alan Zaslavsky

This week the Applied Statistics Workshop will present a talk by Alan Zaslavsky, Professor of Health Care Policy (Statistics) in the Department of Health Care Policy at Harvard Medical School. Dr. Zaslavsky's statistical research interests include surveys, census methodology, small area estimation, official statistics, missing data, hierarchical modeling, and Bayesian methodology. His research topics in health care policy center on measurement of the quality of care provided by health plans through consumer assessments and clinical and administrative data. Among his current major projects are (1) the Consumer Assessments of Healthcare Providers and Systems (CAHPS) survey implementation for the Medicare system, (2) methodology for surveys in psychiatric epidemiology, centered on validation of the CIDI-A (adolescent) survey in the National Comorbidity Study-Adolescent, and (3) studies on determinants of quality of care for cancer, including both the Statistical Coordinating Center and a research site for the NCI-funded CanCORS (Cancer Consortium for Outcomes Research and Surveillance) study. Other research interests include measurement of disparities in health care, and privacy and confidentiality for health care data.

He is a member of the Committee on National Statistics (CNSTAT) of the National Academy of Sciences and has served on CNSTAT panels on census methodology, small area estimation and race/ethnicity measurement, as well as the Committee on the National Quality Report on Health Care Delivery of the Institute of Medicine.

Dr. Zaslavsky received his A.B. degree at Harvard College, his M.S. at Northeastern University, and his Ph.D. at the Massachusetts Institute of Technology. He is a Fellow of the American Statistical Association.

Professor Zaslavsky will present a talk entitled "Modeling the covariance structure of random coefficients to characterize the quality variation in health plans." The presentation will be at noon on Wednesday, November 29th, in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Eleanor Neff Powell at 7:59 AM

November 22, 2006

Business Information and Social Science Statistics, Part II

I mentioned in this earlier blog entry an interview I did with DM Review. Here's the sequel.

Posted by Gary King at 2:25 PM

November 21, 2006

Back to the Drawing Board?


Have you ever been to a social science talk and heard somebody saying things like "I guess I will have to go back to the drawing board…"? I always wondered what that really meant, until an engineering friend of mine suggested taking a look at this.

Maybe we can get one for the IQSS?

Posted by Jens Hainmueller at 11:34 AM

November 17, 2006

Bayesian brains?

Amy Perfors

Andrew Gelman has a link to a study that just came out in Nature Neuroscience whose author, Alex Pouget at the University of Rochester, suggests that "the cortex appears wired at its foundation to run Bayesian computations as efficiently as can be possible." I haven't read the paper yet, so I don't have much in the way of intelligent commentary, but I'll try to take a look at it soon. In the meantime, here is a link to the press release so you can read something about it even if you don't have access to Nature Neuroscience. From the blurb, it sounds pretty neat, especially if you (like me) are at all interested in the psychological plausibility of Bayesian models as applied to human cognition.

Posted by Amy Perfors at 11:40 AM

The "Imperial Grip" of Instrumental Variables

The Economist is agog over the increasing prominence of instrumental variables in econometrics ("Winds of Change", November 4, 2006). While it is always nice to get some square inches in a publication with a circulation greater than a few thousand, I'm afraid that I tend to sympathize more with the "instrument police" than the "instrumentalists."

For a variable to be a valid instrument, it must be (a) correlated with the variable for which we are trying to estimate a causal effect, and (b) only affect the outcome through the proposed causal variable, such that an exclusion restriction is satisfied. This is true for every estimation in which a proposed instrument is used; one must make a separate case for the validity of the exclusion restriction with respect to each analysis. Leaving aside what should be the second-order problem of actually carrying out an IV analysis, which may be a first-order problem in practice ("what do you mean it has no mean?"), our inability to verify the exclusion restriction in the case of naturally occurring instruments forces us to move from the substance of the problem we are trying to investigate to a duel of "just-so stories" for or against the restriction, a debate that typically cannot be resolved by looking at the empirical evidence.
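To see what is at stake in condition (b), here is a minimal simulation sketch. Everything in it is assumed for illustration: the data-generating process, the true effect of 2, and the variable names have nothing to do with the papers discussed below. It compares the simple IV (Wald) ratio cov(z, y) / cov(z, x) with and without a direct path from the instrument to the outcome.

```python
import random

def cov(a, b):
    """Sample covariance (dividing by n) of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)

def simulate(direct_effect, n=20000, beta=2.0, seed=0):
    """Return (OLS slope, IV estimate) for simulated data.

    z is the instrument, u an unobserved confounder, x the treatment and
    y the outcome. direct_effect is the size of the z -> y path that the
    exclusion restriction says must be zero.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [
        beta * xi + ui + direct_effect * zi + rng.gauss(0, 1)
        for xi, ui, zi in zip(x, u, z)
    ]
    ols = cov(x, y) / cov(x, x)  # biased by the confounder u
    iv = cov(z, y) / cov(z, x)   # consistent only if direct_effect == 0
    return ols, iv

ols_valid, iv_valid = simulate(direct_effect=0.0)
ols_invalid, iv_invalid = simulate(direct_effect=0.5)
print(ols_valid, iv_valid)      # OLS overshoots 2; IV lands close to 2
print(ols_invalid, iv_invalid)  # IV now lands close to 2.5 instead
```

Nothing in the observed (z, x, y) data distinguishes the two runs. The second IV estimate looks just as clean as the first, which is why the argument ends up being about stories rather than evidence.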

Consider the two papers described in the Economist article. The first attempts to estimate the effect of colonialism on current economic outcomes. The authors propose wind speed and direction as an instrument for colonization, arguing (plausibly) that Europeans were more likely to colonize an island if they were more likely to encounter it while sailing. So far so good. Then they argue that, while colonization in the past has an effect on economic outcomes in the present, being situated in a location favorable for sailing in the past (i.e., before steam-powered ships) does not. Is this really plausible? The authors think so, I don't, and it isn't obvious that there is a way to resolve the matter. In the second example, the failure of ruling dynasties to produce an heir in Indian princely states is used as an instrument for the imposition of direct rule by the British. Here the exclusion restriction may be more plausible (or - shameless plug - maybe not, if it is the shift from a hereditary to a non-hereditary regime rather than colonialism per se that affects outcomes). One way or the other, is this really what we should be arguing about?

None of this is to say that instrumental variable models can never be useful. When we can be more confident that the exclusion restriction is satisfied (usually because we designed the instrument ourselves), then IV approaches make a lot of sense. Unfortunately (or fortunately), we can't go back and randomly assign island discoveries using something like a coin flip rather than the trade winds. Despite this, nothing seems to slow down the pursuit of more and more tortured instruments. The observation that "the instrumental variable now enjoys an almost imperial grip on the imagination of economists" carries more irony than was perhaps intended.

Posted by Mike Kellermann at 11:03 AM

November 16, 2006

How to present math in talks

Since writing my last post (The cognitive style of better powerpoint), I noticed that two other bloggers wrote rather recently on the same topic. The first, from Dave Munger at Cognitive Daily, actually proposes a bit of an experiment to compare the efficacy of text vs. powerpoint - results to be posted Friday. The second, from Chad Orzel at Uncertain Principles, offers a list of "rules of thumb" for doing a good PowerPoint talk.

Given all this, you'd think I wouldn't have anything to add, right? Well, never underestimate my willingness to blather on and on about something. I actually think there's one thing neither they nor I discuss much, and that is presenting mathematical, technical, or statistical information. Both Orzel and I recommend, as much as possible, avoiding equations and math in your slides. And that's all well and good, but sometimes you just have to include some (especially if you're a math teacher and the talk in question is a lecture). For me, this issue crops up whenever I need to describe a computational model -- you need to give enough detail that it doesn't look like the results just come out of thin air, because if you don't, nobody will care about what you've done. And often "enough detail" means equations.

So, for whatever it's worth, here are my suggestions for how to present math in the most painless and effective way possible:

Abandon slideware. This isn't always feasible (for instance, if the conference doesn't have blackboards), nor even necessarily a good idea if the equation count is low enough and the "pretty picture" count is high enough, but I think slideware is sometimes overused, especially if you're a teacher. When you do the work on the blackboard, the students do it with you; when you do it on slideware, they watch. It is almost impossible to be engaged (or keep up) when rows of equations appear on slides; when the teacher works out the math on the spot, it is hard not to. (Okay, harder).

If you can't abandon slideware:

1. Include an intuitive explanation of what the equation means. (This is a good test to make sure you understand it yourself!). Obviously you should always do this verbally, but I find it very useful to write that part in text on the slide also. It's helpful for people to refer to as they try to match it with the equation and puzzle out how it works and what it means -- or, for the people who aren't very math-literate, to still get the gist of the talk without understanding the equation at all.

2. Decompose the equation into its parts. This is really, really useful. One effective way to do this is to present the entire thing at once, and then go through each term piece-by-piece, visually "minimizing" the others as you do so (either grey them out or make them smaller). As a trivial example, consider the equation z = x/y. You might first grey out (y) and talk about x. Then talk about y and grey out x: you might note things like that y is the denominator, you can see that when y gets larger our result gets smaller, etc. My example is totally lame, but this sort of thing can be tremendously useful when you get equations that are more complicated. People obviously know what numerators and denominators are, but it's still valuable to explicitly point out in a talk how the behavior of your equation depends on its component parts -- people could probably figure it out given enough time, but they don't have that time, particularly when it's all presented in the context of loads of other new information. And if the equation is important enough to put up, it's important to make sure people understand all of its parts.

3. As Orzel mentioned, define your terms. When you go through the parts of the equation you should verbally do this anyway, but a little "cheat sheet" there on the slide is invaluable. I find it quite helpful sometimes to have a line next to the equation that translates the equation into pseudo-English by replacing the math with the terms. Using my silly example, that would be something like "understanding (z) = clarity of images (x) / number of equations (y)". This can't always be done without cluttering things too much, but when you can, it's great.

4. Show some graphs exploring the behavior of your equation. ("Notice that when you hold x steady, increasing y results in smaller z"). This may not be necessary if the equation is simple enough, but if it's simple enough maybe you shouldn't present it, and just mention it verbally or in English. If what you're presenting is an algorithm, try to display pictorially what it looks like to implement the algorithm. Also, step through it on a very simple dataset. People remember and understand pictures far better than equations most of the time.

5. When referring back to your equation later, speak English. By this I mean that if you have a variable y whose rough English meaning is "number of equations", whenever you talk about it later, refer to it as "number of equations", not y. Half of the people won't remember what y is after you move on, and you'll lose them. If you feel you must use the variable name, at least try to periodically give reminders about what it stands for.

6. Use LaTeX where possible. LaTeX creates equations that are clean and easy to read, unlike PowerPoint (even with lots of tweaking). You don't necessarily have to do the entire talk in LaTeX if you don't want to, but at least make the equations in LaTeX, screen capture them and save them as bitmaps, and paste them into PowerPoint. They are much, much easier to read.

Obviously, these points become more or less important depending on the mathematical sophistication of your audience, but I think it's far, far easier to make mathematical talks too difficult rather than too simple. This is because it's not a matter (or not mainly a matter) of sophistication -- some of the most egregious violators of these suggestions that I've seen have been at NIPS, a machine learning conference -- it's a matter of how much information your audience can process in a short amount of time. No matter how mathematically capable your listeners are, it takes a while (and a fair amount of concentration) to see the ramifications and implications of an equation or algorithm while simultaneously fitting it in with the rest of your talk, keeping track of your overall point, and thinking of how all of this fits in with their research. The easier you can make that process, the more successful the talk will be.

Any agreements, disagreements, or further suggestions, I'm all ears.

Posted by Amy Perfors at 11:24 AM

November 15, 2006

Gender as a Personal Choice

Jim Greiner

Greetings from the job market for legal academics, which combines the worst aspects of the job markets of all other fields. Apologies for being slow to bring this up, but an article in last week’s New York Times (Tuesday, November 7, 2006, page A1, by Damien Cave) is worth a look. The subject area is recording gender in New York City records. The City’s Board of Health is considering a proposal to allow persons born in the City to change the sex as documented on their birth certificates upon providing certain documentation (e.g., affidavits from doctors and mental health professionals) asserting that the proposed gender change would be permanent. Previously, the City required more physical manifestations of a sex change before it would change its records.

Question: are we moving toward a world in which sex, like race, becomes a personal choice, at least as recorded in official records? Note that in the race context, the law can’t seem to make up its mind on this. The Census Bureau records self-reports only, and many modern social scientists consider race a social construct only, with no relevant biological component. But some existing statutes still define race in terms of biology (e.g., 18 U.S.C. § 1093(6)).

Second question: suppose we are moving toward such a world; what will it do to our efforts to enforce anti-discrimination laws?

Posted by James Greiner at 1:51 PM

November 14, 2006

Meta-analysis, Part II

Last time I wrote about the popularity of meta-analysis for synthesizing the results of multiple studies and cited education researcher Derek Briggs, who believes that the method is used too often and sometimes incorrectly.

Recently, I informally re-examined the data from a published meta-analysis on reading instruction methods, running four different Bayesian models on the set of effect sizes given in the paper. All of the hierarchical Bayesian models (which varied only in the priors used and covariates included) showed that a significant amount of uncertainty was ignored by the original meta-analysis, which assumed that the effect size produced by each study was an estimate of one overall true mean. The preliminary results from my analysis supported Briggs' position, since they did not show the significant results that were evident in the meta-analysis paper; in other words, none of the Bayesian analyses came close to indicating a significant effect for the reading instruction method in question. I claim no reliable conclusion for my own analysis – I’m not even going to specify the original paper here – but re-examining the methods of meta-analyses seems worthwhile for the purpose of uncovering uncertainty, if not developing new techniques for synthesizing multiple studies.
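The difference between assuming "one overall true mean" and allowing effects to vary across studies can be sketched in a few lines of Python. This is not the hierarchical Bayesian analysis described above: it swaps in the classical DerSimonian-Laird random-effects estimator as a simpler stand-in, and the effect sizes and variances are made up for illustration.

```python
import math


def fixed_effect(effects, variances):
    """Inverse-variance pooled mean, assuming one true effect for all studies."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return mean, se


def random_effects_dl(effects, variances):
    """DerSimonian-Laird pooled mean, allowing true effects to differ by study."""
    weights = [1.0 / v for v in variances]
    k = len(effects)
    fe_mean, _ = fixed_effect(effects, variances)
    # Cochran's Q measures how much the studies disagree with each other.
    q = sum(w * (y - fe_mean) ** 2 for w, y in zip(weights, effects))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c)   # between-study variance estimate
    star = [1.0 / (v + tau2) for v in variances]
    mean = sum(w * y for w, y in zip(star, effects)) / sum(star)
    se = math.sqrt(1.0 / sum(star))
    return mean, se, tau2


# Hypothetical standardized effect sizes and their sampling variances.
effects = [0.80, 0.10, -0.20, 0.60, 0.05, 0.45]
variances = [0.01, 0.01, 0.02, 0.02, 0.01, 0.02]

print("fixed effect (mean, se):        ", fixed_effect(effects, variances))
print("random effects (mean, se, tau2):", random_effects_dl(effects, variances))
```

When the studies disagree with each other more than their sampling error can explain, the random-effects standard error is larger than the fixed-effect one, which is the kind of ignored uncertainty described above.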

The implications are nontrivial: the evidence supporting the teaching methods required by the billion dollar Reading First initiative, part of the Department of Education’s No Child Left Behind Act, is a long collection of meta-analyses performed by the National Reading Panel.

Posted by Cassandra Wolos at 12:43 PM

November 13, 2006

Applied Statistics - Joshua Angrist

This week the Applied Statistics Workshop will present a talk by Joshua Angrist, Professor of Economics at the Massachusetts Institute of Technology.

Professor Angrist received his Ph.D. in Economics at Princeton University, after which he joined the Economics Departments at Harvard University and Hebrew University before coming to MIT. He is a Fellow of the American Academy of Arts and Sciences and The Econometric Society, and has served as Co-editor of the Journal of Labor Economics. His publications have appeared in Econometrica, The American Economic Review, The Economic Journal, and The Quarterly Journal of Economics, among others. His research interests include the effects of school inputs and organization on student achievement, the impact of education and social programs on the labor market, immigration, labor market regulation and institutions, and econometric methods for program and policy evaluation. Prof. Angrist also has a long-standing interest in public policy. In addition to his academic work, he has worked as a consultant to the U.S. Social Security Administration, The Manpower Demonstration Research Corporation, and for the Israeli government after the Oslo peace negotiations in 1994.

Professor Angrist will present a talk entitled "Lead them to Water and Pay them to Drink: An Experiment in Services and Incentives for College Achievement." The presentation will be at noon on Wednesday, November 15th, in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Eleanor Neff Powell at 1:08 PM

November 10, 2006

Chernoff Faces

We haven't had much on graphics on this blog yet, partly because there are several specialized fora for this peculiar aspect of statistics: for instance, junkcharts, the R-gallery, information aesthetics, the Statistical Graphics and Data Visualization blog, the Data Mining blog, Edward Tufte's forum, Andrew Gelman's blog and others. Yet I assume readers of this blog wouldn't mind a picture every once in a while, so here are some Chernoff faces for you. In the spirit of Mike's recent entry, they illustrate team statistics from the 2005 baseball season:


I recently came across Chernoff faces while looking for a neat way to display multivariate data to compare several cities along various dimensions in a single plot. Chernoff faces are a method introduced by Herman Chernoff (Prof Emeritus of Applied Math at MIT and of Statistics at Harvard) in 1971 that allows one to convert multivariate data to cartoon faces, the features of which are controlled by the variable values. So for example in the above graph, each team's winning percentage is represented by face height, smile curve, and hair styling; hits are represented by face width, eye height, nose height; etc. (for details and extensions see here).

The key idea is that humans are well trained to recognize faces and discern small changes without difficulty. Therefore Chernoff faces allow for easy outlier detection and pattern recognition despite multiple dimensions of the data. Since the features of the faces vary in perceived importance, the way in which variables are mapped to the features should be carefully chosen.

Mathematica and R have canned algorithms for Chernoff faces (see here and here). I haven't seen a Chernoff plot in a social science journal yet, but maybe I am reading the wrong journals. Does anyone know of articles that use this technique? Also, do you think that this is an effective way of displaying data that should be used more often? Obviously there are also problems with this type of display, but even if you don't like the key idea you have to admit that they look much funnier than the boring bar graphs or line plots we see all the time.

Posted by Jens Hainmueller at 10:29 AM

November 9, 2006

The cognitive style of better powerpoint

Amy Perfors

While at the BUCLD conference this last weekend, I found myself thinking about the cognitive effects of using PowerPoint presentations. If you haven't read Edward Tufte's Cognitive Style of PowerPoint, I highly recommend it. His thesis is that powerpoint is "costly to both content and audience", basically because of the cognitive style that standard default PPT presentations embody: hierarchical path structure for organizing ideas, emphasis on format over content, and low information resolution chief among them.

Many of these negative results -- though not all -- occur because of a "dumb" use of the default templates. What about good powerpoint, that is, powerpoint that isn't forced into the hierarchical path-structure of organization, that doesn't use hideous, low-detail graphs? [Of course, this definition includes other forms of slide presentation, like LaTeX; I'll use the word "slideware" to mean all of these]. What are the cognitive implications of using slideware, as opposed to other types of presentation (transparencies, blackboard, speech)?

Here are my musings, unsubstantiated by any actual research:

I'd bet that the reliance on slideware actually improves the worst talks: whatever its faults, it at least imposes organization of a sort. And it at least gives a hapless audience something to write down and later try to puzzle over, which is harder to do if the talk is a rambling monologue or involves scribbled, messy handwriting on a blackboard.

Perhaps more controversially, I also would guess that slideware improves the best talks - or, at least, that the best talks with slideware can be as good as the best talks using other media. The PowerPoint Gettysburg Address is a funny spoof, but seriously, can you imagine a two-hour long, $23-million-gross movie of someone speaking in front of a blackboard or making a speech? An Inconvenient Truth was a great example of a presentation that was enhanced immeasurably by the well-organized and well-displayed visual content (and, notably, it did not use any templates that I could tell!). In general, because people are such visual learners, it makes sense that a presentation that can incorporate that information in the "right" way will be improved by doing so.

However, I think that for mid-range quality presenters (which most people are) slideware is still problematic. Here are some things I've noticed:

1. Adding slides is so simple and tempting that it's easy to mismanage your time. I've seen too many presentations where the last 10 minutes are spent hastily running through slide after slide, so the audience loses all the content in the disorganized mess the talk has become.

2. Relatedly, slideware creates the tendency to present information faster than it can be absorbed. This is most obvious when the talk involves math -- which I might discuss in a post of its own -- but the problem occurs with graphs, charts, diagrams, or any other high-content slides (which are otherwise great to have). Some try to solve the problem by creating handouts, but the problem isn't just that the audience doesn't have time to copy down the content -- they don't have the time to process it. Talks without slideware, by forcing you to present content at about the pace of writing, give the audience more time to think about the details and implications of what you're saying. Besides, the act of copying it down itself can do wonders for one's understanding and retention.

3. Most critically, slideware makes it easier to give a talk without really understanding the content or having thought through all the implications. If you can talk about something on an ad hoc basis, without the crutch of having everything written out for you, then you really understand it. This isn't to say that giving a slideware presentation means you don't really understand your content; just that it's easier to get away with not knowing it.

4. Also, Tufte mentioned that slideware forces you to package your ideas into bullet-point size units. This is less of a problem if you don't slavishly follow templates, but even if you don't, you're limited by the size of the slide and font. So, yeah, what he said.

That all said, I think slideware is here to stay; plus, it has many advantages over other types of presentation. So my advice isn't to not use slideware (except, perhaps, for math-intensive talks). Just keep these problems in mind when making your talks.

Posted by Amy Perfors at 11:53 AM

November 8, 2006

Fixing Math Education by Making It Less Enjoyable?

Justin Grimmer

A recent Brookings Institution report on the mathematics scores of junior high and high school students from different nations uncovers some paradoxical correlations. Using standardized test scores, the report shows that nations with the highest scores also have the students with the lowest confidence in their math ability and the lowest levels of enjoyment from learning math. This is evident in American students, who report high confidence and enjoyment but post only middle-of-the-pack scores on standardized tests.

Casting correlation/causation concerns aside, the Brookings report goes on to argue that the American mathematical education experience is perhaps too enjoyable for students. Rather than informing students about the important mathematical concepts that the foreign textbooks provide, American textbooks are characterized as trying too hard to create an enjoyable classroom experience.

The policy implication provided is to make mathematics less enjoyable in American classrooms by discarding colorful pictures and interesting story problems. At the very least, the report suggests that educators’ attention should be redirected from making math fun to making math education solely about mathematics.

Because of the study’s limited nature, any drastic policy recommendations should be avoided. After all, the report’s argument merely identifies two paradoxical relationships and then speculates about a causal mechanism that provides one potential explanation for the trend. No effort is made to eliminate alternative causal mechanisms. For example, cultural explanations could explain the discrepancy of the scores and confidence ratings, aside from differences in teaching methodologies. The study also attempts to make an ecological inference, inferring individual-level behavior from aggregated data. While not damning in itself, it does weaken the strength of the conclusions.

That being said, perhaps the problem with American mathematics education does not lie in the attempt to make students happy, but in the material that is presented. Rather than providing students with an in-depth understanding of concepts and introducing proof techniques, high school math assignments are often about memorization and a superficial knowledge of the techniques involved. Perhaps, if the focus were changed to make high school mathematics less like balancing a checkbook and more like Real Analysis, American math students would see an increase in their happiness in the classroom and also in their test scores.

Posted by Justin Grimmer at 11:51 AM

November 7, 2006

Election Day

As everyone must know (unless you are lucky enough to not own a television), today is Election Day in the US. I always think of analyzing elections (and pre-election polling) as the quintessential statistical problem in political science, so I'm sure that many of us are eagerly waiting to get our hands on the results. Recent elections in the U.S. have been somewhat controversial, to say the least, which is probably bad for the country but unquestionably good for the discipline (see the Caltech/MIT Voting Technology Project for one example), and my guess is that this election will continue the trend. Law professor Rick Hasen sets the threat level for post-election litigation at orange; anyone looking for an interesting applied statistics project would be well advised to check out his site in the coming weeks. In the meantime, the Mystery Pollster (Mark Blumenthal) has an interesting post on the exit polling strategy for today's election; apparently we shouldn't expect preliminary and incomplete results to be leaked until 5pm this year.

Posted by Mike Kellermann at 12:36 PM

November 3, 2006

Negative Results

Felix Elwert

In September, The Institute of Medicine released its report on “The Future of Drug Safety,” featuring some goodies on the dissemination of research findings.

One of the recommendations echoes one of the favorite hallway complaints at IQSS: that journals are perennially hung up on publishing *** alpha less than 0.05 yay-yay statistically significant results.

Says the Washington Post:

“[According to the report] manufacturers should also be required to register all clinical trials they sponsor in a government-run database to allow patients and physicians to see the outcome of all studies, not just those published in medical journals, the report said. Studies that show positive results for a drug are more likely to be published by journals than negative ones.”

Welcome to the world of publication bias. (The report is yours for a highly significant $44.)

Posted by Felix Elwert at 11:59 AM

November 2, 2006

Incumbency as a Source of Contamination in Mixed Electoral Systems

Jens Hainmueller

Since the early 1990s, more than 30 countries have adopted mixed electoral systems that combine single-member districts (SMD) in one tier with proportional representation (PR) in a second tier. Political scientists like this type of electoral system because each voter gets to cast two votes, the first vote according to one set of institutional rules and the second vote according to another. Some have argued that this allows for causal inference because it offers a controlled comparison of voting patterns under different electoral rules. But does it really?

The more recent literature on so-called contamination effects undermines this claim. Several papers (Herron and Nishikawa 2001; Cox and Schoppa 2002; Ferrara, Herron, and Nishikawa 2005) have found evidence that there are interaction effects between the two tiers in mixed electoral systems. For example, small parties are able to attract more PR votes in those districts in which they run SMD candidates. The argument is that running an SMD candidate gives a human face to the party and thus enables it to attract additional PR votes.

In a recent paper, Holger Kern and I attempt to add to this debate by identifying incumbency as a source of contamination in mixed electoral systems. It is well known that incumbents that run in single-member district (SMD) races have a significant advantage compared to non-incumbents (Gelman and King 1990). It thus seems plausible to expect that this advantage carries over to the proportional representation (PR) tier, and that incumbents are able to attract additional PR votes for their party in the district. In our paper we identify such an effect using a regression-discontinuity design that exploits the local random assignment to incumbency in close district races (based on an earlier paper by Lee 2006). The RD design allows us to separate a subpopulation of district races in which treatment is assigned as good as randomly from the rest of the data that is tainted by selection effects. We find that incumbency causes a gain of 1 to 1.5 percentage points in PR vote share. We also present simulations of Bundestag seat distributions, demonstrating that contamination effects caused by incumbency have been sufficiently large to trigger significant shifts in parliamentary majorities.

Needless to say, any feedback is highly appreciated.

Posted by Jens Hainmueller at 12:00 PM

November 1, 2006

An Individual-Level Story and Ecological Inference

Jim Greiner

I blogged some last year (see here) on whether an individual-level story is necessary, or useful, to ecological inference. For a review of what ecological inference is, and what I mean by an individual-level story, see the end of this entry. Last year, I stated that such a story was helpful in explaining an ecological inference technique, even if it might not be strictly necessary for modeling. Gary disagreed that such a story was at all helpful, and we had a little debate on the subject, which you can access here. Lately, though, I’ve been thinking that an individual-level story really is necessary for good modeling, not just for communication of a model. In particular, it seems like an individual-level model is required to incorporate survey information into an ecological inference model. Survey data is, after all, data collected at the level of the individual, and with only an aggregate-level model, it’s hard to see how one could incorporate it. Any thoughts from anyone out there?

To review: ecological inference is the effort to predict the values of the internal cells of contingency tables (usually assumed to be exchangeable) when only the margins are observed. A classic example is in voting, where one observes how many (say) black, white, and Hispanic potential voters there are in each precinct, and one also observes how many votes were cast for Democratic and Republican candidates. What one wants to know is, say, how many blacks voted Democrat. By an individual-level story, I mean a model of voting behavior at the level of the individual voter and a mathematical theory of how to aggregate up to the precinct-level counts.
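For readers new to the problem, here is a small Python sketch of how little the margins alone pin down. It computes the deterministic bounds (the classic "method of bounds") on one internal cell of a single precinct's table. It is not an individual-level model of any kind, and it makes simplifying assumptions for illustration: only two groups (black and other), everyone votes, and every vote goes to one of two parties. The function name and the counts are hypothetical.

```python
def black_democrat_bounds(n_black, n_other, n_dem):
    """Bounds on the number of black Democratic voters in one precinct.

    Only the margins are used: group sizes (n_black, n_other) and the
    Democratic vote total (n_dem). Assumes full turnout and two parties.
    """
    total = n_black + n_other
    if not 0 <= n_dem <= total:
        raise ValueError("Democratic votes must be between 0 and the precinct total")
    # If there are more Democratic votes than non-black voters,
    # the excess must have come from black voters.
    lower = max(0, n_dem - n_other)
    # The cell cannot exceed either of its two margins.
    upper = min(n_black, n_dem)
    return lower, upper


# Hypothetical precinct: 300 black voters, 700 others, 800 Democratic votes.
print(black_democrat_bounds(300, 700, 800))
```

Any ecological inference model has to place the unobserved cell somewhere inside these bounds. The modeling question is what additional assumptions (or additional data, such as surveys) justify a more precise answer.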

Posted by James Greiner at 12:00 PM

October 31, 2006

Predicting Elections

Jacob Eisenstein at MIT has developed a smart election predictor for the US Senate Elections using a Kalman Filter. The filter helps to decide how much extra weight to attach to more recent polls. Check it out here; he also has some details on the method here.
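To give a sense of how a Kalman filter weights recent polls, here is a minimal one-dimensional sketch in Python. It is not Eisenstein's model: it assumes the simplest possible setup, in which true support follows a random walk and each poll is a noisy reading of it, and all names and numbers (`poll_var`, `drift_var`, the starting values) are illustrative assumptions.

```python
def kalman_poll_filter(polls, poll_var, drift_var, init_mean=50.0, init_var=100.0):
    """Filter a sequence of poll readings (in percent), oldest first.

    poll_var  : assumed sampling variance of each poll
    drift_var : assumed variance of the change in true support between polls
    Returns a list of (estimated_support, estimate_variance) after each poll.
    """
    mean, var = init_mean, init_var
    estimates = []
    for reading in polls:
        # Predict: true support may have drifted since the last poll,
        # so the old estimate becomes less certain.
        var = var + drift_var
        # Update: the gain is the weight given to the new poll.
        gain = var / (var + poll_var)
        mean = mean + gain * (reading - mean)
        var = (1.0 - gain) * var
        estimates.append((mean, var))
    return estimates


polls = [48.0, 51.0, 50.0, 53.0, 54.0]   # hypothetical poll results
for drift in (0.0, 4.0):
    final_mean, final_var = kalman_poll_filter(polls, poll_var=9.0, drift_var=drift)[-1]
    print("drift_var =", drift, "-> estimate", round(final_mean, 2))
```

With `drift_var` set to zero and a vague starting point, the filter is close to a plain running average of the polls. The larger the assumed drift, the more the estimate leans toward the latest readings.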

Posted by Sebastian Bauhoff at 2:01 PM

More thoughts on publication bias and p-values

Amy Perfors

In a previous post about the Gerber & Malhotra paper about publication bias in political science, I rather optimistically opined that the findings -- that there were more significant results than would be predicted by chance, and that many of those were suspiciously close to 0.05 -- were probably not deeply worrisome, at least for those fields in which experimenters could vary the number of subjects run based on the significance level achieved thus far.

Well, I now disagree with myself.

This change of mind comes as a result of reading about the Jeffreys-Lindley paradox (Lindley, 1957), a Bayes-inspired critique of significance testing in classical statistics. It says, roughly, that with a large enough sample size, a p-value can be arbitrarily close to zero even though the posterior probability of the null hypothesis is very close to one. In other words, a classical statistical test might reject the null hypothesis at an arbitrarily low p-value, even though the evidence that it should be accepted is overwhelming. [A discussion of the paradox can be found here].

When I learned about this result a few years ago, it astonished me, and I still haven't fully figured out how to deal with all of the implications. (This is obvious, since I forgot about it when writing the previous post!). As I understand the paradox, the intuitive idea is that, with larger sample size, you will naturally get some data that appears unlikely (and, the more data you collect, the more likely you are to see some really unlikely data). If you forget to compare the probability of that data under the null hypothesis with the probability of the data under the alternative hypotheses, then you might get an arbitrarily low p-value (indicating that the data are unlikely under the null hypothesis) even if the data are even more unlikely under any of the alternatives. Thus, if you just look at the p-value, without taking effect size, sample size, or the comparative posterior probability of each hypothesis under consideration into account, you are likely to wrongly reject the null hypothesis on the basis of the p-value, even if it is the most likely of all possibilities.
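A small numerical sketch in Python may help. It uses a standard textbook setup rather than anything taken from Lindley's paper: normal data with known unit variance, a point null that the mean is zero, a N(0, tau2) prior on the mean under the alternative, and equal prior odds. All of these choices are assumptions made for illustration. The test statistic z is held fixed, so the p-value never changes, while the sample size grows.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def posterior_prob_null(z, n, tau2=1.0, prior_null=0.5):
    """Posterior probability of H0: mean = 0, given z = sqrt(n) * sample mean.

    Assumes unit-variance normal data and a N(0, tau2) prior on the mean
    under the alternative.
    """
    shrink = n * tau2 / (1.0 + n * tau2)
    # Bayes factor in favour of the null: ratio of the marginal densities
    # of the sample mean under H0 and under the alternative.
    bf01 = math.sqrt(1.0 + n * tau2) * math.exp(-0.5 * z * z * shrink)
    odds = bf01 * prior_null / (1.0 - prior_null)
    return odds / (1.0 + odds)


z = 2.5
print("p-value:", two_sided_p_value(z))
for n in (10, 100, 10000, 1000000):
    print("n =", n, " P(H0 | data) =", posterior_prob_null(z, n))
```

The p-value stays the same on every line, comfortably below 0.05, while the posterior probability of the null climbs toward one as n grows. A fixed z at a huge sample size corresponds to a tiny observed effect, which the point null explains better than a diffuse alternative does.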

The tie-in with my post before, of course, is that it implies that it isn't necessarily "okay" practice to keep increasing sample size until you achieve statistical significance. Of course, in practice, sample sizes rarely get larger than 24 or 32 -- at the absolute outside, 50 to 100 -- which is much smaller than infinity. Does this practical consideration, then, mean that the practice is okay? As far as I can tell, it is fairly standard (but then, so is the reliance on p-values to the exclusion of effect sizes, confidence intervals, etc., so "common" doesn't mean "okay"). Is this practice a bad idea only if your sample gets extremely large?
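One way to get a feel for the question is to simulate it. The Python sketch below assumes the null hypothesis is true, draws unit-variance normal data, and compares a fixed sample of 24 subjects with a procedure that starts testing at 24 subjects and keeps adding one subject at a time, up to 100, stopping as soon as a z-test gives p < 0.05. The known-variance z-test, the one-at-a-time peeking, and the number of simulations are simplifying assumptions chosen for illustration.

```python
import math
import random


def false_positive_rate(n_start, n_max, n_sims=2000, seed=1):
    """Share of null-true experiments that ever reach |z| > 1.96.

    Testing begins once n_start subjects are in and is repeated after
    every additional subject until n_max. With n_start == n_max this is
    an ordinary fixed-sample test.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)   # one more subject, no true effect
            if n >= n_start and abs(total / math.sqrt(n)) > 1.96:
                rejections += 1
                break                      # stop collecting at "significance"
    return rejections / n_sims


print("fixed n = 24:             ", false_positive_rate(24, 24))
print("peek from n = 24 to 100:  ", false_positive_rate(24, 100))
```

Under these assumptions the fixed-sample rate sits near the nominal 5 percent, while the stop-when-significant procedure rejects a true null noticeably more often, even though the sample never gets anywhere near "extremely large".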

Lindley, D.V. (1957) A statistical paradox. Biometrika, 44. 187-192

Posted by Amy Perfors at 10:00 AM

October 30, 2006

Applied Statistics - Nan Laird & Christoph Lange

This week the Applied Statistics Workshop will present a talk by Nan Laird, Professor of Biostatistics in the Harvard School of Public Health, and Christoph Lange, Assistant Professor of Biostatistics in the Harvard School of Public Health.

Before joining the Department of Biostatistics, Professor Laird received her Ph.D. in Statistics from Harvard and was an Assistant Prof. of Statistics at Harvard. She has published extensively in Statistics in Medicine, Biostatistics, American Journal of Human Genetics and the American Journal of Epidemiology among others. Her research interest is the development of statistical methodology in four primary areas: statistical genetics, longitudinal studies, missing or incomplete data, and analysis of multiple informant data.

Professor Lange earned his Ph.D. in Applied Statistics from the University of Reading, and has been a member of the Department of Biostatistics since then. His publications have appeared in Biostatistics, the American Journal of Human Genetics, Genetic Epidemiology, and Genetics. Prof. Lange's current research interests fall into the broad areas of statistical genetics and generalized linear models. Recent topics in statistical genetics include family-based association tests, meta-analysis of linkage studies, GEE-methods in linkage analysis and marker-assisted selection.

Prof. Laird and Prof. Lange will present a talk entitled “Statistical Challenges and Innovations for Gene Discovery”. An abstract for the talk and associated background papers are available from the course website. The presentation will be at noon on Wednesday, November 1st, in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Eleanor Neff Powell at 9:05 AM

October 29, 2006

America by the Numbers

Reading the Data Mining blog, I just learned about this cool visualization of the US population density presented by Time magazine.

Take a closer look here. Cute, isn't it?

Posted by Jens Hainmueller at 3:14 PM

October 26, 2006

Newcomb's Paradox: Reversing Causality?

Justin Grimmer

Newcomb’s paradox is a classic problem in philosophy and also an entertaining puzzle to consider. Here is one version of the paradox. Suppose you are presented with two boxes, A and B. You are allowed to take just box A, just box B, or both A and B. There will always be $1000 in box A, and there will either be $0 or $1,000,000 in box B.

A ‘predictor’ determines the contents of box B before you have arrived, using the following plan. If the predictor believes you will pick both box A and B, then she places nothing in box B, but if she believes that you will only take box B, then she places the $1,000,000 in box B.

What makes this predictor special is her amazing accuracy. In the previous billion plays of the game she has never been wrong.

So, you have the two boxes in front of you, what should you do? Keep in mind, the predictor has already made her decision when you arrive at the boxes, so by our normal rules of causality (events in the future cannot cause past events), our actions cannot change what the predictor has decided.

Posted by Justin Grimmer at 12:00 PM

October 25, 2006

Unconscious Bias & Expert Witnesses

Jim Greiner

Quantitative expert witnesses are essential to modern litigation. But why do they disagree so often?

An excerpt from an article by Professor Franklin Fisher appears below. It’s a tad long, but it’s really worth reading. Does it ring a familiar bell with anyone out there?

“It is not, however, always easy to avoid becoming a ‘hired gun’ . . . The danger is sometimes a subtle one, stemming from a growing involvement in the case and friendship with the attorneys. For the serious professional, concerned about preserving his or her standards, the problem is not that one is always being asked to step across a well-defined line by unscrupulous lawyers. Rather, it is that one becomes caught up in the adversary proceeding itself and acquires the desire to win. . . . Particularly because lawyers play by rules that go beyond those of academic fair play, it becomes insidiously easy to see only the apparent unfairness of the other side while overlooking that of one’s own.”

Franklin M. Fisher, Statisticians, Econometricians, and Adversary Proceedings, 81 J. AM. STAT. ASS’N. 277, 285 (1986)

Posted by James Greiner at 12:00 PM

October 24, 2006


Here’s an interesting piece that should help you keep your New Semester resolutions by understanding procrastination better. Sendhil Mullainathan recently used research by Dan Ariely and Klaus Wertenbroch as motivation for his undergraduate psychology and economics class. Though it’s not exactly statistics, it seems the insights could be useful for grad students and their advisors.

Ariely and Wertenbroch did several experiments to see how deadlines might help overcome procrastination. They examine whether deadlines might be effective pre-commitment devices, and whether they can enhance performance. In one of their experiments, they asked participants to proofread three meaningless synthetic texts. Participants received financial rewards for finding errors and submitting on time (just like in a problem set…). They randomized participants into three categories: three evenly-spaced deadlines every 7 days; an end-deadline after 21 days; or a self-imposed schedule of deadlines within a three week period.

Which one would you select if you could? Maybe the end-deadline because it gives you the most flexibility in arranging the work (similar to a final exam or submitting your dissertation all at once)? Ariely and Wertenbroch found that the end-deadline does the worst both in terms of finding errors and submitting on time. Participants with evenly-spaced deadlines did best. But that group also liked the task the least, maybe because they had several unpleasant episodes of reading silly texts, or because they spent more time than the other groups.

So when you start your semester with good intentions, consider setting some reasonable and regular deadlines that bind, and get a calendar. Or just wait for the New Year for another chance to become resolute and have another drink in the meantime.

Posted by Sebastian Bauhoff at 12:44 PM

October 19, 2006

Simpson’s Paradox

Jim Greiner

As a lawyer, I have to be interested not just in what quantitative principles are true, but also in how to present “truth” to people without quantitative training. To that end, HELP! One of the maddening things about statistics is Simpson’s paradox. The quantitative concept, undoubtedly well-known to most readers of this blog, is that the correlation between two variables can change sign and magnitude, depending on what is conditioned on. That is, Corr(A, B | C) might be positive, while Corr(A, B | C, D) might be negative, while Corr(A, B | C, D, E) might be positive again. At bottom, this is what’s going on when regression coefficients become (or cease to be) significant as one adds additional variables to the right-hand side. Because regression currently enjoys a stranglehold on expert witness analyses in court cases (I’ll be ranting on that in the future), communicating Simpson's Paradox is a matter of real concern for someone like me who cares about what juries see, hear, and think. Any ideas on how to get this concept across?
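For readers who want to see the sign flip in numbers, here is a small sketch using counts rather than correlations. The hiring counts are invented purely for illustration (they do not come from any case or dataset), and the function names are my own. Within each department women are hired at a higher rate than men, yet the pooled rates point the other way because most women applied to the department that hires few people.

```python
# Hypothetical counts: (applicants, hired) by department and gender.
counts = {
    "Dept 1": {"men": (100, 60), "women": (20, 14)},
    "Dept 2": {"men": (20, 2), "women": (100, 20)},
}

def rate(applicants, hired):
    """Hiring rate as a fraction of applicants."""
    return hired / applicants

def within_department_rates(data):
    """Hiring rate for each gender, conditional on department."""
    return {
        dept: {gender: rate(*cell) for gender, cell in groups.items()}
        for dept, groups in data.items()
    }

def pooled_rates(data):
    """Hiring rate for each gender, ignoring department."""
    totals = {}
    for groups in data.values():
        for gender, (applicants, hired) in groups.items():
            a, h = totals.get(gender, (0, 0))
            totals[gender] = (a + applicants, h + hired)
    return {gender: rate(a, h) for gender, (a, h) in totals.items()}

if __name__ == "__main__":
    for dept, rates in within_department_rates(counts).items():
        print(dept, rates)
    print("Pooled", pooled_rates(counts))
```

Whether a table like this helps a jury is an open question, but it does show the paradox without any mention of conditioning or regression.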

Posted by James Greiner at 11:13 AM

October 18, 2006

Meta-analysis: To Trust or Not to Trust

Cassandra Wolos

Social scientists, who often have a limited ability to create true experiments and replicate studies, value ways to learn from the synthesized results of previous work. A popular quantitative tool designed for this purpose is meta-analysis, which calculates a standardized effect size for each of a set of studies in a literature review and then performs inference on the resulting set of effect sizes. Meta-analysis is particularly common in education research.

Can we trust the results of these analyses?

On the one hand, when performed correctly, meta-analysis should successfully summarize the information available in multiple studies. Combining the results in this way can increase the power of overall conclusions when the sample size in each study is relatively small.
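As a rough illustration of the mechanics, the sketch below computes a standardized mean difference (Cohen's d) for each of several small studies and pools them with inverse-variance weights, which is the simplest fixed-effect approach. The study numbers are hypothetical, the variance formula is a common large-sample approximation, and a real meta-analysis would also have to consider heterogeneity across studies.

```python
import math

# Hypothetical studies:
# (mean_treated, mean_control, pooled_sd, n_treated, n_control)
studies = [
    (52.0, 50.0, 10.0, 20, 20),
    (55.0, 50.0, 12.0, 15, 15),
    (51.0, 50.0, 9.0, 25, 25),
    (54.0, 50.0, 11.0, 18, 18),
]

def cohens_d(m1, m0, sd, n1, n0):
    """Standardized mean difference and its approximate sampling variance."""
    d = (m1 - m0) / sd
    var = (n1 + n0) / (n1 * n0) + d ** 2 / (2 * (n1 + n0))
    return d, var

def fixed_effect_pool(effects):
    """Inverse-variance weighted average of (d, var) pairs and its standard error."""
    weights = [1.0 / var for _, var in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

if __name__ == "__main__":
    effects = [cohens_d(*s) for s in studies]
    for d, var in effects:
        print("d =", round(d, 3), "se =", round(math.sqrt(var), 3))
    pooled, se = fixed_effect_pool(effects)
    print("pooled d =", round(pooled, 3), "se =", round(se, 3))
```

The pooled standard error is smaller than that of any single study, which is the gain in power described above. It is only trustworthy if the individual estimates are themselves unbiased.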

On the other hand, a good meta-analysis relies on the assumption that the original studies were unbiased and generally well-performed. In addition, we hope that the researchers in each study had the same target population in mind and worked independently of each other. Further complicating matters is the potential for publication bias – a meta-analysis will rarely include unpublished studies with less impressive effect sizes.

The second hand represents the view of Derek Briggs at the University of Colorado, Boulder, who in a 2005 Evaluation Review paper objected to what he saw as the overuse of meta-analysis in social science research. He also suggested that assumptions necessary for a reliable meta-analysis are not always met.

More to come on this topic next time.

Posted by Cassandra Wolos at 10:00 AM

October 16, 2006

Applied Stats - Loeffler

This week the Applied Statistics Workshop will present a talk by Charles E. Loeffler, Ph.D. Candidate in Sociology at Harvard University.

Charles graduated magna cum laude from Harvard with a degree in Social Studies, before going on to receive his M. Phil in Criminology from Cambridge University. He has recently completed the National Consortium on Violence Research Pre-Dissertation Fellowship under the mentorship of Prof. Steven Levitt of the University of Chicago. His work has appeared in The New Republic Online, Federal Sentencing Reporter, and Ars Aequi: A Biographical History of Legal Science. Charles's research interests include Criminology, Quasi-Experimental Methods and Decisionmaking.

Charles will present a talk entitled "Is justice blind? A natural experiment in the use of judicial discretion in criminal trials". The working paper for the talk is available from the course website. The presentation will be at noon on Wednesday, October 18th, in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Eleanor Neff Powell at 5:00 AM

October 12, 2006

Causation and Manipulation VII: The Cartoon Version

Dodging Bill Collectors

As Tailor (A) fits customer (B) and calls out measurements, college boy (C) mistakes them for football signals and makes a flying tackle at clothing dummy (D). Dummy bumps head against paddle (E) causing it to pull hook (F) and throw bottle (G) on end of folding hat rack (H) which spreads and pushes head of cabbage (I) into net (J). Weight of cabbage pulls cord (K) causing shears (L) to cut string (M). Bag of sand (N) drops on scale (O) and pushes broom (P) against pail of whitewash (Q) which upsets all over you causing you to look like a marble statue and making it impossible for you to be recognized by bill collectors. Don't worry about posing as any particular historical statue because bill collectors don't know much about art (for more on causal chains in cartoons, click here).

Posted by Jens Hainmueller at 11:00 PM

October 11, 2006

Further readings on the Iraqi excess deaths study

Today's papers were full of reports of a new study in the Lancet (here) on counting the excess deaths in Iraq since the US invasion in 2003. The article by Johns Hopkins researchers is an update on a study published in 2004 which generated a huge debate about the political as well as statistical significance of the estimates. This time the media's attention is again on the magnitude of the estimate (655,000 excess deaths, most of them due to violence) which is again vastly higher than other available numbers. The large uncertainty (95% CI 390,000 - 940,000) gets fewer comments this time, maybe because the interval is further away from 0 than in the 2004 study.

Just to point you to some interesting articles, here is a good summary in today’s Wall Street Journal. Wikipedia has a broad overview of the two studies and criticisms here. Brad deLong responded to criticisms of the 2004 study here; he also covers problems with the cluster sampling approach. And check this and this for some related posts on this blog.

By the way, the WSJ article has a correction for misinterpreting the meaning of 95% confidence. Maybe you can use it to convince your stats students that they should pay attention.

Posted by Sebastian Bauhoff at 11:59 PM

October 10, 2006

Causation and Manipulation VI: The cognitive science version

I can't resist chiming in and contributing post VI on causation and manipulation, but coming at it from a rather different angle: rather than ask what we as researchers should do, the cognitive science question is what people and children actually do - what they assume and know about causal inference and understanding.

You might think that people would (for lack of a better term) suck at this, given other well-known difficulties in reasoning, anecdotal reports from educators everywhere, etc, etc. However, there's a fair amount of evidence that people -- both adults and children -- can be quite sophisticated causal reasoners. The literature on this is vast and growing, so let me just point out one quite interesting finding, and maybe I'll return to the topic in later posts.

One question is whether children are capable of using the difference between evidence from observations and evidence from intervention (manipulation) to build a different causal structure. The well-named "theory theory" theory of development suggests that children are like small scientists and should therefore be quite sophisticated causal reasoners at an early age. To test this, Schulz, Kushnir, & Gopnik [pdf] showed preschool children a special "stickball machine" consisting of a box, out of which two sticks (X and Y) rose vertically. The children were told that some sticks were "special" and could cause the other sticks to move, and some weren't. In the control condition, children saw X and Y move together on their own three times; the experimenter then intervened to pull on Y, causing it to move and X to fail to move. In the experimental condition, the experimenter pulled on one stick (X) and both X and Y moved three times; a fourth time the experimenter pulled on Y, but only it moved (X was stationary).

The probabilities of each stick moving conditional on the other are the same in both cases; however, if the children reason about causal interventions, then the experimental group -- but not the control group -- should perceive that X might cause Y to move (but not vice-versa). And indeed, this was the case.

Children are also good at detecting interventions that are obviously confounded, overriding prior knowledge, and taking base rate into account (at least somewhat). As I said, this is a huge (and exciting) literature, and understanding people's natural propensities and abilities to do causal reasoning might even help us address the knotty philosophical problems of what a cause is in the first place.

Posted by Amy Perfors at 11:00 PM

October 6, 2006

Causation and Manipulation, V

Jim Greiner

Fair warning: This entry includes a plug for one of my papers

Anti-discrimination laws require lawyers to figure out the causal effect of race (gender, ethnicity) on certain decision making. Previous posts have been exploring the often-tossed-around idea of considering the treatment to be perceived race, as opposed to "actual" (whatever that means) or self-identified race, to answer the no-causation-without-manipulation objection. This feels like a good idea, but it really only works in some cases and not others. It works when we can identify a specific actor (or an institution) whose behavior we want to study. Capital sentencing juries and a defendant firm in an employment discrimination lawsuit are two that work. We can think about changing these specific actors' perceptions of particular units (capital defendants, potential employees), and we can think about WHEN it makes sense to think of treatment (the perception) as being applied: at the moment the actor first perceives the unit's race (or gender or whatever). In contrast, "the public" or "the set of all employers in the United States" are two examples of actors that don't work. The timing of treatment assignment no longer makes sense, the counterfactuals are too hard to imagine, and the usual non-interference-among-units assumption becomes hard to think about.

What does all this buy us? A fair amount. First, this line of thinking identifies cases in which rigorous causal inference based on the potential outcomes framework remains beyond our reach. Figuring out the causal effect of gender on salaries nationwide is one example; another is the causal effect of candidate race on election outcomes. Second, in those cases in which we can identify a specific actor, we get a coherent conceptualization of the timing of treatment assignment, which allows us to distinguish pre- from post-treatment variables. This is a big deal. Entire lawsuits sometimes turn on it.

All this has important implications for civil rights litigation, as I discuss in my paper, "Causal Inference in Civil Rights Litigation." You can get a draft (pdf) of this paper from my website, which you can access by clicking on my name to the left. I'd appreciate any reader reactions/suggestions.

Posted by James Greiner at 10:19 PM

October 5, 2006

Causation and Manipulation IV: Conditional Effects

Mike Kellermann

People who read this blog regularly know that few things get authors and commentators as worked up as questions about causal inference, either philosophical (here, here, and here) or technical (here, here, here, etc.). I wouldn't want to miss out on the fun this time around -- and how could I pass up the opportunity to have the IV post on causation and manipulation?

Jens and Felix have both discussed whether non-manipulable characteristics such as race or gender ("attributes" for Holland) can be considered causes within the potential outcomes framework. I agree with them that, at least as far as Holland is concerned, the answer is (almost always) no - no causation without manipulation. The fact that we are having this discussion 20 years later suggests (to me, at least) that this answer is intuitively unsatisfying. It is worth remembering a comment made by Clark Glymour in his discussion of the Holland (1986) article:

People talk as they will, and if they talk in a way that does not fit some piece of philosophical analysis and seem to understand each other well enough when they do, then there is something going on that the analysis has not caught.

Identifying perceptions of an attribute (rather than the attribute itself) as the factor subject to manipulation makes a lot of sense in situations where the potential outcomes are to a certain degree out of the control of the individual possessing the attribute, as in the discrimination example. Extending this idea to situations in which outcomes are generated by the subject possessing the attribute (in which "self-perceptions" would be manipulated) would commit researchers to a very particular understanding of attributes such as race and gender that would hardly be uncontroversial.

In these cases, I think that it makes more sense to look at the differences in well-specified Rubin-Holland causal effects (i.e. the results of manipulation) conditional on values of the attribute rather than identifying a causal effect as such. So, for example, in the gender discrimination example we could think of the manipulation as either applying or not applying for a particular job. This is clearly something that we could randomize, so the causal effect would be well defined. We could calculate the average treatment effect separately for men and women and compare those two quantities, giving us the difference in conditional causal effects. I'm sure that there is a catchy name for this difference out there in the literature, but I haven't run across it.
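To pin down the quantity I have in mind, here is a minimal simulation sketch. Everything in it is hypothetical: the outcome scale, the "true" effects for each group, and the function names are assumptions made only for illustration. The manipulation is randomized within the whole sample, the average treatment effect is estimated separately for each value of the attribute by a difference in means, and the two estimates are then compared.

```python
import random
from statistics import mean

def simulate(n=4000, effect_by_group=None, seed=1):
    """Simulate a randomized manipulation whose effect differs by attribute."""
    if effect_by_group is None:
        effect_by_group = {"men": 1.0, "women": 0.4}  # hypothetical true effects
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.choice(["men", "women"])   # attribute, not manipulated
        treated = rng.random() < 0.5           # manipulation, randomized
        effect = effect_by_group[group] if treated else 0.0
        outcome = effect + rng.gauss(0.0, 1.0)
        rows.append((group, treated, outcome))
    return rows

def conditional_ate(rows, group):
    """Difference in mean outcomes, treated minus control, within one group."""
    treated = [y for g, t, y in rows if g == group and t]
    control = [y for g, t, y in rows if g == group and not t]
    return mean(treated) - mean(control)

def difference_in_conditional_effects(rows, a="men", b="women"):
    """Conditional ATE for group a minus conditional ATE for group b."""
    return conditional_ate(rows, a) - conditional_ate(rows, b)

if __name__ == "__main__":
    rows = simulate()
    print("ATE | men:  ", round(conditional_ate(rows, "men"), 3))
    print("ATE | women:", round(conditional_ate(rows, "women"), 3))
    print("difference: ", round(difference_in_conditional_effects(rows), 3))
```

Each conditional effect is a well-defined Rubin-Holland effect because the manipulation was randomized. The difference between them is a descriptive comparison across the attribute, not itself a causal effect of the attribute.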

So, is this quantity (the difference in conditional causal effects) of interest to applied researchers in the social sciences? I would argue that it is, if for nothing else than giving us a more nuanced view of the consequences of something that we can manipulate. Is it a Rubin-Holland causal effect? No, but that is a problem only to the extent that we privilege "causal" over other useful forms of inference.

Posted by Mike Kellermann at 11:00 PM

October 4, 2006

Causation and Manipulation III: Let’s Be Specific

Felix Elwert

Two recent posts by Jim and Jens ponder the holy grail of manipulability via the exchange between Holland and Heckman. Can non-manipulable things like gender or race cause things in the potential outcomes framework?

Holland (1986) says no because it’s hard to conceive of changing the unchangeable. Fair enough. But this argument has been carried too far in some quarters and not far enough in others. Here’s why:

Invoking Holland, some population scientists now go so far as to claim that we can’t conceive of things like marriage or divorce as causes because the decision to marry or divorce is beyond the direct control of an experimenter. Please. At most we need some exogeneity, a little speck of indifference, a tipping point to make them amenable to coherent causal thinking (and estimation). Heckman goes even farther than this, and he is right: the issue is not whether I, personally, can wreck all marriages in my study, but whether we can coherently conceive of a counterfactual world where things are different as a matter of theoretical speculation ("mental act"). In this, however, even Heckman seems to yield: A minimum requirement for thinking about counterfactual worlds would appear to be the possibility of conceiving of these worlds in a coherent fashion. And this, I believe, is the underlying unease of the statisticians whom Heckman criticizes: whether one can even coherently imagine counterfactual worlds in which gender is changed.

On the other hand, social scientists love to talk about the effects of gender and race, which – pace Michael Jackson and Deirdre McCloskey – are really hard to think of as manipulable, ceteris paribus. What Holland’s dictum contributes in this respect is the entirely appropriate call for getting the question straight.* For what most of these studies look for is evidence of discrimination. Thinking about discrimination within the potential outcomes framework makes it clear that the issue really isn’t whether we can manipulate the race or gender of a specific person, but rather whether we can manipulate the perception of the person’s race or gender in the eyes of the discriminator. Cases in point: Goldin and Rouse’s study on discrimination in symphony orchestras, where the gender of applicants was obscured (i.e. perceptions manipulated) by staging auditions behind an opaque gauze barrier. Similarly, Grogger and Ridgeway’s paper in the latest issue of JASA uses natural variation in the perceptibility of driver’s skin color (dusk, the veil of darkness) to test for racial profiling in traffic controls. In either case, the causal question was not, what would happen if we changed the musician/driver from female/black to male/white, but, what would happen if we could change knowledge/perception of race and gender?

In other words, there are important causal questions to be asked about race and gender, but these questions don’t necessarily require the manipulability of race and gender. Not even within the potential outcomes framework of causality.

* My pet peeve: Much of social science is so busy providing answers that it forgets to ask well-formulated questions.

Posted by Felix Elwert at 11:00 PM

October 3, 2006

Causation and Manipulation II: The Causal Effect of Gender?

Jens Hainmueller

In a recent post, Jim Greiner asked whether we adhere to the principle of "no causation without manipulation." This principle, if true, raises the question of whether it makes sense to talk about the causal effect of gender.

The Rubin/Holland position on this is clear: it makes no sense to talk about the causal effect of gender because what manipulation and thus what counterfactual one has in mind (a sex-transformation surgery?) is clearly ill-defined. One can ask related questions like sending resumes to employers randomizing female and male names and see whether one gender is more likely to be invited to a job interview, but it makes no sense to think about a causal effect of gender per se.

The contrasting view is presented by one of their main foils, James Heckman, who writes in a recent paper (Andrew Gelman also had a blog post on this): "Holland claims that there can be no causal effect of gender on earnings. Why? Because we cannot randomly assign gender. This confused statement conflates the act of definition of the causal effect (a purely mental act) with empirical difficulties in estimating it. This type of reasoning is prevalent in statistics. As another example of the same point, Rubin (1978, p. 39) denies that it is possible to define a causal effect of sex on intelligence because a randomization cannot in principle be performed. In this and many other passages in the statistics literature, a causal effect is defined by a randomization. Issues of definition and identification are confused. [...] the act of definition is logically separate from the acts of identification and inference." Heckman sees this as a "view among statisticians that gives rise to the myth that causality can only be determined by randomization, and that glorifies randomization as the ‘‘gold standard’’ of causal inference."

So what do you make of this? Does it make sense to think about a causal effect of gender or not? Does it make sense to try to estimate it, i.e. interpret a gender gap in wages as causal (balance on all confounders except gender)? How about the causal effect of race, etc.? Just to be precise here, notice that Rubin/Holland admit that "even though it may not make much sense to talk about the 'causal' effect of a person being a white student versus being a black student, it can be interesting to compare whites and blacks with similar background characteristics to see if there are differences" in some outcome of interest.

Posted by Jens Hainmueller at 10:00 PM

October 2, 2006

Applied Statistics – Subharup Guha & Louise Ryan

This week the Applied Statistics Workshop will present a talk by Subharup Guha, Post-Doctoral Research Fellow in the Harvard School of Public Health Department of Biostatistics, and Louise Ryan, Henry Pickering Walcott Professor of Biostatistics in the Harvard School of Public Health and Department of Biostatistical Science at the Dana-Farber Cancer Institute.

Before coming to Harvard, Dr. Guha received his Ph.D. in Statistics at Ohio State University. Dr. Guha’s publications appear in Environmental and Ecological Statistics, Journal of the American Statistical Association, Journal of Computational and Graphical Statistics and the Journal of the Royal Statistical Society. His research interests include Bayesian modeling, computational biology, MCMC simulation, Semiparametric Bayesian methods, Spatio-temporal models and survival analysis.

Professor Ryan earned her Ph.D. in Statistics from Harvard University, and has been a member of the Department of Biostatistics since then. She has received numerous honors and distinctions during that time including the Spiegelman Award from the American Public Health Association, and was named Mosteller Statistician of the Year. She has published extensively in Biometrics, Journal of the American Statistical Association, Journal of Clinical Oncology, and the New England Journal of Medicine. Her research interests focus on statistical methods related to environmental risk assessment for cancer, developmental and reproductive toxicity and other non-cancer endpoints such as respiratory disease, with a special interest in the analysis of multiple outcomes as they occur in these applied settings.

Dr. Guha and Professor Ryan will present a talk entitled "Gauss-Seidel Estimation of Generalized Linear Mixed Models with Application to Poisson Modeling of Spatially Varying Disease Rates." The paper that accompanies the talk is available from the course website. The presentation will be at noon on Wednesday, October 4th, in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Eleanor Neff Powell at 12:02 PM

It Takes Two (Non-Motown Version)

The New York Times recently published an obituary for David Lykken, who was a pioneer of twin studies. His “Minnesota Twin Studies” suggested the importance of genetic factors in life outcomes. But his work with twins also spurred empirical research in many fields, not just genetics – and for good reason.

The idea of using twins for social science studies is very appealing: some twins are genetically identical, and also grow up in the same family and environment. So from a statistical perspective, comparing outcomes such as earnings between pairs of twins is like having a “perfect match." This idea made the rounds in many fields, such as labor economics. By using the argument that all unobserved characteristics (e.g. “genetic ability”) should be equal and can thus be differenced away, twin studies were used to estimate the returns to education – the effect of education on wages.

Alas, there are potential problems with using twin data. For example, measurement error in a difference estimation can lead to severe attenuation bias precisely because twins are so similar. If there is little variation in educational attainment, even small measurement errors can strongly affect the estimate. Researchers have been ingenious about this (e.g. by instrumenting one person’s education with the level that her twin reported, as in Ashenfelter and Krueger). While this may reduce the attenuation bias it can magnify the omitted variables bias which motivated the use of twins in the first place. Because there are only small differences in schooling, small unobserved differences in ability can lead to a large bias. The culprits can be details such as differences in birth weight (Rosenzweig and Wolpin have a great discussion of such factors). In addition, twins who participate in such studies are a selected group: they are getting along well enough to participate, and many of them get recruited at “twin events.” But not all twins party in Twinsburg, Ohio.
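The attenuation point can be illustrated with a small simulation. This is a sketch under assumptions I am adding for clarity: schooling has a large component shared within a twin pair and a small individual component, reported schooling equals true schooling plus independent noise, and unobserved ability is shared by the twins and unrelated to schooling (so that measurement error is the only source of bias). All parameter values and names are hypothetical.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_twins(n_pairs=5000, beta=0.10, seed=2):
    """Return (levels slope, within-pair difference slope) for simulated twins."""
    rng = random.Random(seed)
    level_x, level_y, diff_x, diff_y = [], [], [], []
    for _ in range(n_pairs):
        family_school = rng.gauss(12.0, 2.0)  # schooling component shared by twins
        ability = rng.gauss(0.0, 0.3)         # shared, unobserved, differenced away
        reported, wages = [], []
        for _twin in range(2):
            true_school = family_school + rng.gauss(0.0, 0.5)
            wages.append(beta * true_school + ability + rng.gauss(0.0, 0.3))
            reported.append(true_school + rng.gauss(0.0, 0.5))  # misreporting
        # Levels regression uses the first twin of each pair only.
        level_x.append(reported[0])
        level_y.append(wages[0])
        # Difference regression uses within-pair differences.
        diff_x.append(reported[0] - reported[1])
        diff_y.append(wages[0] - wages[1])
    return ols_slope(level_x, level_y), ols_slope(diff_x, diff_y)

if __name__ == "__main__":
    levels, diffs = simulate_twins()
    print("true return to schooling: 0.10")
    print("levels estimate:         ", round(levels, 3))
    print("twin-difference estimate:", round(diffs, 3))
```

With these variances, the share of the variation in reported schooling that is real is 4.25/4.5 in levels but only 0.5/1.0 in within-pair differences, so the differenced estimate is pulled roughly halfway toward zero even though the measurement error itself is modest.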

Of course none of this is to belittle the contribution of Dr Lykken, who besides helping to start this flurry of work was also a major contributor to happiness research.

Posted by Sebastian Bauhoff at 1:10 AM

September 29, 2006

Political Statistics Blogs

Mike Kellermann

With the 2006 election coming up soon, here are a couple of blogs that might appeal to both the political junkie and the methods geek in all of us. Political Arithmetik, a blog by Charles Franklin from Wisconsin, is full of cool graphs that illustrate the power of simple visualization and non-parametric techniques, something that we spend a lot of time talking about in the introductory methods courses in the Gov Department. (On a side note, I think that the plots like this of presidential approval poll results that you find on his site and others have to be one of the best tools for illustrating sampling variability to students who are new to statistics.) Professor Franklin also contributes to another good polling blog, Mystery Pollster, run by pollster Mark Blumenthal. It just moved to a new site, which now has lots of state-level polling data for upcoming races. All in all, plenty of good stuff to distract you from the "serious" work of arguing about causal inference, etc.

Posted by Mike Kellermann at 11:00 PM

September 28, 2006

Causation and Manipulation

Jim Greiner

In a 1986 JASA article, Paul Holland reported that he and Don Rubin had once made up the motto, “NO CAUSATION WITHOUT MANIPULATION.” The idea is that even in an observational study, causal inference cannot proceed unless and until the quantitative analyst identifies an intervention that hypothetically could be implemented (although Professor Holland accepts the idea that the manipulation may never be carried out for physical or ethical reasons). The idea of studying the causal effect of things that we as human beings could never influence is incoherent because such things could never be the subject of a randomized experiment.

My question: do we really adhere to this principle? Take the one causal link established via observational studies that pretty much everyone (even Professor Freedman, see below) agrees on: smoking causes lung cancer. Has anyone ever bothered to imagine what manipulation to make people smoke is contemplated? Aren’t we pretty sure it wouldn’t matter how we intervened, i.e., however it happens that people smoke, those who smoke get lung cancer at a higher rate? (It might matter what they smoke, how much they smoke, perhaps even where and when, but what got them started and what keeps them at it?) If folks agree with me on this, what’s left of Professor Holland’s maxim?

Paul W. Holland, Statistics and Causal Inference, 81 J. Am. Stat. Ass’n 945, 959 (1986)

David Freedman, From Association to Causation: Some Remarks on the History of Statistics, 14 Stat. Sci. 243, 253 (1999)

Posted by James Greiner at 11:00 PM

September 27, 2006

Mind the Coding

Here's something new to pick at, in addition to methods problems: coding issues. A recent Science (August 18, 2006, pages 979-982) article by Bruce Dohrenwend and colleagues reported on revised estimates of post-traumatic stress disorders of Vietnam veterans. See here for an NYT article. The new study indicates that some 18.7% of Vietnam veterans developed diagnosable post-traumatic stress, compared with earlier estimates of 30.9%. The difference comes mainly from using revised measures of diagnosis and exposure to combat for a subset of the individuals covered in the original data source, the 1988 National Vietnam Veterans' Readjustment Study (NVVRS). The authors added military records to come up with the new measures.

Given the political and financial importance (the military has a budget for mental health), this is quite a difference. One critical issue pointed out by the Science article is that the original study did not adequately control for veterans who had been diagnosed with mental health problems before being sent to combat. Just looking at the overall rates after combat is not a great study design. But this also makes me wonder about how the data was collected in the first place. Maybe the most disabled veterans didn’t reply to the survey, or were in such a state of illness that they couldn’t (or had died of related illnesses). The NVVRS is supposedly representative but this would be an interesting point to examine.

This article also illustrates how important the data, measures and codings are in social science research these days. It seems that taking these issues more seriously should be part of the academic and policy process just like replication should be (see here and here for some discussion of this issue). While study and sample design are under much scrutiny these days, there are still few discussions about the sensitivity to coding and data. Given the difference they can make, this should change.

Posted by Sebastian Bauhoff at 11:00 PM

September 26, 2006

Publication bias, really?!?

Amy Perfors

I'm a little late into the game with this, but it's interesting enough that I'll post anyway. Several folks have commented on this paper by Gerber and Malhotra (which they linked to) about publication bias in political science. G&M looked at how many articles were published with significant (p<0.05) vs. non-significant results, and found -- not surprisingly -- that there were more papers with significant results than would be predicted by chance; and, secondly, that many of the significant results were suspiciously close to 0.05.

I guess this is indeed "publication bias" in the sense of "there is something causing articles with different statistical significance to be published differentially." But I just can't see this as something to be worried about. Why?

Well, first of all, there's plenty of good reason to be wary of publishing null results. I can't speak for political science, but in psychology, a result can be non-significant for many many more boring reasons than that there is genuinely no effect. (And I can't imagine why this would be different in poli sci). For instance, suppose you want to prove that there is no relation between 12-month-olds' abilities in task A and task B. It's not sufficient to show a null result. Maybe your sample size wasn't large enough. Maybe you're not actually succeeding in measuring their abilities in either or both of the tasks (this is notoriously difficult with babies, but it's no picnic with adults either). Maybe A and B are related, but the relation is mediated by some other factor that you happen to have controlled for. etcetera. Now, this is not to say that no null results are meaningful or that null results should never be published, but a researcher -- quite rightly -- needs to do a lot more work to make it pass the smell test. And so it's a good thing, not a bad thing, that there are fewer null results published.

Secondly, I'm not even worried about the large number of studies that are just over significance. Maybe I'm young and naive, but I think it's probably less an indication of fudging data than a reflection of (quite reasonable) resource allocation. Take those same 12-month-old babies. If I get significant results with N=12, then I'm not going to run more babies in order to get more significant results. Since, rightly or wrongly, the gold standard is the p<0.05 value (which is another debate entirely), it makes little sense to waste time and other resources running superfluous subjects. Similarly, if I've run, say, 16 babies and my result is almost p<0.05, I'm not going to stop; I'll run 4 more. Obviously there is an upper limit on the number of subjects, but -- given the essential arbitrariness of the 0.05 value -- I can't see this as a bad thing either.

Posted by Amy Perfors at 11:00 PM

Applied Statistics – Ben Hansen

This week the Applied Statistics Workshop will present a talk by Ben Hansen, Assistant Professor of Statistics at the University of Michigan. Professor Hansen graduated from Harvard College, magna cum laude, with a degree in Mathematics and Philosophy. He went on to win a Fulbright Fellowship to study philosophy at the University of Oslo, Norway, after which he earned his Ph.D. in Logic and Methodology of Science at the University of California, Berkeley.

Professor Hansen’s primary research interests involve causal inference in comparative studies, particularly observational studies in the social sciences. His publications appear in the Journal of Computational and Graphical Statistics, Bernoulli, Journal of the American Statistical Association, and Statistics and Probability Letters. He is currently working on providing methods for statistical adjustment that enable researchers to mount focused, specific analogies of their observational studies to randomized experiments.

Professor Hansen will present a talk entitled "Covariate balance in simple, stratified and clustered comparative studies." The working paper that accompanies the talk is available from the course website. The presentation will be at noon on Wednesday, September 27, in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

If you missed the workshop’s first meeting, you should check out the abstract of Jake Bowers’ talk, “Fixing Broken Experiments: A Proposal to Bolster the Case for Ignorability Using Subclassification and Full Matching”.

Posted by Eleanor Neff Powell at 4:34 PM

September 25, 2006

Freeloading: Economics meets Poly Sci, imitates Art

In the next few weeks, the number of articles posted to this site is set to increase, partly because school's back in session, and partly because we've recruited some new authors for the committee. This is a good thing in general. However, I know I work best on a deadline, so it happens that I tend to post when the flow is slower, and less when a lot of articles are being posted by the other authors.

To bring this back to the realm of science: Am I taking the position of an economic free rider (or "freeloader", if you prefer), if I tend to post less frequently than other authors, or is someone in my position merely acting as a balancing actor, keeping stability?

As for the "art", I doubt that this observation is opera-worthy, but it does tend to happen a lot in social situations I've seen. It certainly came up in an early episode of Seinfeld, where George wanted to split a cab but not have to pay for it because they "were going that way anyway".

Posted by Andrew C. Thomas at 2:00 PM

September 19, 2006

Dirichlet Simplex Exploration: Or, My Prayer Answered

Andrew Fernandes, a fellow Canadian expat and PhD student at NC State, responded to my earlier request for advice on exploring a Dirichlet-type simplex.

Among other places, the idea is presented in the Wikipedia entry for Simplex. He suggests perturbing the cumulative sums, then putting the perturbed sums back in order to draw a time-reversible proposal. This has the advantage of not sending too many parameters below zero - a maximum of one - as opposed to an equal perturbation of each parameter, and not pinning a high-valued parameter in place with a standard Dirichlet proposal.
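One possible reading of this suggestion, sketched in Python. Everything specific here is an assumption rather than something taken from the original advice: the Gaussian jitter, the `scale` parameter, the function name, and the choice to return `None` (a rejected move) whenever a perturbed sum leaves [0, 1].

```python
import random

def cumulative_sum_proposal(p, scale=0.05, rng=random):
    """Propose a new point on the simplex by jittering the cumulative sums of p.

    Returns None when a perturbed sum falls outside [0, 1]; the caller would
    treat that as a rejected move.
    """
    # Interior cumulative sums; the final one is always 1 and stays fixed.
    cuts, running = [], 0.0
    for value in p[:-1]:
        running += value
        cuts.append(running)
    if not cuts:
        return list(p)
    # Symmetric noise, then put the perturbed sums back in order.
    jittered = sorted(c + rng.gauss(0.0, scale) for c in cuts)
    if jittered[0] < 0.0 or jittered[-1] > 1.0:
        return None
    # Differences of the ordered cut points are the new, non-negative shares.
    edges = [0.0] + jittered + [1.0]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]

random.seed(1)
current = [0.7, 0.2, 0.05, 0.05]
print(cumulative_sum_proposal(current))
```

Because the sums are re-sorted, every interior share stays non-negative; only the first or last share can be pushed out of range. With symmetric noise the move from one point to another should be as likely as the reverse move, which is presumably what makes the proposal usable in a plain Metropolis step.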

Posted by Andrew C. Thomas at 11:32 PM

September 15, 2006

What are your thoughts?

Amy Perfors

Ah, the beginning of fall term -- bringing with it the first anniversary of this blog (yay!), a return to our daily posting schedule (starting soon), and a question for you, our readers:

Do you have any feedback for us? Specifically, are there topics, issues, or themes you would like us to cover more (or less) than we do? Would you like to see more discussion of specific content and papers? More posts on higher-level, recurring issues in each of our fields (or across fields)? More musings about teaching, academia, or the sociology of science? Obviously the main factor in what we write about comes down to our whims and interests, but it's always nice to write things that people actually want to read.

In my specific case, I know that I try not to blog about many cognitive science and psychology topics that I think about if they aren't directly related to statistics or statistical methods in some way: I fear that it wouldn't be of interest to readers who come here for a blog about "Social Science Statistics". However, maybe I've been needlessly restrictive...?

So, readers, what are your opinions?

Posted by Amy Perfors at 11:21 AM

September 10, 2006

The Tenth Dimension

The semester is about to start, which means it is math camp time at the Government Department. The very first topic is usually an introduction to dimensions, starting from R1 (lines), to R2 (planes), to R3 (3D planes), to R4 (3D plane plus time). Here is a nice flash animation (click on “imagining ten dimensions” on the left) that takes you a step further, from zero to ten dimensions in less than 5 minutes (including cool visual and acoustic effects). It doesn’t necessarily become more graspable as you ascend ... :-)

Posted by Jens Hainmueller at 8:26 AM

August 7, 2006

In Which Drew Suggests that Scientists Avoid The Word 'Regression'

I've spent quite a bit of time in the last few weeks - probably too much - thinking about the term 'regression' and its use in statistics, and why I find it so dislikeable. I sincerely doubt any campaign I try to start will have any real effect, so let me lay down the reasons why I feel we as scientists should refer to linear modelling as just such, and not as 'regression'.

One reason is that the word only has a tenuous connection to the actual algorithm - the other is that it far too often implies a causal relationship where none exists.

As the story goes, Francis Galton took a group of tall men and measured the height of their sons, and found that on average, the sons as a group were shorter than their fathers. Drawing on similar work he had done with pea plants, he described this phenomenon as "regression to the mean," recognizing that the sample of fathers was nonrandom. A "regression coefficient" then described the estimated parameter which, when multiplied by the predictor, would produce the mean value.

I can only surmise that "determining regression coefficients through minimizing the least squares difference" was too verbose for Galton and his buddies, and "regression analysis" stuck. Now we have lawyerese terms like "multiple regression analysis," which really should read "multiple parameter regression analysis" since we're only running one algorithm, but we appear stuck with it.

So what's the big deal? Nomenclature isn't an easy business, and two extra syllables in "linear model" might slow things down. But aside from my gripe with using "regress" as a transitive verb (the Latin student in me cringing), even the most generous interpretation of the word's root, and the experiments that revealed it, lead to trouble.

"Regression" literally means "the act of going back." If we accept this definition in this context, we have to have something to which we can return. Clearly, this implies discovering the mean - but chronologically, it can only mean discovering the cause, that which came before.

Linear modelling makes no explicit assumptions about cause and effect, a major source of headache in our discipline, but the word itself, consciously or otherwise, binds us to this fact.

The remedy to this is not simple; after all, I'm talking about trying to break the correlation-is-causation fallacy through words, which is both a difficult task and the sort of behaviour that will keep people from sitting with you at lunch. But we can improve things slowly and subtly in this fashion:

1) If you are confident that your analysis will unveil a causal relationship, say so. Call it "regression-to-cause", or "causal linear model", or something like that.

2) If you're not so sure, call it a (generalized) linear model, or a lin-mod, or a least-squares, or another term that does not necessarily imply causation. Resist the temptation to fall back to the word "regression" until a long time has passed.

This doesn't have to be a completely nerve-wracking exercise; just use a strike-through when necessary, to show that the term [regression, struck through] 'linear model' is better suited to describe what we're trying to build here.

Posted by Andrew C. Thomas at 11:30 PM

July 30, 2006

C. Frederick Mosteller, 1916-2006

C. Frederick Mosteller, the first chairman of the Statistics Department at Harvard, passed away last week at the age of 89. He served as chair of the Statistics Department from 1957 to 1969, and later chaired the departments of Biostatistics and Health Policy and Management at the Harvard School of Public Health. His obituary in the New York Times mentions his work reviewing the performance of pollsters in the Dewey-Truman election of 1948 and his explanation of the Red Sox's inexplicable loss in the 1946 World Series ("There should be no confusion here between the 'winning team' and the 'better team'"), but doesn't say that he took a leave of absence in the early sixties to record a lecture series for NBC. According to one history of the Statistics Department, 75,000 students took the course for credit and 1.20 million (give or take) watched the lectures on television. Imagine doing that today....

Posted by Mike Kellermann at 8:55 PM

July 1, 2006


A letter I wrote in reaction to the Texas decision made it into today's New York Times. It even has a nice little plug for IQSS at the bottom.

Posted by Andrew C. Thomas at 2:52 PM

June 28, 2006

News In Texas Redistricting

The noted Texas redistricting case, known politically for its role involving Tom DeLay and academically for the amici curiae brief filed by Gary King, Andrew Gelman, Jonathan Katz and Bernard Grofman, was ruled on by the Supreme Court today. In short, the party-based gerrymandering was not a problem - nor was the fact that it was done off the traditional calendar - but the composition of districts involving the dilution of Hispanic voters was. The court has ordered that those irregular districts be redrawn. (Note: only the composition of District 23 was considered to be in violation of the Voting Rights Act, but you obviously can't redraw one district without affecting another.)

The nature of this ruling should surprise no one involved in Jim Greiner's Quantitative Social Science and Expert Witnesses class.

A good summary is here.

Posted by Andrew C. Thomas at 11:23 AM

June 18, 2006

Statz Rap

Amy Perfors

A friend emailed this to me; apparently the teaching assistants at the University of Oregon have creative as well as statistical talents. It's pretty funny. Perhaps every intro to statistics class could begin with a showing... video here

Posted by Amy Perfors at 4:21 PM

May 25, 2006

Dirichlet Spaces and Metropolis Traces

Drew Thomas

A problem I've had come up again and again is how to explore a space bound by a Dirichlet prior with a Metropolis-type algorithm. I've yet to find a satisfactory answer and I'm hoping someone else will have some insight.

The research question I have deals with allocating patients to hospitals, considering the effect of the number of beds - one example of the "supply-induced demand" question. (The analysis is being done under Prof. Erol Pekoz, who's visiting Harvard Stats this year.) Conjugate priors for this problem have eluded me, and so the quantity of interest, the probability that a patient will be sent to a particular hospital for inpatient care, is being inferred through a Metropolis algorithm.

Here's the thing: there are at most 64 different hospitals to which a patient can be assigned. Even after assuming that if a hospital has not yet received a patient from a particular area they won't ever, the number of hospitals is extreme.

One suggested proposal has been a Dirichlet distribution with parameters equal to the current values, times a constant. That way the expected value of the proposal will be the same as the last draw. However, when the number is too low, the smallest dimensions will have parameter value less than 1, which leads to trouble, as the value will tend to zero; when it's too high, the biggest parameters don't move at all, and the effect of moving some of its mass is lost.
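A minimal Python sketch of the proposal just described, using normalized gamma draws from the standard library. The function name, the three-component example and the two concentration values are illustrative assumptions, not part of the hospital analysis.

```python
import random

def dirichlet_proposal(current, concentration, rng=random):
    """Draw from Dirichlet(concentration * current) via normalized gamma draws."""
    draws = [rng.gammavariate(concentration * p, 1.0) for p in current]
    total = sum(draws)
    return [d / total for d in draws]

random.seed(2006)
current = [0.90, 0.09, 0.01]

# Low constant: parameters are 4.5, 0.45 and 0.05, so the small components
# have parameters below 1 and their proposed values pile up near zero.
print(dirichlet_proposal(current, 5))

# High constant: parameters are 4500, 450 and 50, so the largest component
# barely moves away from 0.90.
print(dirichlet_proposal(current, 5000))
```

Whatever the constant, the proposal is centered on the current values, since the mean of a Dirichlet is its parameter vector divided by its sum. This proposal is not symmetric, so a Metropolis-Hastings acceptance ratio would also need the forward and reverse proposal densities.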

I've considered implementing a parallel-tempering method but I'd like to keep it cleaner. Does anyone have a better method that's reasonably quick to run, rather than monkeying with each parameter individually?

Posted by Andrew C. Thomas at 6:00 AM

May 23, 2006

Inheritance Laws

Jason Anastasopoulos, guest blogger

Question: Many political philosophers who focused on questions of property (including Plato) believed that equality of conditions was necessary for the development of a virtuous citizenry and virtuous leaders. The key to creating this equality of conditions, they argued, was the implementation of strict inheritance laws limiting the transfer of wealth from one generation to the next. Does anyone know of any quantitative models or empirical studies that examine the interaction between social stratification and inheritance laws? If you do, email me at

Posted by James Greiner at 6:00 AM

May 20, 2006

It's summer!

It's the end of the term for both Harvard and MIT... so in view of the fact that we on the authors committee are about to embark on summers of tireless dedication to research while scattered to the far reaches of the planet, posting to this blog will be reduced until fall.

A special thanks to the loyal readers and commenters of this blog -- you folks have made this year a really rewarding experience for us. We won't stop posting, so we do hope you still stop by occasionally and are still with us when we resume on a full schedule at the end of the summer.

Posted by Amy Perfors at 2:09 PM

May 18, 2006

Reactions To The Virginity Pledge Study

Drew Thomas

Harvard School of Public Health doctoral candidate Janet Rosenbaum has been in the news lately, following the publication of her study of virginity pledges in the American Journal of Public Health, as well as her recent IQSS seminar. (Full disclosure: Janet is a friend of mine. I'll address her as Ms. Rosenbaum for this entry.) Since it's certainly a hot topic, it's no surprise how much attention her findings have received; first, the big news agencies picked it up, then the blogosphere took their shift - mainly over the "controversy" resulting from the study. (See for an example.)

But I think the more relevant part of the whole debate is the point Ms. Rosenbaum was trying to make about surveys and self-reporting: we use these data to make broad, sweeping conclusions on social phenomena, and while they are the best we have, they aren't up to the best standard we could achieve.

Posted by Andrew C. Thomas at 6:04 AM

May 16, 2006

Communication, Anyone?

Jim Greiner

The course I co-taught this semester on Quantitative Social Science & Law has come to an end. There were a lot of “lessons learned” in the class, both for the students (at least, I hope so) and for the teaching staff (more definitely). Of all of these lessons, one sticks in my head: we ought to focus on teaching quantitative students how to communicate with folks without formal statistical training.

Some quantitative folks will graduate and spend the rest of their lives talking to and working with only quantitative people. Some, but not many. Most of us will be talking and working with people who have few or no statistics classes under their belts. But do we ever teach the communication skills needed to function effectively with the proles? I’ve never seen or heard of a class that focuses on these skills. Not one. Does that strike anyone besides me as odd?

Posted by James Greiner at 6:00 AM

May 15, 2006

A bit of google frivolity

Amy Perfors

Google has just come out with a new tool, Google Trends, which compares the frequencies of different web searches and thus provides hours of entertainment to language and statistics geeks like myself. In honor of that -- and, okay, because it's nearing the end of the term and I'm just in the mood -- here's a rather frivolous post dedicated to the tireless folks at Google, for entertaining me today.

Some observations:

1) One thing that is interesting (though in hindsight not surprising) is that Google Trends seems like a decent tool for identifying how marked a form is. The basic idea is that a default term is unmarked (and often unsaid), but the marked term must be used in order to communicate that concept. For instance, in many sociological domains, "female" is marked more than "male" is -- hence people refer to "female Presidents" a lot more than they refer to "male Presidents", even though there are many more of the latter: the adjective "male" is unnecessary because it just feels redundant. In contrast, you much more often say "male nurse" than "female nurse", because masculinity is marked in the nursing context.

Anyway, I noticed that for many sets of words, the term that is searched for most often is the marked term, even though the unmarked term probably occurs numerically more often. For instance, Blacks, whites indicates far more queries for "blacks"; Gay, straight many more for "gay"; and Rich, poor, middle class the most for rich, followed by poor, and least of all middle class.

I have two hypotheses to explain this: (a) people generally google for information, and seek information about what they don't know; so it's not surprising that more people don't know about the non-default, usually numerically smaller item. And, (b) since unmarked means it doesn't need to be used, it's not really a surprise that people don't use it. Still, I thought it was interesting. And clearly this phenomenon, if real at all, is at most only one of many factors affecting query frequency: for instance, Christian, atheist, muslim indicates far more hits for "Christian", and those in very Christian areas.

2) Another observation: the first five numbers seem to have search frequencies that drop by half with each consecutive number. Is this interesting for cognitive reasons? I have no idea.

3) As far as I can tell, no search occurs more often than "sex." If anyone can find something with greater frequency, I'd love to hear it. On the one hand, it may say good things for our species that "love" beats out "hate", but that may just mean more people are searching for love than hate. And "war" beats out "peace", sadly enough.

4) "Hate bush" peaked right before the 2004 election, "love bush" about six months before that. I have no idea what that's all about.

5) It's amazing to me how many people clearly must use incredibly unspecific searches: who searches for "one"? Or "book"? Though there is no indication of numbers (a y axis on these graphs would be incredibly handy), a search needs a minimum number of queries otherwise it won't show up, so somebody must be making these.

6) In conclusion, I note that Harvard has more queries than MIT. Does this mean that MIT is the "default"? Or that Harvard generates more interest? Since I'm an MIT student but writing for a Harvard blog, I plead conflict of interest...

Posted by Amy Perfors at 6:00 AM

May 12, 2006

Statistical Discrimination in Health Care

Sebastian Bauhoff

This blog has frequently written about testing for discrimination (see for example here, here, and here). This is also a hot issue in health care. In health care there is a case for 'rational' discrimination, where physicians respond to clinical uncertainty by relying on priors about the prevalence of diseases across racial groups (for example).

A paper by Balsa, McGuire and Meredith in 2005 lays out a very nice application of Bayes Rule to look into this question. The Institute of Medicine suggests that there are three types of discrimination: simple prejudice, stereotyping, and statistical discrimination where docs use probability theory to overcome uncertainty. The latter occurs when the uncertainty of a patient's condition leads the physician to treat her differently from similar people of different race.

The paper uses Bayes Rule to conceptualize the decision a doctor faces when she hears symptom reports from a patient and has to decide whether the patient really has the disease:

Pr(Disease | Symptom) = Pr(Symptom | Disease) * Pr(Disease) / Pr(Symptom)

A doc would decide differently if she believed that disease prevalence differs across racial groups (which affects Pr(Disease)), or if diagnostic signals are noisier from some groups (which changes Pr(Symptom)), maybe because the quality of doctor-patient communication differs across races.
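A small worked example of the formula in Python. The probabilities are invented for illustration and are not taken from the Balsa, McGuire and Meredith paper; `posterior` is a hypothetical helper that expands Pr(Symptom) by the law of total probability, and "noisier signal" is modeled here, as one possible choice, by making the symptom report less informative about the disease.

```python
def posterior(prior, p_symptom_given_disease, p_symptom_given_healthy):
    """Pr(Disease | Symptom) from Bayes Rule."""
    # Pr(Symptom) = Pr(S | D) * Pr(D) + Pr(S | not D) * Pr(not D)
    p_symptom = (p_symptom_given_disease * prior
                 + p_symptom_given_healthy * (1.0 - prior))
    return p_symptom_given_disease * prior / p_symptom

# Same symptom report, different believed prevalence.
low_prior = posterior(0.10, 0.80, 0.20)
high_prior = posterior(0.20, 0.80, 0.20)

# Same prevalence as the first case, but a less informative (noisier) signal.
noisy = posterior(0.10, 0.60, 0.40)

print(round(low_prior, 3), round(high_prior, 3), round(noisy, 3))
```

With these made-up numbers the posterior rises from about 0.31 to 0.50 when the believed prevalence doubles, and falls to about 0.14 when the same report comes through a noisier channel.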

The authors test their model on diagnosis data from family physicians and internists, and find that sensible priors about disease prevalence could explain racial differences in the diagnosis of hypertension and diabetes. For the diagnosis of depression there is evidence that differences in doctors' decisions may be driven by different communication patterns between white docs and their white vs. minority patients.

Obviously prejudice and stereotyping are different from statistical discrimination, and have quite different policy implications. This is a really nice paper that makes these distinctions clear as well as nicely using Bayes Rule to conceptualize the issues. The general idea might also apply to other issues of policy including police stop and search.

Posted by Sebastian Bauhoff at 6:00 AM

May 10, 2006

An Intoxicating Story

From Wikipedia's entry on the t-test:

The t-statistic was invented by William Sealy Gosset for cheaply monitoring the quality of beer brews. "Student" was his pen name. Gosset was statistician for Guinness brewery in Dublin, Ireland, hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge for applying biochemistry and statistics to Guinness's industrial processes. Gosset published the t-test in Biometrika in 1908, but was forced to use a pen name by his employer who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was unknown not only to fellow statisticians but to his employer - the company insisted on the pseudonym so that it could turn a blind eye to the breach of its rules. Today, it is more generally applied to the confidence that can be placed in judgements made from small samples.

I like the way they think.

Posted by Andrew C. Thomas at 6:00 AM

May 9, 2006

Running Statistics On Multiple Processors

Jens Hainmueller

You just bought a state-of-the-art PC with dual processors and yet your model still runs forever? Well, your statistical software is probably not multi-threading, meaning that despite the fact that your computer actually has two processors, the whole computation runs only on one of them. Don’t believe me? Well, check your CPU usage: it's probably stuck at 50 percent (or less).

You might ask why statistical software doesn't use both processors simultaneously. The fact is that splitting up computations across two or more processors is a non-trivial problem that many software packages have not solved yet. This may change in the near future, however, as the advent of dual processors for regular PCs exerts increasing pressure on statistical software producers to allow for multi-threading.

In fact, Stata Corp. has recently released Stata/MP, a new version of Stata/SE that runs on multiprocessor computers. Their website proclaims that: "Stata/MP provides the most extensive support for multiple-processor computers and dual-core computers of any statistics and data-management package." So this bodes well for Stata users.

What’s in it for Non-Stataists? People at S-PLUS told me yesterday that there is "currently an enhancement request to add functionality to S-PLUS that will allow it to use multiple processors. This request has been submitted to our developers for further review." Unfortunately no further information is available at this point.

In my favourite software R, there are efforts to get concurrency and potentially parallelism. Currently, the SNOW package allows for simple parallel computing.

It will be interesting to see how other statistical software producers like SAS, LIMDEP, etc. will react to this trend toward dual processing. Does anybody have more information about this issue?

Posted by Jens Hainmueller at 6:00 AM

May 8, 2006

Coarsened at Random

Jim Greiner

I’m the “teaching fellow” (the “teaching assistant” everywhere but Harvard, which has to have its lovely little quirks: “Spring” semester beginning in February, anyone?) for a course in missing data this semester, and in a recent lecture, an interesting concept came up: coarsened at random.

Suppose you have a dataset in which you know or suspect that some of your data values are rounded. For example, ages of youngsters might be given to the nearest year or half-year. Or perhaps in a survey, you’ve gotten some respondents’ incomes only within certain ranges. Then the data has been “coarsened” in the sense that you know that the true value is within a certain range, but you don’t know where within that range.

Happily, techniques have been developed to handle this sort of situation. In many ways, the game is the same as that in the missing data setting. Just as in the missing data context good things happen when the data are missing at random, so also in this context good things happen when the data are coarsened at random. Thus, to begin with, you have to consider (among other things) whether you think the probability that you will observe only a range of possible data values, as opposed to the specific true value, depends on something you don’t observe (such as that specific true value). A good place to start on all this is Heitjan & Rubin, “Inference from Coarse Data via Multiple Imputation with Application to Age Heaping,” 85 JASA 410 (1990).

One final point: you might think that coarsened at random is a specific case of missing at random. Actually, it’s the other way around. Data can be (and often is assumed to be) coarsened at random but not missing at random. Think and you’ll see why.

Posted by James Greiner at 6:00 AM

May 4, 2006

Detecting Attempted Election Theft

At the Midwest conference last week I saw Walter Mebane presenting his new paper entitled "Detecting Attempted Election Theft: Vote Counts, Voting Machines and Benford's Law." The paper is really fun to read and contains many cool ideas about how to statistically detect vote fraud in situations where only minimal information is available.

With the advent of voting machines that replace traditional paper ballots, physically verifying vote counts becomes impossible. As Walter Mebane puts it: "To steal an election it is no longer necessary to toss boxes of ballots in the river, stuff the boxes with thousands of phony ballots, or hire vagrants to cast repeated illicit votes. All that may be needed nowadays is access to an input port and a few lines of computer code."

How does Mebane utilize statistical tools to detect voting irregularities? He relies on two sets of tests:

The first test relies on Benford’s Law. The idea here is that if individual votes originate from a mix of at least two statistical distributions there may be a rationale to expect that the distribution of the digits in reported vote counts should satisfy the second digit Benford's law. Walter provides simulations showing that the Benford's Law test is sensitive to some kinds of manipulation of vote counts but not others.
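For readers who want to try the idea, here is a rough sketch of a second-digit Benford check. It computes the expected second-digit frequencies from the standard Benford formula and compares them with observed digits using a Pearson chi-square statistic. The function names are mine, and this is not necessarily the exact test statistic used in Mebane's paper.

```python
import math
from collections import Counter

def benford_second_digit_probs():
    """Expected probability of each second digit 0-9 under Benford's law."""
    return [
        sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    ]

def second_digit_chi2(counts):
    """Pearson chi-square statistic of observed second digits vs. Benford.

    counts: iterable of positive integer vote counts. Values below 10
    have no second digit and are skipped.
    """
    digits = [int(str(c)[1]) for c in counts if c >= 10]
    if not digits:
        raise ValueError("need at least one count with two or more digits")
    n = len(digits)
    observed = Counter(digits)
    probs = benford_second_digit_probs()
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in enumerate(probs)
    )
```

With ten digit categories the statistic has 9 degrees of freedom under the null, so values above roughly 16.9 would be flagged at the 5% level.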

The second set of tests relies on randomization. The idea is based on the assumption that in each precinct (especially crowded ones) voters may be randomly and independently assigned to each machine used in the precinct. The test involves checking whether the split of the votes is the same on all the machines used in a precinct. If some of the machines were indeed hacked, the distribution of the votes among candidates would differ on the affected machines. Mebane tests these expectations against data from three Florida counties with very interesting findings.
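As a simplified illustration of the randomization idea (not Mebane's actual procedure), one can compare the vote split across the machines in a precinct with a Pearson chi-square test of homogeneity. The precinct counts below are made up, and the function name is hypothetical.

```python
def homogeneity_chi2(table):
    """Pearson chi-square statistic for a machines-by-candidates table.

    table: list of rows, one per machine, each a list of vote counts
    per candidate. Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if every machine had the same vote split.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical precinct: three machines, two candidates.
precinct = [[120, 80], [115, 85], [60, 140]]
stat, dof = homogeneity_chi2(precinct)
print(stat, dof)
```

A large statistic relative to the chi-square distribution with `dof` degrees of freedom suggests the machines did not all see the same split, which is only suspicious if voters really were assigned to machines at random.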

In general, the paper was very well received by the audience. Some attendees raised concerns about the randomization test, arguing that voters may not be randomly assigned to voting machines (for example old voters may be more likely to go to the first machine in line etc.). The discussant, Jonathan Wand, raised the idea of actually using random assignment of voters to voting machines as an administrative tool to facilitate fraud detection ex post. He also proposed to use sampling techniques to make recounts more feasible (but that would require voting machines that do leave a paper trail). Another comment alluded to the fact that if somebody smart wants to steal an election, he or she might anticipate some of Walter's tests and design manipulations so that they satisfy the test.

Overall, my impression is that although his research is admittedly still at an early stage, Mebane is onto something very cool here and I am eager to see the redrafts and more results in the future. This is a very important topic given that more and more voting machines will be used in the future. Everybody interested in vote fraud should read this paper.

Posted by Jens Hainmueller at 6:00 AM

May 3, 2006

Sensitivity Analysis

Felix Elwert

Observational studies, however well done, remain exposed to the problem of unobserved confounding. In response, methods of formal sensitivity analysis are growing in popularity these days (see Jens's post on a related issue here.)

Rosenbaum and Rubin's basic idea is to hypothesize the existence of an unobserved covariate, U, and then to recompute point-estimates and p-values for a range of associations between this unobserved covariate and, in turn, the treatment T and the outcome Y. If moderate associations (= moderate confounding) change the inference about the effect of the treatment on the outcome we question the robustness of our conclusions.

But how to assess whether the critical association between U, T, and Y that would invalidate the standard results is large in substantive terms?

One popular strategy compares this critical association to the strength of the association between T, Y, and an important known (and observed) confounder. For example, one might say that the amount of unobserved confounding it would take to invalidate the conclusions of a study on the effect of sibship size on educational achievement would have to be at least as large as the amount of confounding generated by omitting parental education from the model.

This is indeed the strategy used in a few studies. But what if U should be taken to stand not for a single but for a whole collection of unobserved confounders? Clearly, it then is no longer credible to compare the critical association of U with the amount of confounding created by a single known covariate. Better to compare it to a larger set of observed confounders. But with larger sets of included variables, we have the problem of interactions between them, and of suppressing and amplifying relationships. In short, gauging the critical association of U with T and Y in substantive terms will become a whole lot less intuitive.

(FYI, Robins and his colleagues in epi have proposed an alternative method of sensitivity analysis, which hasn’t found followers in the social sciences yet, to my knowledge. I’m currently working on implementing their method in one of my projects.)

Posted by Felix Elwert at 6:03 AM

May 2, 2006

The 80% Rule, Part II

Jim Greiner

In my last post, I introduced the so-called 80% rule in employment discrimination cases. In this post, I discuss some of the reasons why it stinks. For the sake of illustration, pretend I’m interested in knowing whether a company discriminates against women in hiring, and recall that the 80% rule says that I should see whether the hiring rate for women is less than 80% of the hiring rate for men.

The first issue with the 80% rule is that it means different things depending on the hiring rate for men. Suppose 90% of men that apply for a job are hired. 80% of 90% is 72%, so the difference between men and women is 18 percentage points; that might seem like something worth investigating. But suppose the company at issue is very exclusive, so it only hires 5% of men who apply; 80% of 5% is 4%. Is this 1-point difference something to worry about? Perhaps it is, perhaps it isn’t, but it sure is different from the 18-point difference in the previous example.
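The arithmetic in the two hypotheticals can be checked with a few lines of Python. This is just a sketch of the calculation described above; the function name is my own.

```python
def women_rate_threshold(men_rate, ratio=0.8):
    """Hiring rate for women below which the 80% rule flags a disparity."""
    return ratio * men_rate

# The two hypothetical companies from the text: 90% and 5% hiring rates for men.
for men_rate in (0.90, 0.05):
    threshold = women_rate_threshold(men_rate)
    gap_points = 100 * (men_rate - threshold)
    print(f"men {men_rate:.0%}: flag below {threshold:.0%}, "
          f"gap of {gap_points:.0f} percentage points")
```

The same ratio translates into very different absolute gaps, which is the point of the first criticism.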

A second issue with the 80% rule is that it varies depending on whether we’re talking about success rates or failure rates ("success" means getting hired here, "failure" means not getting hired). In one of my hypotheticals above, a company hired 90% of the men who applied. So the success rate is 90%, and the failure rate is 10%. If we apply the 80% rule to the success rate, we should worry if the hiring rate for women is below 72%. But what happens if we apply the reasoning of the rule to the failure rate for men? By analogy to the 80% rule’s reasoning, it seems like we should worry if the failure rate for women is greater than, say, 120% (100% + 20%), or perhaps 125% (1/.8 = 1.25), of the failure rate for men. Take the 125% for the sake of argument, and return to our hypothetical in which the failure rate for men was 10%. 125% of 10% is 12.5%, so we should worry if the failure rate for women is greater than 12.5%. But a failure rate for women of greater than 12.5% corresponds to a success rate for women of less than 87.5%, and we just said that we’re supposed to worry if the success rate was less than 72%. So which is it, 87.5% or 72%?
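Here is the same comparison as a short sketch. Both functions return the success rate for women below which we would "worry"; one applies the rule to success rates, the other applies the 1/0.8 analogue to failure rates. The function names and the choice of the 125% version are assumptions for illustration.

```python
def threshold_via_success_rule(men_success, ratio=0.8):
    """Flag if women's success rate is below ratio * men's success rate."""
    return ratio * men_success

def threshold_via_failure_rule(men_success, ratio=0.8):
    """Flag if women's failure rate exceeds (1 / ratio) * men's failure rate.

    Returned as the equivalent success-rate threshold for women.
    """
    men_failure = 1 - men_success
    women_failure_limit = men_failure / ratio
    return 1 - women_failure_limit

men_success = 0.90
print(threshold_via_success_rule(men_success))
print(threshold_via_failure_rule(men_success))
```

The two thresholds agree only in degenerate cases, so the rule's verdict depends on an arbitrary choice of framing.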

A final criticism (for the purposes of this post; I could go on and on here): is any of this significant in the statistical sense? P-values, anyone? Significance tests? Posterior intervals? Anything at all?

Next time you hear someone applying the 80% rule in an employment discrimination case, invite the speaker to join us on this planet.

Posted by James Greiner at 6:00 AM

May 1, 2006

Applied Statistics - Ben Hansen

This week the Applied Statistics Workshop will present a talk by Ben Hansen, Assistant Professor of Statistics at the University of Michigan. Professor Hansen received his Ph.D. from the University of California at Berkeley and was an NSF Post-doctoral Fellow before joining the faculty at Michigan in 2003. His research interests include optimal matching and stratification, causal inference in comparative studies, and length-optimal exact confidence procedures. His work has appeared in JASA and the Journal of Computational and Graphical Statistics, among others.

Professor Hansen will present a talk entitled "Matching with prognosis scores: A new method of adjustment for comparative studies." The corresponding paper is available from the course website. The presentation will be at noon on Wednesday, May 3 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. An abstract of the paper appears on the jump:

In one common route to causal inferences from observational data, the statistician builds a model to predict membership in treatment and control groups from pre-treatment variables, X, in order to obtain propensity scores, reductions f(X) of the covariate possessing certain favorable properties. The prediction of outcomes as a function of covariates, using control observations only, produces an alternate score, the prognosis score, with favorable properties of its own. As with propensity scores, stratification on the prognosis score brings to uncontrolled studies a concrete and desirable form of balance, a balance that is more familiar as an objective of experimental control. In parallel with the propensity score, prognosis scores reduce the dimension of the covariate; yet causal inferences conditional on them are as valid as are inferences conditional only on the unreduced covariate. They suggest themselves in certain studies for which propensity score adjustment is infeasible. Other settings call for a combination of prognosis and propensity scores; as compared to propensity scores alone, the pairing can be expected to reduce both the variance and bias of estimated treatment effects. Why have methodologists largely ignored the prognosis score, at a time of increasing popularity for propensity scores? The answer lies in part with older literature, in which a similar, somewhat atheoretical concept was first celebrated and then found to be flawed. Prognosis scores avoid this flaw, as emerges from theory presented herein.

Posted by Mike Kellermann at 9:43 AM

April 28, 2006

Human irrationality?

Amy Perfors

I've posted before about the "irrational" reasoning people use in some contexts, and how it might stem from applying cognitive heuristics to situations they were not evolved to cover. Lest we fall into the depths of despair about human irrationality, I thought I'd talk about another view on this issue, this time showing that people may be less irrational than the gloom-and-doom views might suggest.

In Simple Heuristics That Make Us Smart, Gigerenzer et al. argue that, contrary to popular belief, many of the cognitive heuristics people use are actually very rational given the constraints on memory and time that we have to face. One strand of their research suggests that people are far better at reasoning about probabilities when they are presented as natural frequencies rather than as numerical probabilities (the format most studies use). Thus, for instance, if people see pictures of, say, 100 cars, 90 of which are blue, they are more likely not to "forget" this base rate than if they are just told that 90% of cars are blue.

A recent paper in the journal Cognition (vol 98, 287-308) expands on this theme. Zhu & Gigerenzer found that children steadily gain in the ability to reason about probabilities, as long as the information is presented using natural frequencies. Children were told a story such as the following:

Pingping goes to a small village to ask for directions. In this village, the probability that the person he meets will lie is 10%. If a person lies, the probability that he/she has a red nose is 80%. If a person doesn't lie, the probability that he/she also has a red nose is 10%. Imagine that Pingping meets someone in the village with a red nose. What is the probability that the person will lie?

Another version of the story gave natural frequencies instead of conditional probabilities, for instance "of the 10 people who lie, 8 have a red nose." None of the fourth-grade through sixth-grade children could answer the conditional probability question correctly, but sixth graders approached the performance of adult controls for the equivalent natural frequency question: 53% of them matched the correct Bayesian posterior probability. The fact that none of the kids could handle the probability question is not surprising -- they had not yet been taught the mathematical concepts of probability and percentage. What is interesting is that, even without being taught, they were capable of reasoning "the Bayesian way" about as well as adults do.
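For the curious, here is the Pingping problem worked both ways in a short sketch. The conditional-probability version applies Bayes' rule directly; the natural-frequency version just counts red-nosed villagers. The figure of 100 villagers, and the 9 truthful red-nosed villagers it implies, are my own extrapolation from the stated percentages rather than numbers quoted from the paper.

```python
from fractions import Fraction

def posterior_from_probabilities(prior, p_sign_given_lie, p_sign_given_truth):
    """Bayes' rule: P(lie | red nose) from the three stated probabilities."""
    numerator = prior * p_sign_given_lie
    return numerator / (numerator + (1 - prior) * p_sign_given_truth)

def posterior_from_frequencies(liars_with_sign, truthful_with_sign):
    """Natural-frequency version: share of red-nosed people who lie."""
    return Fraction(liars_with_sign, liars_with_sign + truthful_with_sign)

# Conditional-probability framing from the story.
p = posterior_from_probabilities(Fraction(1, 10), Fraction(8, 10), Fraction(1, 10))

# Natural-frequency framing, assuming 100 villagers: 10 lie (8 with a
# red nose) and 90 do not (9 with a red nose).
f = posterior_from_frequencies(8, 9)

print(p, f, float(p))
```

Both routes give the same answer, but the second needs nothing more than counting and one division, which may be why children can manage it.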

The most interesting part of this research, for me, is less about the question of whether people "are Bayesian" (whatever that means), but rather that it highlights a very important message: representation matters. When information is presented using a representation that is natural, we find it a lot easier to reason about it correctly. I wonder how many of our apparent limitations reveal less about problems with our reasoning, and more about the choice of representation or the nature of the task.

Posted by Amy Perfors at 6:00 AM

April 27, 2006


Felix Elwert

Why did people code their missing values as real numbers such as 999 in the old days? Why not “.” from the get go? And why do many big, federally funded surveys insist on numerical missing values to this day?

Don’t we all have stories about how funny missing value codes (“-8”) got people in trouble (think The Bell Curve)? Are there any anecdotes where people got in trouble for mistaking “.” for a legitimate observation?

Posted by Felix Elwert at 6:00 AM

April 26, 2006

Inauthentic Paper Detector

Sebastian Bauhoff

A group at the Indiana School of Informatics has developed software to detect whether a document is "human written and authentic or not." The idea was inspired by the successful attempt of MIT students in 2004 to place a computer-generated document at a conference (see here). Their program collated random fragments of computer science speak into a short paper that was accepted at a major conference without revision. (That program is online and you can generate your own paper, though unfortunately it only writes computer science articles).

The new tool lets users paste pieces of text and then assesses whether the content is likely to be authentic or just gibberish. The program tries to identify human-style writing that is characterized by certain repetition patterns and apparently does rather well. It is not clear whether this works well for social science type articles. The first paragraphs of a recent health economics article (to remain unnamed) only have a 35.5% chance of being authentic. Hmm...

So is this just a joke or useful programming? The authors say it could be used to differentiate whether a website is authentic or bogus, or to identify different types of texts (articles vs blogs, for example). I wonder what the algorithms behind such technology are, and whether this will lead to an arms race between fakers and detectors? If one of them can recognize a human-written text could this be used by the faking software?

If further tweaked, could this have an application in the social sciences? Maybe we could use the faking software to search existing papers, collate them smartly and use that to identify patterns and get new ideas? Maybe everyone should run their papers through detector software before submitting them to a journal or presenting at a workshop? And students watch out! No more random collating at 3am to meet the next day's deadline!

PS: this blog entry has been classified as "inauthentic with a 26.3% chance of being an authentic text"...

Posted by Sebastian Bauhoff at 2:41 PM

Data from China: Land of Plenty? (II)

Sebastian Bauhoff

In the last entry I wrote that China is the new exciting trend for researchers interested in development issues. There are now a number of surveys available, and it is getting easier to obtain data. (For a short list, see here.) However there are two key issues that are still pervasive: language difficulties and little sharing of experiences.

While some Chinese surveys are available in English translation, it is still difficult to fully understand their context. China is a very interesting yet peculiar place. It clearly helps to work with someone who speaks (and reads!) the language, though you might still miss some unexpected information -- and there are many things that can be surprising.

More annoying however is the lack of sharing of information and data. This problem has two associated parts. For the existing data, people seem to struggle with similar problems but don't provide their solutions to others. In the case of the China Health and Nutrition Survey for example, numerous papers have been written on different aspects and the key variables are being cleaned over and over. Apart from the time that goes into that, this can lead to different results.

Another lack of sharing is with regard to existing data or ongoing surveys. There are now a lot of people who either have collected or are currently collecting data in China. But it is rather difficult even to find out about existing sources. If you're lucky, you've found an article that uses one. If you're not, you might find one only once you put in your funding application.

To really start exploring the exciting opportunities that China may have to offer for research, these problems need to get fixed. I can understand that people don't necessarily want to hand over their data, but it seems that there is too little known about existing surveys, even to researchers who have been working on China for longer. And as for the cleaning of existing data and reporting problems, it just seems like a waste not to share. I wonder if there are similar experiences from other countries?

Posted by Sebastian Bauhoff at 6:00 AM

April 25, 2006

Open and Transparent Data

You, Jong-Sung

There was a big scandal in scientific research recently. Dr. Hwang Woo-suk of Seoul National University in Korea announced last June that he and his team had cloned human embryonic stem cells from 11 patients. It was a remarkable breakthrough in stem cell research and many people expected that he would eventually get a Nobel Prize. Hwang's team, however, was found to have intentionally fabricated key data in two landmark papers on human embryonic stem cells, according to a Seoul National University panel. Now, the prosecution is probing into his team’s alleged fabrication of data and violation of bioethics law.

Remarkably, the prestigious journal Science was not able to detect the data faking either before or after publication of the articles. It is understandable considering that peer reviewers typically examine the presented analysis of the data but neither receive nor examine the actual data itself. Even more surprisingly, most of the 26 co-authors of the June 2005 article were unaware of the data fabrication. It was revealed only through an inside whistleblower who was the second author of the earlier article, and through a team of investigative journalists.

This incident makes us aware of the weakness and vulnerability of the review system of academic journals. Indeed, there have been many fraud cases in the history of scientific research, and Dr. Hwang has just added one more such case. Although outright faking may not be very common, errors in data and data analysis might be much more common than most people assume them to be.

I was struck by numerous errors that were found by students of Gov 2001 who replicated the analysis of an article published in a prominent social science journal. Many of the errors are probably benign and not critical to their key findings, but some errors may be critical and even deliberate. It can be tempting to distort the data or results of data analysis when a researcher has spent much time and energy to find evidence to support his or her hypothesis and the results are close but fall short of significance.

In his entry entitled Citing and Finding Data, Gary King discussed the [in]ability to reliably cite, access, and find quantitative data, all of which remain in an entirely primitive state of affairs. Sebastian Bauhoff also stressed the need for making data available in his entry Data Availability. I cannot agree with them more. If journals require authors to submit data as well as the manuscript of their paper and publish the data that were used for articles as an on-line appendix, it will certainly reduce the errors in data and data analysis as well as spur further research. This should be applied to qualitative data (such as interview transcripts) as well as quantitative data.

Posted by Jong-sung You at 6:00 AM

April 24, 2006

Applied Statistics - Brian Ripley

This week the Applied Statistics Workshop will present a talk by Brian Ripley, Professor of Applied Statistics at the University of Oxford. Professor Ripley received his Ph.D. from the University of Cambridge and has been on the faculties of Imperial College, Strathclyde, and Oxford. His current research interests are in pattern recognition and related areas, although he has worked extensively in spatial statistics and simulation, and continues to maintain an interest in those subjects. New statistical methods need good software if they are going to be adopted rapidly, so he maintains an interest in statistical computing. He is the co-author of Modern Applied Statistics with S, currently in its fourth edition. Professor Ripley is also a member of the R core team, which coordinates the R statistical computing project, a widely adopted open-source language for statistical analysis.

Professor Ripley will present a talk entitled "Visualization for classification and clustering." Slides for the talk are available from the course website. The presentation will be at noon on Wednesday, April 26 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Mike Kellermann at 3:49 PM

The 80% Rule, Part I

Jim Greiner

I’ve blogged previously about the course in statistics and law I’m co-teaching this semester (see, for example, here). The course is now in its second simulation, which deals with employment discrimination. In a recent class, the 80% rule came up. I wish it hadn’t. In fact, I wish the “rule” had never seen the light of day. In this post, I’ll explain what the 80% rule is. In a subsequent post, I’ll explain why it stinks.

Suppose we’re interested in figuring out whether members of a protected class (say, women) are being hired, promoted, fired, disciplined, whatever at a different rate from a comparison group (say, men, and for the sake of discussion, let’s say we’re interested in hiring). Long ago, the Equal Employment Opportunity Commission (“EEOC”) released a statement saying that it would ordinarily regard as suspect a situation in which the hiring rate for women was less than 80% of the hiring rate for men. Note that the EEOC has the authority to bring suit in the name of the United States against a defendant that has violated federal employment discrimination laws.

It would be bad enough if the EEOC used the 80% rule for the purpose it gave, i.e., a statement about how the agency would exercise its investigative and prosecutorial discretion. Alas, courts, perhaps desperate for guidance on quantitative principles, have picked up on the idea, and some now use it as an indicator of which disparities are legally significant. Courts do so despite the outcry of those in the quantitative community interested in such things. More on that outcry in my next post.

Posted by James Greiner at 6:00 AM

April 21, 2006

Mass media and the representativeness heuristic

Amy Perfors

Since the days of Kahneman & Tversky, researchers have been finding evidence showing that people do not reason about probabilities as they would if they were "fully rational." For instance, base-rate neglect -- in which people ignore the frequency of different environmental alternatives when making probability judgments about them -- is a common problem. People are also often insensitive to sample size and to the prior probability of various outcomes. (This page offers some examples of what each of these means.)

A common explanation is that these "errors" arise as the result of using certain heuristics that usually serve us well, but lead to this sort of error in certain circumstances. Thus, base-rate neglect arises due to the representativeness heuristic, in which people assume that each case is representative of its class. So, for instance, people watching a taped interview with a prison guard with extreme views will draw conclusions about the entire prison system based on this one interview -- even if they were told in advance that his views were extreme and unusual, and that most guards were quite different. The prison guard was believed to be representative of all guards, and thus the prior information indicating that his views were really quite rare was disregarded.

In many circumstances, a heuristic of this sort is sensible: after all, it's statistically unlikely to meet up with someone or something that is, uh, statistically unlikely -- so it makes sense to usually assume that whatever you interact with is representative of things of that type. The problem is -- and here I'm harking back to a theme I touched on in an earlier post -- that this assumption no longer works in today's media-saturated environment. Things make it into the news precisely because they are unlikely; but even if we know that consciously, it is easy to ignore that information. This may be part of the reason that so many people believe, for instance, that crime is much likelier than it is, that terrorism is an ever-present danger, that most politicians are corrupt, etc. I could go on. The cognitive heuristics that are so useful in the ordinary day-to-day rely on certain assumptions that are no longer even approximately valid when interpreting secondhand, media-driven information. And therein lies a problem.

Posted by Amy Perfors at 6:00 AM

April 20, 2006

Explaining Individual Attitudes Toward Immigration in Europe

Jens Hainmueller and Michael Hiscox

We have written a paper that investigates individual attitudes toward immigration in 22 European countries. In line with our research on individual attitudes toward trade policies (see previous blog entries here, here, and here), we find that a simple labour market model (a la Heckscher-Ohlin) does not do very well in accounting for preferences at the individual level. This finding resonates well with economic theory, given that more recent economic models are actually quite equivocal about whether immigrants will have an adverse impact on the wages or employment opportunities of local workers with similar skills (see our discussion of these models here).

Please find our abstract after the jump. Here is the link to the paper. As always, comments are highly appreciated.

Educated Preferences: Explaining Attitudes Toward Immigration In Europe:

Recent studies of individual attitudes toward immigration emphasize concerns about labor market competition as a potent source of anti-immigrant sentiment, in particular among less-educated or less-skilled citizens who fear being forced to compete for jobs with low-skilled immigrants willing to work for much lower wages. We examine new data on attitudes toward immigration available from the 2003 European Social Survey. In contrast to predictions based upon conventional arguments about labor market competition, which anticipate that individuals will oppose immigration of workers with similar skills to their own, but support immigration of workers with different skill levels, we find that people with higher levels of education and occupational skills are more likely to favor immigration regardless of the skill attributes of the immigrants in question. Across Europe, higher education and higher skills mean more support for all types of immigrants. These relationships are almost identical among individuals in the labor force (i.e., those competing for jobs) and those not in the labor force. Contrary to the conventional wisdom, then, the connection between the education or skill levels of individuals and views about immigration appears to have very little, if anything, to do with fears about labor market competition. This finding is consistent with extensive economic research showing that the income and employment effects of immigration in European economies are actually very small. We find that a large component of the effect of education on attitudes toward immigrants is associated with differences among individuals in cultural values and beliefs. More educated respondents are significantly less racist and place greater value on cultural diversity than their counterparts; they are also more likely to believe that immigration generates benefits for the host economy as a whole.

Posted by Jens Hainmueller at 6:00 AM

April 19, 2006

The Language of Research

Drew Thomas

It seems that the difficulty in learning languages isn't always restricted to spoken words. A recent article in the New York Times ("Searching For Dummies", March 26 - here's a link, though it's for pay now) quotes an Israeli study which demonstrates the ineptitude of graduate students in making specific Internet searches in 2002.

Now, I know a lot has happened in the world of search engines in the last 4 years, and I admit my bias in being an MIT undergrad at the time meant that I was waist-deep in Google and its way of sorting information. See if you can't do any of these challenges now, with no time limit:

"A picture of the Mona Lisa; the complete text of either "Robinson Crusoe" or "David Copperfield"; and a recipe for apple pie accompanied by a photograph."

What's the trick to this kind of searching? Unless you have an excellent, selective and disambiguating search engine, knowing search grammar and context is essential.

For example, getting the text of David Copperfield is now a three-hop, one search process: search for it on Google, and select the Wikipedia entry, which has been cleanly separated from the magician and includes not one but three versions. A Google search for "David Copperfield" Dickens gets us the full text as the first hit, showing that context improves. I have little doubt that 6 years ago, such a search would take forever without an extra bit of context, such as "Chapter 3" in order to elicit a full text.

So the technology has gotten better. But the illusion of control remains; I find it more difficult to find other disambiguations that Wikipedia hasn't considered. Moreover, for any meaningful searches, such as to relevant papers in particular areas where I don't know the nomenclature, this feeling of power is challenged.

This is a skill that permeates all levels of society, from kindergarten on up, but there's a definite lack of appreciation for it. To learn it like a language, early on and with constant practice, seems to be the solution; to learn the context, grammar and syntax of the search (and research), and to appreciate that we're trying to communicate our intentions using all the tools we have available; by blaming them, we all typify poor carpenters.

Posted by Andrew C. Thomas at 6:00 AM

April 18, 2006

Censoring or Truncation Due to "Death": What’s the Question? (Part II)

Jim Greiner

In my last post, I pointed out that when presented with a causal inference situation of treatment, intermediate outcome, and final outcome, we have to be careful to define a sharp question of interest. Sometimes, we’re interested in the ITT, or the effect of the treatment on the final outcome. At other times, we’re interested in the effect of the intermediate outcome on the final outcome, and the treatment is our best way of manipulating the intermediate outcome so as to draw causal inferences.

In my view, these principles are important in the legal context. Take race in capital sentencing, for example.

To begin, it’s a big step to draw causal inferences about race in a potential outcomes framework; the maxim "no causation without manipulation" (due, I believe, to Paul Holland) explains why. I believe that step can be taken, but that’s another subject. Suppose we take it, i.e., we decide to apply a potential outcomes framework to an immutable characteristic. The treatment (applied to the capital defendant) is being African-American, the intermediate outcome is whether the defendant is convicted, and the final outcome is whether a convicted defendant is sentenced to die. (Note that, in an instance of fairly macabre irony, if one applies the language of censoring or truncation due to death here, "death" is an acquittal on the capital charge.)

What causal question do we care about? If all we want to study is the relationship between race and the death penalty, then we don’t care whether a defendant avoids a death sentence via acquittal or avoids a death sentence after a conviction by being sentenced to life. If, on the other hand, what we want to study is fairness in sentencing proceedings, then we need principal stratification; we need to isolate a set of defendants who would be convicted of the capital charge if African-American and convicted of the capital charge if not African-American. Both are potentially interesting causal questions. Let’s just make sure we know which we’re asking.

Posted by James Greiner at 6:00 AM

April 17, 2006

Applied Statistics - Gerard van den Berg

This week the Applied Statistics Workshop will present a talk by Gerard van den Berg, Professor of Labor Economics at the Free University of Amsterdam. Before joining the faculty at Amsterdam in 1996, he worked at Northwestern University, New York University, Stockholm School of Economics, Tilburg University, Groningen University, and INSEE-CREST. From 2001 to 2004, he was Joint Managing Editor of The Economic Journal, and has published in Econometrica, Review of Economic Studies, American Economic Review, and other journals. He is currently a visiting scholar at the Center for Health and Wellbeing at Princeton University. His research interests are in the fields of econometrics, labor economics, and health economics, notably duration analysis, treatment evaluation, and search theory.

Professor van den Berg will present a talk entitled "An Economic Analysis of Exclusion Restrictions for Instrumental Variable Estimation." The paper is available from the course website. The presentation will be at noon on Wednesday, April 19 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Mike Kellermann at 12:00 AM

April 14, 2006

Statistical Humor

You, Jong-Sung

Here are some good statistics jokes for all of you.

  • "When she told me I was average, she was just being mean".
  • "Old statisticians never die -- they just become insignificant." - Gary Ramseyer, First Internet Gallery of Statistical Jokes
  • "Three statisticians go deer hunting with bows and arrows. They spot a big buck and take aim. One shoots and his arrow flies off three meters to the right. The second shoots and his arrow flies off three meters to the left. The third statistician jumps up and down yelling, 'We got him! We got him (on average)!'" - Richard Lomax and Seyed Moosavi, 1998, Using Humor to Teach Statistics: Must They Be Orthogonal?

Is the use of humor an effective way of teaching statistics? Lomax and Moosavi (1998), citing J. Bryant and D. Zillmann (1988), suggest that there is little empirical evidence that humor (1) increases student attention, (2) improves the classroom climate or (3) reduces tension. Fortunately, however, the same research indicates that humor does (1) increase enjoyment and (2) motivate students to achieve more. Hence, it may not be a bad idea to incorporate some statistical jokes (their article and Gary Ramseyer's website are two good sources).

This isn't a joke as such, but here is another interesting statistical dialogue from Lomax and Moosavi:

Q. I read that a sex survey said the typical male has six sexual partners in his life and the typical female has two. Assuming the typical male is heterosexual, and since the number of males and females is approximately equal, how can this be true?

A. You’ve assumed that "typical" refers to the arithmetical average of the numbers. But "average" also means "middle" and "most common". (Statisticians call these three kinds of averages the mean, the median and the mode, respectively.) Here’s how the three are used: Say you’re having five guests at a dinner party. Their ages are 100, 99, 17, 2, and 2. You tell the butler that their average age is 44 (100+99+17+2+2=220, 220÷5=44). Just to be safe, you tell the footman their average age is 17 (the age right in the middle). And to be sure everything is right, you tell the cook their average age is 2 (the most common age). Voila! Everyone is treated to pureed peas accompanied by Michael Jackson’s latest CD, followed by a fine cognac. In the case of the sex survey, "typical" may have referred to "most common", which would fit right in with all the stereotypes. (That is, if you believe sex surveys.)
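
The three "averages" in the dinner party example are easy to check. Here is a minimal sketch using Python's standard statistics module; the variable name is mine, not Lomax and Moosavi's.

```python
from statistics import mean, median, mode

# Ages of the five dinner guests in the example above.
ages = [100, 99, 17, 2, 2]

print(mean(ages))    # arithmetic average, the butler's number
print(median(ages))  # middle value once the ages are sorted, the footman's number
print(mode(ages))    # most common value, the cook's number
```

All three are legitimate summaries of the same five numbers, which is the point of the joke: "typical" is ambiguous until you say which one you mean.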

Posted by Jong-sung You at 6:00 AM

April 13, 2006

Data from China: Land of Plenty? (I)

Sebastian Bauhoff

While the media keeps preaching that this century is Chinese, many researchers are getting excited about new opportunities for data collection and access to data. For the past decades, many development researchers have focused on India because of the regional variation and good infrastructure for surveys. It seems that now China holds a similar promise, and could provide an interesting comparison to India.

I recently started collecting information on China (here); below are some highlights. If you know of more surveys, do let me know.

Probably the best known micro-survey at this point is the China Health and Nutrition Survey (CHNS), which is a panel with rounds in 1989, 1991, 1993, 1997, 2000, and 2004 (the 2006 wave is funded) and covers more than 4,000 households in 9 provinces. Though this is an amazing dataset, using it is not always easy. For example, there are problems of linking individuals over time. New longitudinal master files are continuously released but the fixes are sometimes hard to integrate in ongoing projects (the IDs are mixed up). Also there seem to be some inconsistencies in the recording, especially in earlier rounds and in some key variables such as education. The best waves seem to be those of 1997 and 2000.

There is also a World Bank Living Standards Measurement Study (LSMS) for China. That survey used standardized (internationally comparable?) questionnaires and was conducted in 780 households and 31 villages in 1996/7. For those interested in the earlier periods, there is commercial data at the China Population Information and Research Center which has mainly census-based data starting from 1982. The census itself is also available electronically now (and with GIS maps) but there is a lively debate as to how reliable the figures are, and whether key measures changed over time. But it should still be good for basic cross-sectional analysis.

Posted by Sebastian Bauhoff at 6:00 AM

April 12, 2006

Censoring or Truncation Due to "Death": What’s the Question?

Jim Greiner

A few weeks ago, Felix Elwert gave a bang-up presentation at the Wednesday seminar series on the effect of cohabitation on divorce rates (see here). One of the most interesting points I took away from the discussion was the following: in some social science situations in which a treatment is followed by an intermediate outcome, then by a final outcome, we might be interested in different causal questions. One causal question is the effect of the treatment on the final outcome; this is commonly called the intention-to-treat effect (ITT). The name comes from, I believe, an encouragement design context; the treatment is an encouragement to, say, get a vaccine, the intermediate outcome is whether a test subject gets a vaccine, the final outcome is whether the test subject gets a disease, and the ITT is the effect of encouragement on disease rates.

A second causal question different from the ITT is the effect of the intermediate outcome on the final outcome; in the vaccine example above, the question here would be the effect of the vaccine on disease rates.

Felix’s point was that if we think of cohabitation as the treatment, marriage as the intermediate outcome, and divorce as the final outcome, there are different causal questions we might want to ask. Those of us steeped in a principal stratification and truncation due to "death" way of looking at things might jump to the conclusion that the idea of divorce makes no sense for people who don’t get married. Thus, the only "right" way to look at this situation, we might say, is to isolate the set of people who would get married regardless of cohabitation (the treatment). Not so. If what we’re really interested in is avoiding divorce per se (maybe because divorce is stigmatizing, more stigmatizing than not ever having been married), then perhaps we don’t care whether people avoid divorce by not getting married or avoid divorce by getting married and staying that way. In that case, what we’re after is the ITT. If, however, what we want is stable marriages, then we need to do the principal stratification and truncation due to death bit.
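
To see that the two questions can give different answers, here is a small sketch with made-up numbers. Everything in it is hypothetical: the ten people, their potential outcomes, and the coding choice that someone who never marries counts as "not divorced". It also takes an omniscient view that real data never allow, since we only ever observe one of the two potential outcomes for each person.

```python
# Hypothetical potential outcomes for ten people (illustration only, not real data).
# Each tuple is (m0, m1, d0, d1):
#   m0, m1 = married (1) or not (0) without / with cohabitation
#   d0, d1 = divorced (1) or not (0) without / with cohabitation
# Someone who never marries is coded as not divorced.
people = [
    (1, 1, 0, 0), (1, 1, 1, 1), (1, 1, 0, 1), (1, 1, 0, 0),  # marry either way
    (0, 1, 0, 1), (0, 1, 0, 0),                              # marry only if they cohabit
    (1, 0, 1, 0), (1, 0, 0, 0),                              # marry only if they do not
    (0, 0, 0, 0), (0, 0, 0, 0),                              # never marry
]


def average_effect(units):
    """Mean of d1 - d0 over the given units."""
    return sum(d1 - d0 for _, _, d0, d1 in units) / len(units)


# Question 1 (ITT): effect of cohabitation on divorce for everyone.
itt = average_effect(people)

# Question 2: effect among those who would marry regardless of cohabitation.
always_married = [p for p in people if p[0] == 1 and p[1] == 1]
stratum_effect = average_effect(always_married)

print(itt, stratum_effect)
```

With these invented numbers the two effects differ, because the ITT averages over people whose marriage decision itself changes with cohabitation, while the stratum effect looks only at people who marry either way.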

I think Felix’s insight has some applicability to the legal context. More on that in a subsequent post.

Posted by James Greiner at 6:00 AM

April 11, 2006

Unstable Racial Identities

Felix Elwert

Race is a surprisingly malleable construct, though it’s usually taken as fixed in statistical models. In a recent paper with Nicholas Christakis (Widowhood and Race, American Sociological Review Vol 71(1), 2006) I had to engage changing racial responses head on.

Assorted previous research has shown that people may change their racial self-description over time because they are multiracial, when they marry somebody of a different racial group, or – not to be neglected – because the answer choices in surveys may change over time.

Most people think that unstable or changing racial self-identification is an issue largely confined to a small group of multiracial individuals. This is a country, after all, of the one-drop rule. But research, including our own, shows that that isn’t so.

In a supplementary analysis of the 2001 Census Quality Survey (CQS), we showed that the racial self-identification of "whites" is also surprisingly unstable. The CQS asked more than 50,000 respondents twice within the span of just a few months to identify their own race. Once they were allowed to select only one race, and the other time they were given the option of selecting multiple races (this gets at the difference between the old and the new Census race questions). The answers were then matched to individual responses from the official 2000 Census.

Depending on whether we compared consecutive responses to the same race question on the Census and the CQS, or the different questions asked in the two waves of the CQS, and whether we treated "Hispanic" as a category distinct from black and white, the agreement between answers for whites ranged from 95.6 to 97.5 percent. We obtained very similar answers for blacks.

That is, between 2 and 5 percent of people who used to identify as white would call themselves either something else or a mixture of races when given the chance. And the percentage of "whites" who will change their racial self-description as a function of question wording is about the same as the percentage of "blacks" who will do likewise.

Posted by Felix Elwert at 6:00 AM

April 10, 2006

How many combinations have never been chosen at the ice cream shop?

To follow up a previous post of mine, here's another statistics-related lesson to do with your kids. I came up with it at an ice cream shop with my 10-year-old daughter a couple of weeks ago. The point of the lesson is about the power of combinatorics and really, really big numbers. The result is pretty surprising. Here's the recipe:

INGREDIENTS: An ice cream shop, some money for some ice cream, a kid, and a calculator. [I hear 2 objections. To the first: Don't worry, you're probably already carrying a calculator; look closer at your cell phone. The second is: shouldn't we be requiring kids to make the calculations themselves? The fact is that lots of famous mathematicians and statisticians are pretty bad at arithmetic, even though they are obviously spectacularly good at higher level mathematics. Being able to multiply 2-digit numbers in your head is probably useful for something, but understanding the point of the calculation -- why you're doing it, what the inputs are, and what the result of the calculation means -- is far more important.]

DIRECTIONS: Make your order, sit down, and, while you're eating, pose this question to your kid: Suppose the choices on the menu on the wall have never changed since the shop opened. How many choices do you see that have never been chosen even once?

After thinking about weird but fun options like pouring coffee in an ice cream cone, we try it a little more systematically. So we first set out to figure out how many options there are. So I ask, "how many ice cream flavors are there?" My daughter counts them up; it was 20. So how many combinations of one flavor can you have? 20 obviously. How many combinations of two flavors can you have (where for simplicity, we'll count a cone with chocolate on the bottom and vanilla on the top as different from the reverse)? The answer is 20 x 20 or 400. (It's not 40, it's 400. Think of a checkerboard with one flavor down the 20 rows and another across the 20 columns and the individual squares as the combination of the two.)

So how many toppings could we have on that ice cream? She went to the counter and counted: 18. And then did 18*400, which she figured out is 7,200. After that we used the calculator and just continued to multiply and multiply as I point out categories on the menu and she counts each up. The total gets big very fast. We got to numbers in the trillions in just a few minutes.

So we find that the total number of options is a really big number. But what does that say about how many options have been tried?

Let's suppose, I say, that it only takes one second for someone to make their choice and receive their order, and that the shop is open 24 hours a day, 7 days a week, all year round. (You could make more realistic assumptions, and teach some good data collection techniques, by watching people get their orders and timing them.) Then we figure out how long it would take for the shop to have been open (under these wildly optimistic assumptions) in order to serve up all the options. To calculate the number of years, all you do is take the number of options, divide by 60 (seconds a minute), 60 (minutes an hour), 24 (hours in a day), and 365 (days a year). In our case, to serve all the options, the shop would have had to be open for around 43,000 years!
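
For anyone who wants to redo the arithmetic at home, here is a minimal sketch. The function and variable names are made up for illustration, and since the post gives only the roughly 43,000-year result rather than the full menu count, the sketch works backwards from that figure instead of inventing the missing categories.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # one order per second, open around the clock


def years_to_serve(options):
    """Years needed to serve every option once at one order per second."""
    return options / SECONDS_PER_YEAR


# Counts from the story: 20 flavors for each of two scoops, then 18 toppings.
options_so_far = 20 * 20 * 18
print(options_so_far)
print(years_to_serve(options_so_far))  # still far less than a year

# Working backwards: the number of options implied by roughly 43,000 years.
implied_options = 43_000 * SECONDS_PER_YEAR
print(implied_options)

# Share of those options a shop could have served in 100 years.
print(100 / 43_000)
```

The first three categories alone take only a couple of hours to serve; it is the repeated multiplication by each additional menu category that pushes the total into the trillions.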

So even if the shop had been open for 100 years, it couldn't have served even a tiny fraction of the available options. So how many choices have never been tried at the ice cream shop? It's not just the few that we can cleverly dream up. In fact, almost all of them (over 99 percent of the possibilities) have never been tried!
(At which point my daughter said, "ok, let's get started!")

Actually, if you go to a deli and try this, you can get much larger numbers. For example, if the menu has about 85 items, and each one can be ordered in 10 different ways, the number of possible orders (10 to the 85) is larger than the number of elementary particles in the universe.

Posted by Gary King at 6:00 AM

Applied Statistics - Matthew Harding

This week the Applied Statistics Workshop will present a talk by Matthew Harding, a Ph.D. candidate in the Department of Economics at MIT. He received his BA from University College London and an M.Phil in economics from the University of Oxford. His work in econometrics focuses on stochastic eigenanalysis with applications to economic forecasting, modeling of belief distributions, and international political economy. He also works on modeling heterogeneity in nonlinear random coefficients models, duration models, and panels.

Matt will present a talk entitled "Evaluating Policy Counterfactuals in Voting Models with Aggregate Heterogeneity." A link to a background paper for the presentation is available from the workshop website. The presentation will be at noon on Wednesday, April 5 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Mike Kellermann at 12:00 AM

April 6, 2006

New Immigrant Survey (NIS 2003) Online

Jens Hainmueller

Great news for people studying immigration: the first full-cohort module of the New Immigrant Survey (NIS 2003) is now online. The NIS is "a nationally representative multi-cohort longitudinal study of new legal immigrants and their children to the United States based on nationally representative samples of the administrative records, compiled by the U.S. Immigration and Naturalization Service (INS), pertaining to immigrants newly admitted to permanent residence."

The sampling frame consists of new-arrival and adjustee immigrants. The Adult Sample covers all immigrants who are 18 years of age or older at admission to the Lawful Permanent Residence (LPR) program. There is also a Child Sample, which covers immigrants with child-of-U.S.-citizen visas who are under 18 years of age and adopted orphans under five years of age. Overall, 8,573 adults and 810 children were interviewed. This constitutes a response rate of about 65%.

The NIS features a wide variety of questions regarding demographics, pre-immigration experiences, employment, health, health and life insurance, health care utilization and daily activities, income, assets, transfers, social variables, migration history, etc. There is also the controversial and much discussed skin color scale test, where the survey measured respondent skin color using an 11-point scale, ranging from zero to 10, with zero representing albinism, or the total absence of color, and 10 representing the darkest possible skin. The scale was memorized by the interviewers, so that the respondent never sees the chart. Check out the ten shades of skin color corresponding to the points 1 to 10 and a description of the skin color test here.

Posted by Jens Hainmueller at 6:00 AM

April 5, 2006

Open Season on the Messenger

Felix Elwert

In a previous post, Mike quoted Alan Greenspan, "I suspect greater payoffs will come from more data than from more technique." Not an uncommon opinion. But there are more and less flattering ways of reading such statements.

For what’s behind the sentiment, I sometimes suspect (I’m not picking fights with the Maestro), is not just the desire for better data but a distrust of advanced statistical methods. There’s this perception that more complicated math necessitates more assumptions, ergo less robust results. By this logic, the simpler the method, the more credible the conclusion. Crosstabs rule, ANOVA passes muster. The truth, of course, is the opposite: simple stats in observational data analysis usually require more assumptions. As we move from crosstabs to OLS to GEE for a given analytical goal we are usually trying to relax assumptions. Tragically, the presence of said assumptions often becomes obvious only after the author points them out. And then it’s open season on the messenger.

I witnessed this sort of thinking recently when I reviewed a paper for a leading sociological journal. The author pointed out some serious methodological flaws in one strand of comparative welfare state research, then proposed an alternative to one well regarded analysis by relaxing some offending assumptions. Boom, did he get slammed by one reviewer for allegedly making the very assumptions he had exposed in the first place. The paper was rejected in the first round. (This is sort of a pet peeve of mine, and I might vent again.)

Posted by Felix Elwert at 6:00 AM

April 4, 2006

Academic Ego

Jim Greiner

In a previous post, I brought up the subject of how we quantitative analysts can abuse the trust decision makers (judges, government officials, members of the public) put in us, when they are inclined to trust us at all. Decision makers should be able to depend on us to give them not just a (clearly and understandably stated) summary of inferences we believe are plausible, but also a (clearly and understandably stated) statement of the weak points of those inferences. “No kidding,” you might say. OK. If it’s that obvious, how come none of us is able to do it?

Here’s an exercise, again, something that’s come out of my experience in teaching a class on statistical expert witnesses in litigation. Next time you think you’ve “got it,” that you’ve done the right thing with a dataset and have drawn some solid inferences, step back and ask: “Suppose I was paid $____/hour to convince people that the work I’ve just done is not worthy of credence. What would I say?” If all you can come up with are criticisms that make you laugh (because they’re so silly) or ideas that you can dismiss as unscrupulous babbling motivated by a desire for fees, then you might suffer from a mutilating and disfiguring disease: AE.

In the litigation and expert witnesses class, we’re giving students datasets and assigning them positions (plaintiffs or defendants). One of the refreshing things about this exercise has been that it is forcing the student-experts to think about where attacks on their reports will come from. Perhaps even more importantly, because the sources of those attacks are their friends and peers (i.e., people they respect), students begin to remember something they knew before the academic environment tried to make them forget it: there are weaknesses in everything they do.

I don’t know if all academics suffer from AE. Perhaps I’ve been unlucky in meeting a great many who suffer from especially severe cases. Who knows? Perhaps I’m a carrier myself? (Nah . . .)

Posted by James Greiner at 6:00 AM

April 3, 2006

A Unified Theory of Statistical Inference?

If inference is the process of using data we have in order to learn about data we do not have, it seems obvious that there can never be a proof that anyone has arrived at the "correct" theory of inference. After all, the data we have might have nothing to do with the data we don't have. So all the (fairly religious) attempts at unification -- likelihood, Bayes, Bayes with frequentist checks, bootstrapping, etc., etc. -- each contribute a great deal but they are unlikely to constitute The Answer. The best we can hope for is an agreement, or a convention, or a set of practices that are consistent across fields. But getting people to agree on normative principles in this area is not obviously different from getting them to agree on the normative principles of political philosophy (or any other normative principles).

It just doesn't happen, and even if it does it would have merely the status of a compromise rather than the correct answer, the latter being impossible.

Yet, there is a unifying principle that would represent progress in a sense that would advance the field: we will know that something like unification has occurred when we distribute the same data, and the same inferential question, to a range of scholars with different theories of inference, that go by different names, use different conventions, and are implemented with different software, and yet they all produce approximately the same empirical answer.

We are not there yet, and there are some killer examples where the different approaches yield very different conclusions, but there does appear to be some movement in this direction. The basic unifying idea I think is that all theories of inference require some assumptions, but we should never take any theory of inference so seriously that we don't stop to check the veracity of the assumptions. The key is that conditioning on a model does not work, since of course all models are wrong, and some are really bad. What I notice is that most of the time, you can get roughly the same answers using (1) likelihood or Bayesian models with careful goodness of fit checks and adjustments to the model if necessary, (2) various types of robust, semi-parametric, etc. statistical methods, (3) matching for use as preprocessing data that is later analyzed or further adjusted by parametric likelihood or Bayesian methods, (4) Bayesian model averaging, with a large enough class of models to average over, (5) the related "committee methods", (6) mixture of experts models, and (7) some highly flexible functional forms, like neural network models. Done properly, these will all usually give similar answers.

This is related to Xiao-Li Meng's self-efficiency result: the rule that "more data are better" only holds under the right model. Inference can't be completely automated for most quantities, and we typically can't make inferences without some modeling assumptions, but the answer won't be right unless the assumptions are correct, and we can't ever know that the assumptions are right. That means that any approach has to come to terms with the concept that some of the data might not be right for the given model, or the model might be wrong for the observed data. Each of the approaches above has an extra component to try to get around the problem of incorrect models. This isn't a unification of statistical procedure, or a single unified theory of inference, but it may be leading to a unification of results of many diverse procedures, as we take the intuition from each area and apply it across them all.

Posted by Gary King at 6:00 AM

Applied Statistics - L.J. Wei and Tianxi Cai

This week the Applied Statistics Workshop will present a talk by L.J. Wei and Tianxi Cai of the Department of Biostatistics at the Harvard School of Public Health. Professor Wei received his Ph.D. in statistics from the University of Wisconsin at Madison and has served on the faculty of several universities before coming to Harvard in 1991. Professor Cai received her Sc.D. from the Harvard School of Public Health in 1999 and was a faculty member at the University of Washington before returning to HSPH in 2002. Professors Wei and Cai will present a talk entitled "Evaluating Prediction Rules for t-Year Survivors With Censored Regression Models." The presentation will be at noon on Wednesday, April 5 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The abstract of the paper follows on the jump:

Suppose that we are interested in establishing simple, but reliable rules for predicting future t-year survivors via censored regression models. In this article, we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models we derive consistent estimators for the above measures via substitution and cross validation estimation procedures. Furthermore, we provide large sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example, for the difference of the precision measures between two competing rules can then be constructed. All the proposals are illustrated with two real examples and their finite sample properties are evaluated via a simulation study.
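
For readers unfamiliar with the precision measures named in the abstract, here is a minimal sketch of how each is defined from a simple two-by-two table of predictions against outcomes. The counts are hypothetical and the function name is made up. The sketch also ignores censoring entirely, which is the hard part the paper addresses, so it should not be read as the authors' estimators.

```python
def precision_measures(tp, fp, fn, tn):
    """Summaries of a binary rule from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    return {
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
    }


# Hypothetical counts: "positive" means the rule predicts survival past t years.
measures = precision_measures(tp=40, fp=10, fn=20, tn=30)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Comparing two competing rules then amounts to comparing these numbers, and the confidence intervals mentioned in the abstract are for exactly such differences.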

Posted by Mike Kellermann at 12:00 AM

March 27, 2006

Spring break, blog break

This week is spring break for both Harvard and MIT, so as per usual, we will be posting less this week. Enjoy the (sort of) spring sunshine!

Posted by Amy Perfors at 6:00 AM

March 24, 2006

Another classroom demo: the scientific method

Gary's posts about teaching and breakfast cereal reminded me of a teaching experience I had once while teaching in the Peace Corps in Mozambique -- this time regarding the scientific method and hypothesis testing. It might be nothing particularly exciting to those of you who habitually teach pre-college level science, but I was surprised at how well it worked.

My (secondary-level) students were extremely good at memorizing facts, but they had a very hard time learning and applying the scientific method (as many do, I think). Since I see the method as the root of what makes science actually scientific and I didn't want them to have the view that science was just a disconnected collection of trivia, this was deeply problematic -- all the more frustrating to me because I could see that, in real life, they used the scientific method routinely. We all do, whenever we try to explain people's behavior or solve any of the everyday puzzles that confront us. The trick was to demystify it, to make them see that as well.

The next day I brought in an empty coke bottle. It's not vital that this be done with a coke bottle; in fact I imagine if you have more choice of materials than I had in Africa, you could find something even better. Basically I wanted something that was very familiar to them, to underscore the point that scientific reasoning is something they did all the time.

I held up the empty coke bottle. "What do you suppose had been in it?" I asked. This was the PROBLEM. "Coke!" everyone replied. "Okay," said I, "but I could have used it after the coke was gone for something else, right? What else could it have held?" Once again, people had no trouble suggesting possibilities -- water, gasoline, tea, other kinds of soda. I pointed out that they had just GENERATED HYPOTHESES, and wrote them on the board, along with coke.

Now, I asked them, how could you find out if your hypothesis was correct? They'd ask me, they said, and I pointed out that this was one way of TESTING the hypothesis. But suppose I wasn't around, or lied to them - what else could they do? One student suggested smelling it, and another (thinking about the gasoline hypothesis) suggested throwing a match in and seeing if it caught fire. "Both of these are good tests," I said, "and you'll notice that each of them is good for certain specific hypotheses; the match one wouldn't tell the difference between tea and other kinds of soda, for instance, and smelling it wouldn't help if it were water."

Then I asked a volunteer to come up and actually perform the test - to smell it, since we didn't have any matches. He did, and reported back that it smelled like Fanta even though it was a coke bottle. This, I said, was the RESULT, and it enabled the class to draw the CONCLUSION - that I had put Fanta in the bottle after drinking all of the original coke.

The best part of this demo came when a student, seeking to "trap" me, pointed out that I could still have had water or tea in the bottle, just long enough ago that the Fanta smell was stronger. "Exactly!" I replied. This points out the two limitations of the scientific method -- the validity of your conclusion depends on your hypotheses and on how good your methods of testing are. There are always a potentially infinite number of hypotheses you haven't ruled out, and therefore we cannot draw any conclusion with 100% accuracy. Plus, if our test can't tell the difference between two hypotheses, then we can't decide between those two. For this reason it's very important to have hypotheses that you can test, and to work to develop better methods of testing so that you can eliminate more plausible hypotheses.

This led to a good discussion about the pros and cons of the scientific method and how it compared to other ways of understanding the world. If I had had more time, equipment, or room, I had hoped to make it more interactive, with stations where they had to apply the method to lots of simple real-world problems; but even as it was, it was valuable.

I was surprised at how well this demo worked... not only did they immediately understand how to apply the scientific method, but they also understood its limitations in a way that I think many people don't, even by college age. As the semester advanced, I found myself referring back to the lesson often ("remember the empty coke bottle") when I'd try to explain how we knew what we knew. And I think it was very freeing for them to realize that science wasn't some mysterious system of rules passed down from on high, but rather the best explanation we had so far (and the best way we knew of how to get that explanation). My favorite result of this demo was their realization that scientists were people just like themselves, and that they too could do it -- in fact, they already were.

Posted by Amy Perfors at 6:00 AM

March 23, 2006

Control Groups for Breakfast, Revisited

A few months ago, I wrote an entry entitled The Value of Control Groups in Causal Inference (and Breakfast Cereal). It was a report on a fun experiment I did that worked well both in my daughter's kindergarten class and my graduate methods class at Harvard. There were a fair number of comments posted in the blog, and I also received dozens of other notes from parents and school teachers all over the country with many interesting questions and suggestions.

That correspondence covered four main points:

  1. Some people suggested a variety of interesting alternative experiments, which is great, but in designing these many forgot that you must always have a control group. That's the main lesson of the experiment: you often learn nothing without some kind of control group, and teaching this to kids (and graduate students!) is quite important.
  2. Some people didn't squish the cereal enough and the magnet didn't pick up the pieces. The magnet will attract the cereal only when it is squished very well, since the bits of iron are very small.
  3. People then asked why the cereal doesn't stick to the magnet without squishing it up. The reason is the same one that keeps a magnet from picking up a nail driven into a log, even though it will pick up the nail on its own.
  4. Finally, most people asked for other experiments they could run with their kids. For that, which I'm writing up now, please tune in next time!

Posted by Gary King at 6:00 AM

March 22, 2006

Valid Standard Errors for Propensity Score Matching, Anyone?

Jens Hainmueller

Propensity Score Matching (PSM) has become an increasingly popular method to estimate treatment effects in observational studies. Most papers that use PSM also provide standard errors for their treatment effect estimates. I always wonder where these standard errors actually come from. To my knowledge there still exists no method to calculate valid standard errors for PSM. What do you all think about this topic?

The issue is this: Getting standard errors for PSM works out nicely when the true propensity score is known. Alberto and Guido have developed a formula that provides principled standard errors when matching is done with covariates or the true propensity score. You can read about it here. This formula is used by their nnmatch matching software in Stata and Jasjeet Sekhon’s matching package in R.

Yet, in observational settings we do not know the true propensity score so we first have to estimate it. Usually people regress the treatment indicator on a couple of covariates using a probit or logit link function. The predicted probabilities from this model are then extracted and taken as the estimated propensity score to be matched on in the second step (some people also match on the linear predictor, which is desirable because it does not tend to cluster so much around 0 and 1).

Unfortunately, the abovementioned formula does not work in the case of matching on the estimated propensity score, because the estimation uncertainty created in the first step is not accounted for. Thus, the confidence bounds on the treatment effect estimates in the second step will most likely not have the correct coverage.
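To make the two-step procedure concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the simulated data (one covariate, a true effect of 2.0), the hand-rolled logit fit, and the function names are not taken from nnmatch, psmatch2, or any other package. It matches on the linear predictor, as mentioned above, and it produces only a point estimate. It does nothing to solve the standard error problem, which is the whole point of this post.

```python
import math
import random


def fit_logit(X, t, steps=500, lr=0.5):
    """Step 1: fit a logit of treatment on covariates by gradient ascent."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            resid = ti - 1.0 / (1.0 + math.exp(-z))
            grad[0] += resid
            for j, x in enumerate(xi):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def linear_predictor(beta, xi):
    """Estimated propensity score on the logit (linear predictor) scale."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], xi))


def att_by_matching(score, t, y):
    """Step 2: 1-nearest-neighbor matching with replacement on the score."""
    controls = [(s, yi) for s, ti, yi in zip(score, t, y) if ti == 0]
    diffs = []
    for s, ti, yi in zip(score, t, y):
        if ti == 1:
            _, y_match = min(controls, key=lambda c: abs(c[0] - s))
            diffs.append(yi - y_match)
    return sum(diffs) / len(diffs)


# Simulated observational data (an assumption, for illustration only).
rng = random.Random(0)
X, t, y = [], [], []
for _ in range(500):
    x = rng.gauss(0.0, 1.0)
    p_treat = 1.0 / (1.0 + math.exp(-0.5 * x))
    ti = 1 if rng.random() < p_treat else 0
    X.append([x])
    t.append(ti)
    y.append(x + 2.0 * ti + rng.gauss(0.0, 1.0))

beta = fit_logit(X, t)
score = [linear_predictor(beta, xi) for xi in X]
att = att_by_matching(score, t, y)
print("estimated ATT:", att)
```

Note that `score` depends on the estimated `beta`, so any uncertainty in step 1 feeds into the matches chosen in step 2. A formula that treats `score` as known ignores that dependence.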

This issue is not easily resolved. Why not just bootstrap the whole two-step procedure? Well, there is evidence to suggest that the bootstrap is likely to fail in the case of PSM. In the closely related problem of deriving standard errors for conventional nearest neighbor matching, Guido and Alberto show in a recent paper that even in the simple case of matching on a single continuous covariate (when the estimator is root-N consistent and asymptotically normally distributed with zero asymptotic bias) the bootstrap does not provide standard errors with correct coverage. This is due to the extreme non-smoothness of nearest neighbor matching, which leads the bootstrap variance to diverge from the actual variance.

In the case of PSM the same problem is likely to occur unless estimating the propensity score in the first step makes the matching estimator smooth enough for the bootstrap to work. But this is an open question. At least to my knowledge there exists no Monte Carlo evidence or theoretical justification for why the bootstrap should work here. I would be interested to hear opinions on this issue. It’s a critical question because the bootstrap for PSM is often done in practice: various matching codes (for example pscore or psmatch2 in Stata) offer bootstrapped standard error options for matching on the estimated propensity score.

Posted by Jens Hainmueller at 6:00 AM

Applied Statistics - Jeff Gill

Today at noon, the Applied Statistics Workshop will present a talk by Jeff Gill of the Department of Political Science at the University of California at Davis. Professor Gill received his Ph.D. from American University and served on the faculty at Cal Poly and the University of Florida before moving to Davis in 2004. His research focuses on the application of Bayesian methods and statistical computing to substantive questions in political science. He is the organizer for this year's Summer Methods Meeting sponsored by the Society for Political Methodology, which will be held at Davis in July. He will be a visiting professor in the Harvard Government Department during the 2006-2007 academic year.

Professor Gill will present a talk entitled "Elicited Priors for Bayesian Model Specifications in Political Science Research." This talk is based on joint work with Lee Walker, who is currently a visiting scholar at IQSS. The presentation will be at noon on Wednesday, March 22 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The abstract of the paper follows on the jump:

We explain how to use elicited priors in Bayesian political science research. These are a form of prior information produced by previous knowledge from structured interviews with subjective area experts who have little or no concern for the statistical aspects of the project. The purpose is to introduce qualitative and area-specific information into an empirical model in a systematic and organized manner in order to produce parsimonious yet realistic implications. Currently, there is no work in political science that articulates elicited priors in a Bayesian specification. We demonstrate the value of the approach by applying elicited priors to a problem in judicial comparative politics using data and elicitations we collected in Nicaragua.

Posted by Mike Kellermann at 12:01 AM

March 21, 2006

World Health Surveys: Arriving Soon

Sebastian Bauhoff

Good data on health-related issues in developing countries is hard to find, especially if you need large samples and cross-country comparability. The latest round of the World Health Surveys (WHS) will start to become available to researchers in the coming months and might be one of the best surveys out there, in addition to the Demographic and Health Surveys (DHS).

The current WHS was conducted in 70 countries in 2000-2001. The survey is standardized and comes with several modules, including measures of health states of populations; risk factors; responsiveness of health systems; coverage, access and utilization of key health services; and health care expenditures. The instruments use several innovative features, including anchoring vignettes and geocoding, and seem to collect more information on income/expenditure than DHS does.

From the looks, WHS could easily become the new standard dataset for cross-country comparisons of health indicators, though for some applications it might be more of a complement than substitute for the DHS. As of now, the questionnaires and some country reports are online, and the micro-data is supposed to be available by the middle of the year at the latest.

Posted by Sebastian Bauhoff at 6:00 AM

March 20, 2006

Making Diagnostics Mandatory

Jim Greiner

Teaching a class (see here) on the interaction between lawyers, most of whom lack quantitative training, and quantitative analysts has me thinking about the danger statistical techniques pose. As is true of those who study any branch of specialized knowledge, statisticians can abuse the trust decision makers (judges, government officials, members of the public) put in us all too easily, and often with impunity. (Of course, “we” all know that “we” would never do any such thing, even though “we” know that “everyone else” does it all the time. Gee.)

If it’s of interest (or perhaps more accurately, unless a barrage of comments tells me I’m being boring), I’ll be blogging about ways “everyone else” abuses trust, and ways “we” can try to stop it. Here’s my first suggestion: make diagnostics mandatory.

Here’s what I mean. I’ve previously blogged (see here) on the double-edged sword posed by the recent trend towards academics’ writing free software to fit models they’ve developed. One way for software-writers to lessen the danger that their models will be abused is to write diagnostics into their programming . . . and make those diagnostics hard to turn off. Suppose, for example, that some analysts are writing code to implement a new model, and the fitting process requires fancy MCMC techniques. These analysts should write MCMC convergence diagnostics into the software, and should set their defaults so that the fitting process produces these diagnostics unless it’s told not to. Perhaps, the analysts should even make it a little tough to turn off the diagnostics. That way, even if the user doesn’t look at the diagnostics, someone else (perhaps an opposing expert in a court case?) might have easier access to them.
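As a rough illustration of what "diagnostics on by default, and a little tough to turn off" could look like, here is a small Python sketch. It is not drawn from any existing package: the `fit` function, its deliberately awkward opt-out argument, and the toy random-walk Metropolis sampler are all assumptions made up for this example. The diagnostic is the Gelman-Rubin potential scale reduction factor computed across several chains.

```python
import math
import random
import statistics


def metropolis_chain(log_target, start, n_draws, rng, step=1.0):
    """Toy random-walk Metropolis sampler for a one-dimensional target."""
    x, draws = start, []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        draws.append(x)
    return draws


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length chains."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


def fit(log_target, n_chains=4, n_draws=2000, seed=0,
        i_know_what_i_am_doing_skip_diagnostics=False):
    """Run several chains; report R-hat unless the user goes out of the way."""
    rng = random.Random(seed)
    chains = [
        metropolis_chain(log_target, rng.uniform(-5.0, 5.0), n_draws, rng)
        for _ in range(n_chains)
    ]
    result = {"chains": chains}
    if not i_know_what_i_am_doing_skip_diagnostics:
        kept = [c[n_draws // 2:] for c in chains]  # drop first half as warm-up
        result["r_hat"] = gelman_rubin(kept)
    return result


# Standard normal target, written as an unnormalized log density.
result = fit(lambda x: -0.5 * x * x)
print("R-hat:", result["r_hat"])
```

Because the diagnostic is stored in the returned object rather than only printed, a second reader of the output can find it even if the original user never looked.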

The worry, of course, is that the output from all new software will end up looking like it came out of SAS (a package I wouldn’t wish on my worst enemy). Still, as our cognitive psychologist could probably tell us, people are incredibly lazy. Even if a user of software just has to go to a drop-down menu to look at a diagnostic, chances are he/she won’t bother.

Posted by James Greiner at 6:00 AM

March 16, 2006

Are people Bayesian? (and what does that mean?) Part II

In my last post I talked about computational vs. algorithmic level descriptions of human behavior, and I argued that most Bayesian models of reasoning are examples of the former -- and thus make no claims about whether and to what extent the brain physically implements them.

A common statement at this point is that "of course your models don't say anything about the brain -- they are so complicated, how could they? Do people really do all that math?" I share the intuition: the models do look complex, and I am certainly not aware of doing anything like this when I think, but I don't think the possibility can be rejected out of hand. In other words, while it's certainly possible that human brains do nothing like, say, MCMC [insert complicated computational technique here], it's not a priori obvious. Why?

I have three reasons. First of all, we really don't have any good conception of what the brain is capable of computationally -- it has billions of neurons, each of which has thousands of connections, and (unlike modern computers) is a massively parallel computing device. State of the art techniques like MCMC look complicated when written out as mathematical equations -- particularly to those who don't come from that background -- but that doesn't mean, necessarily, that they are complicated in the brain.

Secondly, every model I've seen generally gets its results after running for at most a week, usually for only a few minutes -- much less time than a human has to go about and form theories of the world. If you are studying how long-term theories or models of the world form, it's not at all clear how to compare the time a computer takes to the time a human takes: not only are the scales really different, so is the data they get (models generally have cleaner data, but far less) and so is the speed of processing (computers are arguably faster, but if a human can do in parallel what a computer does serially, this might mean nothing). The point is that comparing a computer after 5 minutes to a human over a lifetime might not be so silly after all.

Thirdly, both the strength and weakness of studying cognitive science is that we have clear intuitions about what cognition and thinking are. It's a strength in that it helps us judge hypotheses and have good intuitions -- but it's a weakness in that it causes us to accept or reject ideas based on these intuitions when maybe we really shouldn't. There's a big difference between conscious and unconscious reasoning, and most (if not all) of our intuitions are based on how we see ourselves reason consciously. But just because we aren't aware of, say, doing Hebbian learning doesn't mean we aren't. It's striking to me that people who make Bayesian models of vision rarely have to deal with questions like "but people don't do that! it's so complicated!" This in spite of the fact that it's the same brain. I think this is probably because we don't have conscious awareness of the process of vision, and so don't therefore think we know how it works. But to the extent that higher cognition is unconscious, the same point applies. It's just easy to forget.

Anyway, I'd be delighted to hear objections to any of these three reasons. As I said in the last post, I'm still sorting out these issues for myself, so I'm not really dogmatically arguing any of this.

Posted by Amy Perfors at 6:00 AM

March 15, 2006

Incompatibility: Are You Worried?

Jim Greiner

I’m a teaching fellow for a course in missing data this semester, and one topic keeps coming up peripherally in the course, even though we haven’t tackled it head-on just yet. That topic is incompatible conditional distributions. And here’s my question for blog readers: how much does it bother you?

Reduced to its essence, here’s the issue. Suppose I have a dataset with three variables, A, B, and C. There are multiple missing data patterns, and suppose (although it’s not essential to the problem) that I want to use multiple imputation to create six or seven complete analysis datasets. Suppose also that it’s very difficult to conceive of a minimally plausible joint distribution p(A, B, C). Perhaps A is semi-continuous (e.g., income), B is categorical with 5 possible values, and C has support only over the negative integers. What (as I understand it) is often done in this case is to assume conditional distributions, for example, p*(A|B, C), p*(B|A, C), and p*(C|A, B). The idea is that one does a “Gibbs” with these three conditional distributions, as follows. Find starting values for the missing Bs and Cs. Draw missing As from p*(A|B, C). Then draw new Bs from p*(B|A, C) using the newly drawn As and the starting Cs. Continue as though you were doing a real “Gibbs.” Stop after a certain number of iterations and call the result one of your multiply imputed datasets.
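For readers who prefer code, the loop just described might be sketched in Python as follows. The three samplers `draw_a`, `draw_b`, and `draw_c` are invented stand-ins for p*(A|B, C), p*(B|A, C), and p*(C|A, B): their functional forms and coefficients are made up here, whereas in practice each would be a regression fit to the observed data. Nothing in the sketch checks whether the three conditionals are compatible with any joint distribution.

```python
import math
import random

rng = random.Random(1)


def draw_a(b, c):
    """Hypothetical p*(A|B, C): a continuous draw."""
    return rng.gauss(0.5 * b - 0.2 * c, 1.0)


def draw_b(a, c):
    """Hypothetical p*(B|A, C): categorical on 1..5."""
    weights = [math.exp(-abs(k - 3 - 0.5 * a + 0.1 * c)) for k in range(1, 6)]
    return rng.choices(range(1, 6), weights=weights)[0]


def draw_c(a, b):
    """Hypothetical p*(C|A, B): support on the negative integers."""
    p_stop = b / (b + 1.0 + math.tanh(abs(a)))
    c = -1
    while rng.random() > p_stop:
        c -= 1
    return c


def impute_once(rows, n_iter=20):
    """Cycle through the conditionals, redrawing only the missing cells."""
    data = [dict(r) for r in rows]
    missing = [{k for k, v in r.items() if v is None} for r in rows]
    # Starting values are needed only for B and C, since A is drawn first.
    for r, m in zip(data, missing):
        if "B" in m:
            r["B"] = 3
        if "C" in m:
            r["C"] = -1
    for _ in range(n_iter):
        for r, m in zip(data, missing):
            if "A" in m:
                r["A"] = draw_a(r["B"], r["C"])
            if "B" in m:
                r["B"] = draw_b(r["A"], r["C"])
            if "C" in m:
                r["C"] = draw_c(r["A"], r["B"])
    return data


# None marks a missing value; the rows show several missingness patterns.
rows = [
    {"A": 1.2, "B": 2, "C": -3},
    {"A": None, "B": 4, "C": -1},
    {"A": 0.3, "B": None, "C": None},
    {"A": None, "B": None, "C": -2},
]

# Six completed datasets, each from its own run of the loop.
imputations = [impute_once(rows) for _ in range(6)]
print(imputations[0])
```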

The incompatibility problem is that there may be no joint distribution that has conditional distributions p*(A|B, C), p*(B|A,C), and p*(C|A, B). Remember, (proper) joint distributions determine conditional distributions, but conditional distributions do not determine joint distributions, and in some cases, one can actually prove mathematically that no joint distribution has a particular set of conditionals. If you ran your “Gibbs” long enough, eventually your draws would wander off to infinity or become absorbed into a boundary of the parameter space. In other words, your computer would complain; exactly how it would complain depends on how you programmed it.

I confess this incompatibility problem bothers me more than it appears to bother some of my mentors. If the conditional distributions are incompatible, then I KNOW that the "model" I’m fitting could not have generated the data I see. It seems like even highly improbable models are better than impossible ones. On the other hand, I am sympathetic to the idea of doing the best one can, and what else is there to do in (say) large datasets with multiple, complicated missing data patterns and unusual variable types?

How much does incompatibility bother you?

Posted by James Greiner at 6:00 AM

March 14, 2006

So You Want to Do a Survey?

"I'm doing a survey. I've never done this before, taken any classes on survey research, or read any books on the subject, and a friend suggested that I get some advice. Can you help me? I'm going in the field next week."

Someone has asked me versions of this question almost every month since I was a graduate student, and every time I have to convey the bad news: doing survey research right is extremely difficult. The reason the question keeps coming up is that it seems like such a reasonable question: what could be hard about asking questions and collecting some answers? What could someone do wrong that couldn't be fixed in a quick conversation? Don't we ask questions informally in casual conversation all the time? Why can't we merely write up some questions, get some quick advice from someone who has tried this before, and go do a survey?

Well, it may seem easy, but survey research requires considerable expertise, not any less than heart surgery or flying military aircraft. Survey research should not be done casually if you care about the results. Survey research seems easy because it's possible to learn a little without much expertise, whereas doing a little heart surgery with a dinner knife, or grabbing the keys to a B-2 after seeing Top Gun, wouldn't accomplish anything useful.

Survey research is not easy; in fact, it's a miracle it works at all. Think about it this way. When was the last time you had a misunderstanding with your spouse, a miscommunication with your parent or child, or your colleague thought you were saying one thing and you meant another? That's right: you've known these people for decades and your questions are still misunderstood. When was the last time your carefully worded and extensively rewritten article or book was misunderstood? This happens all the time. And yet you think you can walk into the home of someone you've never met, or do a cold call on the phone, and in five minutes elicit their inner thoughts without error? It's hard to imagine a more arrogant, unjustified assumption.

So what's a prospective survey researcher to do? Taking a course, reading some books, etc., would be a good start. Our blog has discussed some issues in survey research before, such as in this entry and this one on using anchoring vignette technology to deal with the problem of survey respondents who may interpret survey questions differently from each other and from the investigator. Issues of missing data arise commonly in survey research too. I'm sure we'll discuss lots of other survey-related issues on this blog in the future as well.

A more general facility for information on the subject is the Institute for Quantitative Social Science's Survey Research Program, run by Sunshine Hillygus. This web site has a considerable amount of information on the art and science of questioning people you don't know on topics they may know. If readers are aware of any resources not listed on this site that may be of help to survey researchers, please post a comment!

Posted by Gary King at 6:00 AM

March 13, 2006

Applied Statistics - Felix Elwert

This week, the Applied Statistics Workshop will present a talk by Felix Elwert, a Ph.D. candidate in the Harvard Department of Sociology. Felix received a B.A. from the Free University of Berlin and an M.A. in sociology from the New School for Social Research before joining the doctoral program at Harvard. His article on widowhood and race, joint work with Nicholas Christakis, is forthcoming in the American Sociological Review. He is also a fellow blogger on the Social Science Statistics blog. On Wednesday, he will present a talk entitled "Trial Marriage Reconsidered: The effect of cohabitation on divorce". The presentation will be at noon on Wednesday, March 15 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Mike Kellermann at 9:37 PM

Non-Independence in Competing Risk Models

Felix Elwert

A central assumption in competing risk analysis is the conditional independence of the risks under analysis. Suppose we are interested in cause-specific mortality due to causes A, B, and C. If we assume that the process leading to death from A is independent (conditional on covariates) from the process leading to death from B, then the likelihood factors nicely, and estimation via a series of standard 0/1 hazard models is straightforward. For example, it may be reasonable to assume that death from lung cancer (cause A) is independent of death from being struck by a meteorite (cause B). But it is much less reasonable to assume that death from lung cancer (A) is independent of the risk of dying from emphysema (C), unless we are lucky enough to have, say, appropriate covariate information on smoking history.
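To spell out what "factors nicely" means, here is the standard latent-failure-time version of the argument. The notation is introduced for this illustration and is not from the post: $t_i$ is the observed time for individual $i$, $x_i$ the covariates, $d_{ik}=1$ if $i$ dies of cause $k$ (all $d_{ik}=0$ if censored), and $h_k$ and $S_k$ are the hazard and survivor functions of the latent time to death from cause $k$.

```latex
% Under conditional independence of the K latent failure times,
% the joint survivor function is the product of the marginals, so
\begin{align*}
L &= \prod_{i=1}^{N} \prod_{k=1}^{K}
     h_k(t_i \mid x_i)^{d_{ik}} \, S_k(t_i \mid x_i) \\
  &= \prod_{k=1}^{K} \left[ \prod_{i=1}^{N}
     h_k(t_i \mid x_i)^{d_{ik}} \, S_k(t_i \mid x_i) \right],
\qquad
S_k(t \mid x) = \exp\!\left( -\int_0^{t} h_k(u \mid x)\, du \right).
\end{align*}
```

Each bracketed term is the likelihood of an ordinary single-event hazard model for cause $k$ in which deaths from every other cause are treated as censored observations. That is why the independence assumption here is the same as the independent-censoring assumption in standard hazard models.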

The problem is partly rhetorical. The independence assumption in competing risk analysis is the exact same as the assumption of independent censoring in standard hazard models. Few applied papers even mention the latter (unfortunately). In competing risk analysis, however, the assumption becomes quite a bit more visible, and thus harder to hide…

There are a small number of strategies, none particularly popular, to cope with dependence. Sanford C. Gordon recently contributed a new strategy in “Stochastic Dependence in Competing Risks,” AJPS 46(1), 2002, which builds on an earlier idea of drawing random effects. Rather than drawing individual specific random effects, as has been suggested before by Clayton 1978, Gordon draws risk and individual specific random effects. Thus, a K-risk model on a sample of N individuals may contain up to KxN separate random effects, one for each risk and individual.

The advantage of this strategy is that it allows for the estimation of the direction of dependence (previous work had to assume a specific direction). The disadvantage is that estimation via conditional logit models is very expensive, on the order of several days for moderate size samples of a few thousand cases.

Posted by Felix Elwert at 6:00 AM

March 10, 2006

Are people Bayesian? (and what does that mean?) Part I

Anyone who is interested in Bayesian models of human cognition has to wrestle with the issue of whether people use the same sort of reasoning (and, if so, to what extent this is true, and how our brains do that). I'll be doing a series of posts exploring what I think about this issue (which isn't really set in stone yet -- so think of this as "musing out loud" rather than saying "this is the way it is").

First: what does it mean to say that people are (or are not) Bayesian?

In many ways the question of whether people do the "same thing" as the model is a red herring: I use Bayesian models of human cognition in order to provide computational-level explanations of behavior, not algorithmic-level explanations. What's the difference? A computational-level explanation seeks to explain a system in terms of the goals of the system, the constraints involved, and the way those various factors play out. An algorithmic-level explanation seeks to explain how the brain physically does this. So any single computational explanation might have a number of possible different algorithmic implementations. Ultimately, of course, we would like to understand both: but I think most phenomena in cognitive science are not well enough understood on the computational level to make understanding on the algorithmic level very realistic, at least not at this stage.

To illustrate the difference between computational and algorithmic, I'll give an example. People given a list of words to memorize show certain regular types of mistakes. If the list contains many words with the same theme - say, all having to do with sports, but never the specific word "sport" - people will nevertheless often incorrectly "remember" seeing "sport". One possible computational-level explanation of what is going on might suggest, say, that the function of memory is to use the past to predict the future. It might further say that there are constraints on memory deriving from limited capacity and limited ability to encode everything in time, and that as a result the mind seeks to "compress" information by encoding the meaning of words rather than their exact form. Thus, it is more likely to "false positive" on words with similar meanings but very different forms.

That's one of many possible computational-level explanations of this specific memory phenomenon. The huge value of Bayesian models (and computational models in general) is that they make this type of explanation rigorous and testable - we can quantify "limited capacity" and what is meant by "prediction" and explore how they interact with each other, so we're not just throwing words around. There is no claim in most computational cognitive science, implicit or explicit, that people actually implement the same computations our models do.

There is still the open question of what is going on algorithmically. Quite frankly, I don't know. That said, in my next post I'll talk about why I don't think we can reject out of hand the idea that our brains are implementing something (on the algorithmic level) that might be similar to the computations our computers are doing. And then in another post or two I'll wrap up with an exploration of the other possibility: that people are adopting heuristics that approximate our models, at least under some conditions. All this, of course, is only true to the extent that the models are good matches to human behavior -- which is probably variable given the domain and the situation.

Posted by Amy Perfors at 6:00 AM

March 9, 2006

Comparative Politics Dataset Award

As you know, Alan Greenspan retired from the Fed about a month ago (and already has an $8M book deal, but I digress...). Jens' post below reminded me of one of my favorite Greenspan quotes: "I suspect greater payoffs will come from more data than from more technique." He was speaking to economists about models for forecasting economic growth, but I suspect his comments apply at least as strongly to political science and other social sciences. You might have the most cutting-edge, whiz-bang, TSCS-2SLS-MCMC evolutionary Bayesian beta-beta-beta-binomial model that will tell you the meaning of life and wash your car at the same time, but if the data that you put in is either non-existent or garbage, it isn't going to do you a lot of good. Unfortunately, the incentives in the profession do not seem sufficient to reward the long, tedious efforts required to collect high-quality data and to make it publicly available to the academic community. Most scholars would surely like to have better data; they would just prefer that someone else collect it.

Having said all that, it is worth noting efforts that make data collection and dissemination a more rewarding pursuit. One such effort is the Dataset Award given by the APSA Comparative Politics section for "a publicly available data set that has made an important contribution to the field of comparative politics." This year's request for nominations hits the nail on the head:

The interrelated goals of the award include a concern with encouraging development of high-quality data sets that contribute to the shared base of empirical knowledge in comparative politics; acknowledging the hard work that goes into preparing good data sets; recognizing data sets that have made important substantive contributions to the field of comparative politics; and calling attention to the contribution of scholars who make their data publicly available in a well-documented form.

The section is currently accepting nominations for the 2006 award, with a deadline of April 14. Information about nominating a dataset can be found here.

Posted by Mike Kellermann at 1:32 PM

March 8, 2006

EM And Multi-level Models

Jim Greiner

One of the purposes of this blog is to allow us to share quantitative problems we’re currently considering. Here’s one that arose in my research, and I’d love any comments and suggestions readers might have: can one apply the EM algorithm to help with missing data in multi-level models?

Schematically, the problem I ran into is as follows: A_ij | B_i follows some distribution, call it p1_i, and I had n_i observations of A_ij. A_ij was a random vector, and some parts of some observations were missing. B_i | C follows some other distribution, call it p2. Suppose I’m a frequentist, and I want to make inferences about C. The problem I kept running into was that I couldn’t figure out how to use EM without integrating the B_i’s out of the likelihood, a mathematical task that exceeded my skills. I ended up switching to a Bayesian framework and using a Gibbs sampler, i.e., drawing from the distribution of the missing data given the current value of the parameters, then from the distribution of the parameters given the now-complete data. But I couldn’t help wondering, are hardnosed frequentists just screwed in this situation, do they have to resort to something like Newton-Raphson, or is there an obvious way to use EM that I just missed?

Posted by James Greiner at 6:00 AM

March 7, 2006

Applied Statistics - Roland Fryer

This week, the Applied Statistics Workshop will present a talk by Roland Fryer, a Junior Fellow of Harvard Society of Fellows, resident in the Economics Department. Dr. Fryer received his Ph.D. in economics from The Pennsylvania State University in 2002, and was an NSF post-doctoral fellow before coming to Harvard. His work has appeared in several journals, including the Quarterly Journal of Economics and the Review of Economics and Statistics. Dr. Fryer will present a talk entitled "Measuring the Compactness of Political Districting Plans". The presentation will be at noon on Wednesday, March 8 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Mike Kellermann at 11:34 AM

Data Availability

Sebastian Bauhoff

Currently most students in Gov 2001 are preparing for the final assignment of the course: replicating and then improving on a published article. While scouting for a suitable piece myself, I came across the debate about whether (and how) data should be made available.

It is somewhat surprising that nowadays one can get all sorts of scholarly research off the web, except for the data that produced the results. Given that methods already exist to ensure that data remains proprietary and confidential, omitting the data from publication seems rather antiquated, unnecessary and counter-productive to scientific advance. Some health datasets -- such as AddHealth, which arguably contains some of the most sensitive information -- have successfully been public for a few years already. There's of course an intriguing debate about this which Gary's website partly documents.

It seems that we are slowly coming within reach of universal data publication. Apart from projects like ICPSR, several major journals recently started to request that authors submit data and code. The JPE explained to me that they expect to have data for some articles from April 2006, and that 'only the rare article will not include the relevant datasets' from early 2007.

Since debating the robustness of existing results seems like good research, making data and code available could spur quite a lot of articles. I wonder what the effects on journal content will be. Rather than publishing various replications, maybe journals will post those only online? Or will there be specialized journals to do that, to keep the major publications from being jammed?

Posted by Sebastian Bauhoff at 6:00 AM

March 6, 2006

An Unintended Potential Consequence of School Desegregation

Felix Elwert

One goal of school desegregation is to promote racial understanding by fostering interracial contact. In an article in the American Journal of Sociology (1998, Vol. 103[5]), Scott Feld and William Carter develop a simple combinatorial argument about a surprising potential consequence of school desegregation.

They argue that under certain (not so outlandish) circumstances, school desegregation may actually decrease rather than increase opportunities for interracial contact.

Here is their argument by way of a stylized example. Suppose there are four schools, one with capacity C1=400, and three schools with capacities C2=C3=C4=200 students. Under segregation, all 100 black students in the district attend the big school. The 900 other students are white. Assuming that students only interact with students in their own school, there are thus 300*100=30,000 possible interracial, intra-school ties. Now desegregate such that the percentage of black students is the same in all four schools. Then there are 360*40 potential interracial, intra-school friendships in the big school, and 180*20 potential interracial, intra-school friendships in each of the three small schools. Hence, the total number of potential interracial friendships post-desegregation is 25,200, as compared to 30,000 pre-desegregation.
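The arithmetic in the stylized example can be checked with a few lines of Python. This is only a sketch of the counting argument as stated above; the function name and the (white, black) tuple layout are my own choices for illustration, not anything from Feld and Carter's article.

```python
def interracial_ties(schools):
    """Count potential interracial, intra-school ties.

    `schools` is a list of (white, black) head counts, one pair per school.
    Within a school every white-black pair is one potential tie.
    """
    return sum(white * black for white, black in schools)


capacities = [400, 200, 200, 200]   # one big school, three small ones
total_black, total_students = 100, 1000

# Segregation: all 100 black students attend the big school.
segregated = [(300, 100), (200, 0), (200, 0), (200, 0)]

# Desegregation: every school gets the district-wide share of black students.
desegregated = []
for capacity in capacities:
    black = capacity * total_black // total_students
    desegregated.append((capacity - black, black))

ties_before = interracial_ties(segregated)
ties_after = interracial_ties(desegregated)
print(ties_before, ties_after)
```

The drop comes entirely from spreading a small group across schools of unequal size: the product white * black in the big school falls from 300*100 to 360*40, and the three small schools do not make up the difference.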

Whether this decrease in potential ties will actually result in a decrease in realized ties is an empirical question, dependent on factors spelled out in the article. Feld and Carter go on to show that this particular instance is an example of the so-called Class Size Paradox, known from various applications in sociology.

Posted by Felix Elwert at 6:00 AM

March 3, 2006

On communication

Jim's entry about the use of the word "parameter" got me thinking about a related issue I wrestle with all the time: communicating the importance and value of computational models in psychology to traditional psychologists.

There is a certain subset of the cognitive science community that is interested in computational/statistical models of human reasoning, stemming from the 70s and the 80s, first with Strong AI and the rise of connectionism. Nowadays, I think more people are becoming interested in Bayesian models, though admittedly it's hard to tell how big this is because of sample bias: since it's what my lab does, I don't have a clear sense of how many people don't know or care about this approach, since they are the very people I'm least apt to converse with.

Nevertheless, I think I can say with some confidence that a not inconsequential number of psychologists just don't see the value of computational models. Though I think some of that is for good reasons (some of which I share), I'm ever more convinced that a lot of this is because we, the computational and quantitative people, do such a lousy job of explaining why they are important, in terms that a non-computationally trained person can understand.

Part of it is word choice: as Jim says, we have absorbed jargon to the point that it is second-nature to us, and we don't even realize how jargony it might be ("parameters", "model", "Bayesian", "process", "generative", "frequentist", "likelihood" - and I've deliberately tried to put on this list some of the least-jargony terms we habitually use). But I think it also relates to deeper levels of conceptualization -- we have trained ourselves to the point that when something is described mathematically, we can access the intuition fairly easily, and thus forget that the mathematical description doesn't have the same effect for other people. I was recently at a talk geared toward traditional psychologists in which the speaker described what a model was doing in terms of coin flipping and mutation processes. It was perfectly accurate and certainly less vague than the corresponding intuition, but I think he lost a few people right there: since they couldn't capture the intuition rapidly enough, the model felt both arbitrary and too complicated to them. I don't think it's a coincidence that arbitrariness and "too much" complexity are two of the most common criticisms leveled at computational modelers by non-modelers.

The point? Though we shouldn't sacrifice accuracy in order to make vague, handwavy statements, it's key to accompany accurate statistical descriptions with the corresponding intuitions that they capture. It's a skill that takes practice to develop (learning this is one of the reasons I blog, in fact), and it requires being constantly aware of what might be specialized knowledge that your listener might not know. But it's absolutely vital if we want quantitative approaches to be taken seriously by more non-quantitative folks.

Posted by Amy Perfors at 6:00 AM

March 2, 2006

Freaks And "Parameter"

Jim Greiner

In a previous post, I briefly described the joint Law School/Department of Statistics course I’m currently co-teaching in which law students act as lawyers and quantitative students act as experts in simulated litigation. I’ll be writing about some of the lessons learned from this course in blog entries, especially lessons about what is quickly becoming the course’s central challenge for the students: communication between those with quantitative training and those without. Here’s my first lesson for the quantitatively adept: avoid the word “parameter.”

Of course it isn’t the word “parameter” so much as it is any of the jargon that we in the quantitative social science business use every day. And everyone knows that if you’re speaking to persons from another field, you have to speak in regular English (if that’s what you’re speaking). The hard part is remembering what regular English sounds like. We in quantitative social science don’t realize what freaks we become.

Here’s the vignette. In a recent session of the class, a student sought to explain to some lawyers how simulation can be used to test whether a model is doing what it’s supposed to do. She got as far as explaining how one could use a computer to simulate data, but when she began to explain checking to see whether an interval produced by the model covered the known truth, she used the word “parameter.” The change in expression on the law students’ faces resembled air going out of a balloon.
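For readers on the quantitative side who want to see the kind of check the student was describing, here is a minimal sketch in Python. Everything in it is an assumption made for illustration (normally distributed data, a known mean of 2.0, a textbook 95% interval for the mean); it is not the model or the data from the class.

```python
import random
import statistics


def coverage_rate(true_mean=2.0, sd=1.0, n=50, n_sims=2000, seed=0):
    """Share of simulated data sets whose 95% interval covers the known truth."""
    rng = random.Random(seed)
    z = 1.96  # normal critical value for a nominal 95% interval
    hits = 0
    for _ in range(n_sims):
        # Simulate one data set where we know the true mean.
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        estimate = statistics.mean(sample)
        std_err = statistics.stdev(sample) / n ** 0.5
        # Does the interval produced from this data set contain the truth?
        if estimate - z * std_err <= true_mean <= estimate + z * std_err:
            hits += 1
    return hits / n_sims


rate = coverage_rate()
print(f"Intervals covering the truth: {rate:.1%}")
```

If the procedure is doing what it is supposed to do, the printed share should be close to the nominal 95%; a share far below that would signal a problem with the model or the code.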

Of course, every first-year statistics undergraduate knows what a “parameter” is, and as far as jargon goes, “parameter” is a lot less threatening than some other terms. But it was enough to cause the lawyers in the room to give up on following her. The recovery period was longer than it might otherwise have been because this episode occurred early in the class, when the lawyers and experts were still getting a feel for each other. The lesson for us is, when communicating with the rest of the world, even the most seemingly innocuous words can make a difference. We have to recognize that we’ve become freaks.

Posted by James Greiner at 6:00 AM

March 1, 2006

Thoughts on SUTVA (Part II)

Alexis Diamond, guest blogger

In part I (yesterday), I introduced the subject of SUTVA (the stable unit treatment value assumption), an assumption associated with Rubin's causal model. Well, why have SUTVA in the first place? What work is it actually doing? What does it require? "The two most common ways in which SUTVA can be violated appear to occur when (a) there are versions of each treatment varying in effectiveness or (b) there exists interference between units" (Rubin 1990, p. 282).* But this two-step SUTVA shorthand is frequently implausible in the context of many important and interesting causal questions.

SUTVA allows for a precise definition of causal effects for each unit. When SUTVA obtains, the inference under investigation relates to the difference between what would have been observed in a world in which units received the treatment and what would have been observed in a world in which treatment did not exist. SUTVA makes the inference, the causal question under investigation, crystal clear.

But SUTVA is not necessary to perform inference in the context of Rubin's causal model--what is necessary is to precisely define causal effects of interest in terms of potential outcomes and to adhere to the principle that for every set of allowable treatment allocations across units, there is a corresponding set of fixed (non-stochastic) potential outcomes that would be observed. In my peacekeeping analysis, I define units as country-episodes; each unit is an episode during which a country experienced civil war and was either treated/not-treated by a UN peacekeeping mission.

I define my causal effects precisely: I am interested in causal effects for treated units, and I define the causal effect for each treated unit as the difference between the observed outcome and what would have been observed had that unit's treatment been turned-off and peacekeeping had not occurred. There are many other potential outcomes one could contemplate and utilize to make other causal inferences; these others are beyond the scope of my investigation. I don't need SUTVA or other exclusion restrictions to exclude them. I exclude them in the way I pose my causal question.

I am not claiming that all peacekeeping missions are exactly the same—that would be silly. I also do not claim non-interference across units—after all, how could this be true, or even approximately true? History matters. Peacekeeping missions affect subsequent facts on the ground within and across countries. So SUTVA is going to be violated. But what allows me to proceed with an attempt at analysis is that my causal question is, nevertheless, well-defined. Clearly, I mean only one thing when referring to the "estimated effect of peacekeeping": the difference between the observed outcome for each and every treated unit and what would have been observed for each unit under the control regime of no-peacekeeping. I define the average effect for the treated (ATT), my ultimate estimand of interest, to be the average of these estimated unit-level effects.

Three caveats apply: (1) I am not claiming this ATT represents what it does under SUTVA, namely the average difference in potential outcomes that would have been observed given all selected units experiencing treatment vs. all experiencing control; (2) I must assume there is only one version of the control intervention; (3) estimation will require additional assumptions, and if estimating treatment effects under exogeneity (e.g., via matching), one must still make the case for ignorable assignment. This last caveat is very different from, and subsequent to, the others, in the sense that estimation and analysis via matching (or any other method) only makes sense if the first two caveats obtain and the causal question is well-defined.

As social science moves increasingly toward adoption of the Rubin causal model, I predict that political scientists (and social scientists more generally) will frame their SUTVA-like assumptions and inferential questions in this way. I think this is consistent with what Gary King and his coauthors were doing in Epstein et al. (2005)**, when they asked about the effect of war on Supreme Court decision-making. They were not claiming that occurrences of treatment (war) had no effect on subsequent Supreme Court decisions; they were asking about what would have happened if each episode of treatment had been turned off, one at a time. And in many cases, this is the only kind of question there is any hope of answering—the only kind of question close enough to the data to allow for plausible inference. As long as these causal questions themselves are interesting, this general approach seems to me to be a coherent and sensible way forward.

*Rubin, Donald B. Formal Modes of Statistical Inference For Causal Effects. Journal of Statistical Planning and Inference. 25 (1990), 279-292.

** Epstein, Lee; Daniel E. Ho; Gary King; and Jeffrey A. Segal. The Supreme Court During Crisis: How War Affects only Non-War Cases, New York University Law Review, Vol. 80, No. 1 (April, 2005): 1-116.

Posted by James Greiner at 6:00 AM

February 28, 2006

Thoughts on SUTVA (Part I)

Alexis Diamond, guest blogger

I gave a talk on Wed, Feb 8 at the IQSS methods workshop where I described my efforts to estimate the effects of UN intervention and UN peacekeeping on peacebuilding success following civil war. One of my goals was to demonstrate how matching-based methods and the Rubin model of causal inference can be helpful for answering questions in political science, particularly in fields like comparative politics and international relations.

An important issue in this context relates to Rubin's SUTVA, the stable-unit-treatment-value assumption typically assumed whenever matching-based methods are performed. SUTVA requires that the potential outcome for any particular unit i following treatment t is stable, "in the sense that it would take the same value for all other treatment allocations such that unit i receives treatment t" (Rubin 1990, p. 282). This is a stronger form of a basic assumption at the heart of the Rubin causal model: that for every set of allowable treatment allocations across units, there is a corresponding set of fixed (non-stochastic) potential outcomes that would be observed.

Rubin (1990) goes on to say that "The two most common ways in which SUTVA can be violated appear to occur when (a) there are versions of each treatment varying in effectiveness or (b) there exists interference between units" (ibid., p. 282).* But how exactly do "versions" and "interference" cause violations, and what are the consequences? Don't these violations occur frequently in political science and the other social sciences? In my research agenda, for example, treatment is peacekeeping, and peacekeeping is going to vary in effectiveness from country to country. Moreover, it is ridiculous to suppose a country's potential outcomes are independent of what is happening (or has already happened) to its neighbors, especially in the context of war and political conflict involving refugees, cross-border skirmishes, etc... (although this kind of independence is typically claimed—at least implicitly—whenever regression-based approaches are used.)

Why do multiple versions of treatment pose SUTVA problems? Because SUTVA posits, for each unit and treatment, a single fixed potential outcome, not a distribution of potential outcomes. Thus, if there is a potential outcome for the weak version of treatment A and a different potential outcome for the strong version of treatment A, then one cannot speak of the potential outcome that would have been observed following treatment A: there are in fact two treatments. Note that a causal question framed in terms of a single type of treatment A (e.g., "What is the effect of treatment A-strong version?") does not present these problems. Similarly, as long as there is a single version of the control intervention, one could still coherently define causal effects for each unit in terms of the difference between (observed) potential outcomes under heterogeneous treatment interventions and (unobserved) potential outcomes under control. One might wonder if these causal effects are substantively interesting, and if and how they could be reliably estimated…these critically important issues are separate from and subsequent to the question of whether the inferential investigation is well-defined.

The problem posed by interference across units is very similar; if unit i's potential outcome under treatment A depends upon another unit j's assignment status, then there are really multiple (compound) treatments involving A for unit i, each of which involves a different assignment for unit j. Each of these multiple treatments is associated with a corresponding potential outcome. Note that this kind of interference across units does not necessarily present a problem for defining the effect of a single one of these compound treatment As. It just means that asking "What is the effect of treatment A?" makes no sense---it is not a well-posed causal question.

Because SUTVA is so frequently discussed in the context of matching-based methods, people often assume that the two are inextricably linked: that whatever SUTVA is useful for, it is useful only for matching-based analyses. A crucial point often missed is that SUTVA is useful for the discipline it imposes on study design. Prior to the choice of analytical methodology (e.g., regression, matching, etc.), SUTVA works to nail down the precise question under investigation.

Given these issues, can the peacekeeping question be addressed within Rubin's causal model? I return to this question in post II of this series.

*Rubin, Donald B. Formal Modes of Statistical Inference For Causal Effects. Journal of Statistical Planning and Inference. 25 (1990), 279-292.

Posted by James Greiner at 6:00 AM

February 27, 2006

Resources for Multiple Imputation

Jens Hainmueller

As applied researchers, we all know this situation all too well. Like the alcoholic standing in front of the bar that is just about to open, you just downloaded (or somehow compiled) a new dataset. You open your preferred statistical software and begin to investigate the data. And there again you are struck by lightning: Holy cow - I have missing data!! So what do you do about it? Listwise deletion as usual? In the back of your mind you recall your stats teacher saying that listwise deletion is unlikely to result in valid estimates but hitherto you have simply ignored these caveats. Don't be a fool, you can do better -- use multiple imputation (MI).

As is well known in the statistical literature on the missing data problem, MI is not the silver bullet for dealing with missing values. In some cases, better (primarily more efficient) estimates can be obtained using weighted estimation procedures or specialized numerical methods (EM, etc.). Yet, these methods are often complicated and problem specific and thus not for the faint-of-heart applied researcher. MI in contrast is relatively easy to implement and works well in most instances. Want to know how to MI? I suggest you take a look at, a website that brings together various resources regarding the method, software, and literature citations that will help you to add MI to your toolkit. A nice (non-technical) introduction is also provided on Joseph Schafer's multiple imputation FAQ page. Gary and co-authors have also written extensively on this subject, offering lots of practical advice for applied researchers. Last but not least, I recommend searching for "multiple imputation" on Andrew Gelman's blog; you will find many interesting entries on the topic. Good luck!
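To give a flavor of why MI is easy to use once the imputed datasets exist, here is a minimal Python sketch of the standard combining step (Rubin's rules): run the same analysis on each of the m completed datasets, then pool the m estimates and their variances. The function name and the example numbers are made up for illustration, and the imputation step itself is left to whatever software you use.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine results from m imputed datasets using Rubin's rules.

    estimates: one point estimate per completed dataset
    variances: the squared standard error of each of those estimates
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # spread across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var ** 0.5)   # pooled estimate and its standard error
```

The between-imputation term is what listwise deletion and single imputation ignore: it carries the extra uncertainty that comes from not knowing the missing values.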

Posted by Jens Hainmueller at 6:00 AM

February 26, 2006

Applied Statistics - Janet Rosenbaum

This week, the Applied Statistics Workshop will present a talk by Janet Rosenbaum, a Ph.D. candidate in the Program on Health Policy at Harvard. She majored in physics as an undergraduate at Harvard College and received an AM in statistics last year. Janet will present a talk entitled "Do virginity pledges cause virginity?: Estimating the efficacy of sexual abstinence pledges". She has a publication forthcoming in the American Journal of Public Health on related research. The presentation will be at noon on Wednesday, March 1 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The abstract of the paper follows on the jump:

Objectives: To determine the efficacy of virginity pledges in delaying sexual debut for sexually inexperienced adolescents in the National Longitudinal Study of Adolescent Health (Add Health).

Methods: Subjects were virgin respondents without wave 1 pledge who
reported their attitudes towards sexuality and birth control at wave 1
(n=3443). Nearest-neighbor matching within propensity score calipers
was used to match wave 2 virginity pledgers (n=291) with non-pledgers,
based on wave 1 attitudes, demographics, and religiosity. Treatment
effects due to treatment assignment were calculated.
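For readers unfamiliar with the matching step named in the Methods, the following Python sketch shows one simple variant: greedy one-to-one nearest-neighbor matching without replacement, discarding pairs whose propensity scores differ by more than a caliper. The scores, unit labels, caliper width, and the greedy ordering are all assumptions for illustration; the abstract does not say which variant the paper uses.

```python
def caliper_match(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    treated, controls: dicts mapping unit id -> estimated propensity score.
    A treated unit is matched to the closest unused control, but only if
    the two scores differ by no more than `caliper`.
    """
    available = dict(controls)
    matches = {}
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            matches[t_id] = c_id
            del available[c_id]   # match without replacement
    return matches


# Hypothetical propensity scores.
pledgers = {"t1": 0.30, "t2": 0.62, "t3": 0.90}
non_pledgers = {"c1": 0.28, "c2": 0.33, "c3": 0.60, "c4": 0.70}

matched = caliper_match(pledgers, non_pledgers)
print(matched)
```

In this toy example the third pledger goes unmatched because no non-pledger falls inside the caliper, which is the point of the caliper: poor matches are dropped rather than forced.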

Results (Preliminary): 17% of virginity pledgers are compliant with their pledge, and do not
recant at wave 3 their earlier report of having taken a pledge. Similar
proportions of virginity pledgers and non-pledgers report having had
pre-marital sex (54% and 61%, p=0.16) and test positive for chlamydia
(2.7% and 2.9%, p=0.89).

Conclusions: Five years after taking a virginity pledge, most virginity
pledgers fail to report having pledged. Virginity pledges do not affect
the incidence of self-reported pre-marital sex or assay-determined

Posted by Mike Kellermann at 4:20 PM

February 24, 2006

Unobservable Quantities in Competing Risks

Felix Elwert

As I remarked in an earlier entry, some researchers are troubled by the potential outcomes framework of causality because it makes explicit reference to unobservable quantities. The implication, of course, is that science should stick to what’s observable.

This position strikes me as needlessly restrictive. In any case, unobservable quantities are by no means exclusive to the potential outcomes framework of causal inference.

I hasten to add, of course, that I’m a stranger to the philosophical discourse on the issue. Interestingly, A.P. Dawid has advanced the argument that many results from the potential outcomes framework of causality can be obtained without reference to unobservable quantities by sticking to conditional probabilities. Doing that, however, the math gets quite a bit uglier than in the standard potential-outcomes way of presenting these results. Not coincidentally, I suppose, this is why some statisticians like Jamie Robins stress the pedagogic and heuristic value of thinking in potential outcomes, which appears to be uncontested even among those with philosophical objections to causal inference.

Heuristics aside, I’m a bit at a loss over the steadfast opposition to dealing with unobservable quantities in certain quarters. Didn’t we ditch the insistence on (and belief in) direct observation with the Wiener Kreis? And don’t references to unobservable quantities suffuse the way we think? Take, for example, the irrealis, or hypothetical subjunctive mood in English (If my wife were queen of Thebes…). Or, even more glaringly, the Konjunktiv II mood in German. Is the notion of potential outcomes really such a stretch?

Interestingly, unobservable quantities also pop up in other areas of statistics, not just in causal inference. Competing risk analysis, a branch of survival analysis, has been dealing in unobservables more or less since its inception in the 1960s. Within the first two or three pages of any treatment of competing risk analysis, the authors will discuss the interpretation of risk specific failure times, hazards, and survival functions. The most popular interpretation of risk specific survival times is “the time a case would fail due to this risk, if it hadn’t failed due to some other risk before.” An unobservable eventuality if I’ve ever seen one.

This is not to say that everybody is happy with this interpretation. Kalbfleisch and Prentice (2002), for example, in what’s easily the most authoritative text on survival analysis, relegate this interpretation to a supplementary section because they want to “consider primarily statistical models for observable quantities only and avoid reference to hypothetical and unobserved times to failure” (p. 249). Too bad. But even they seem to consider the interpretation a helpful heuristic.

Posted by Felix Elwert at 6:00 AM

February 23, 2006

Making Votes Honest: Part I

Drew Thomas

First, apologies for my delay in posting to the blog. I've spent most of the last two months involved in the Canadian federal election as a candidate in my home riding. That I lost wasn't unexpected, nor was winning necessarily my goal. I wanted to talk about ideas that weren't being brought up by other candidates. First and foremost on the list was how an election shapes the debate - and why electoral reform is necessary to allow more ideas into the public forum.

While it's clear to me that, first and foremost, Canadians value our right to vote, how that valuation takes place depends directly on what a vote means. As in many party systems, there are two main interpretations for what a vote represents: a belief in the best candidate for the local job, and a belief in the best national party to lead the country. Quite often these two goals do not coincide.

In addition, "tactical" voting, in which a second-choice candidate is chosen merely to block a (much) less desirable candidate, reflects neither of these qualifications.

These problems, among others, anchor my belief that electoral reform is a must for Canada, as well as any multiparty democracy using single member districts and First Past the Post. But band-aid solutions, like the addition of proportionally allocated at-large seats to a FPTP single-member district scheme, would do little to explore the issue. The question before electoral reform revolves not around which of the two focuses - the candidate or the party - is most important to the voters, but rather whether the public can truly express their will through a system that encourages dishonest voting.

So here is my first quantitative question: How does one measure the "strategic effect" on vote counts alone? Survey data is commonly taken, but in comparison to the Ecological Inference problem, drawing this tactical inference from the data themselves would be a huge step towards determining how to reduce it - and what level we could consider acceptable.

Posted by Andrew C. Thomas at 6:00 AM

February 22, 2006

Experimental prudence in political science (Part II)

Mike Kellermann

As I posted the other day, experiments in political science have great potential, but they have some unique risks as well, particularly when the manipulation may change the output of some political process. What happens if your experiment is so successful (in the sense of having a large causal effect) that it changes the outcome of some election? How would you explain such an outcome when you report your results? "The manipulation produced an estimated increase in party A's support of 5000 votes, with a standard error of 250. (Party A's margin of victory was 2000 votes. Sorry about that.)" This seems like a good way to alienate the public once word got out, not to mention your colleagues working with observational data who now have another variable that they have to account for in their studies.

Having said that, I am just an observer in this field, and I'm sure that many people reading this blog have thought a lot more about these issues than I have. So, to continue the conversation, I'd like to propose the following questions:

At what point does an experimental manipulation become so significant that researchers have an obligation to inform subjects that they are, in fact, subjects?

Do researchers have an obligation to design experiments such that the net effect of any particular experimental manipulation on political outcomes is expected to be zero?

Would it be appropriate for a researcher to work consistently with one party on a series of experiments designed to determine what manipulations increase the probability that the party will win elections? Do the researcher's personal preferences matter in this regard?

To what extent are concerns mitigated by the fact that, in general, political actors could conduct these experiments on their own initiative? What if those actors agree to fund the research themselves, as was the case in the 2002 Michigan experiments?

If a university were to fund experimental research that was likely to promote one political outcome over another, would it risk losing its tax-exempt status? This one is for our resident lawyer....

Posted by Mike Kellermann at 6:00 AM

February 21, 2006

IQ and Risk-taking

A recent study by Shane Frederick at MIT, published in the Journal of Economic Perspectives [pdf], has gotten press attention in the last few weeks for its claim that performance on a simple math test predicted risk-taking behavior. I'm a bit skeptical about the conclusions Frederick draws (and I'll explain why), but regardless, the study itself is quite interesting.

The study begins by asking subjects to take the Cognitive Reflection Test (CRT), which consists of three simple math questions:

1. A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
2. If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?
3. In a lake, there is a patch of lily pads. Every day the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half the lake?

Then subjects are asked two other types of questions:

(a) Would you rather have $3400 now or $3800 in two weeks?
(b) Would you rather have a guaranteed $1000, or a 90% chance of $5000?

Questions of type (a) provide some measure of your "time preference" - how patient you are when it comes to money matters - while questions of type (b) provide a measure of your degree of risk-taking; people who prefer the more certain but lower-expected-value item are more risk-averse than those who choose the opposite. Interestingly, Frederick found that subjects who scored well on the CRT also tended to be more "patient" on questions like (a) and more risk-taking on questions like (b). Much of the discussion in the paper is centered around why and to what extent cognitive abilities, as measured by the CRT, would have an impact on these two things.
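To make the two example questions concrete, here is a short Python sketch of the quantities involved: the expected value of each option in question (b), and the return implied by waiting in question (a). The helper function is my own illustration, not something from Frederick's paper.

```python
def expected_value(lottery):
    """Expected payoff of a lottery given as (probability, payoff) pairs."""
    return sum(prob * payoff for prob, payoff in lottery)


# Question (b): a guaranteed $1000 versus a 90% chance of $5000.
sure_thing = expected_value([(1.0, 1000)])
gamble = expected_value([(0.9, 5000), (0.1, 0)])

# Question (a): $3400 now versus $3800 in two weeks.
# Return earned over the two weeks by choosing to wait.
two_week_return = 3800 / 3400 - 1

print(sure_thing, gamble)
print(f"{two_week_return:.1%}")   # roughly 11.8% in two weeks
```

The gamble's expected value is 4.5 times the sure amount, so choosing the guaranteed $1000 reveals a good deal of risk aversion; likewise, taking the $3400 now means turning down a two-week return that no ordinary investment offers.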

It's fascinating work, except it seems to me that there's an alternative explanation for these results that has little to do with cognitive abilities. One strand of such an explanation (which Frederick mentions himself) is that, in addition to mathematical skills, the test measures the ability to overcome impulsive answers. Each of the questions has an "obvious" answer (10 cents, 100 minutes, 24 days) that is incorrect; high-scorers thus need to be able to inhibit the wrong answer as well as calculate the correct one; they tend to be more patient and methodical as well as better at math. It's easy to see how these abilities, not cognitive ability per se, might account for the differential performance on questions like (a).
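For the record, the correct CRT answers can be worked out in a few lines. The Python sketch below solves each question directly; the variable names are mine, and exact fractions are used for the first question only to avoid floating-point noise.

```python
from fractions import Fraction

# Question 1: ball + bat = 1.10 and bat = ball + 1.00, so 2 * ball = 0.10.
total = Fraction(110, 100)
difference = Fraction(100, 100)
ball = (total - difference) / 2
bat = ball + difference

# The "obvious" answer of 10 cents fails: the bat would then cost only
# 90 cents more than the ball.
obvious_ball = Fraction(10, 100)
obvious_gap = (total - obvious_ball) - obvious_ball

# Question 2: 5 machines * 5 minutes / 5 widgets = 5 machine-minutes per widget,
# so 100 widgets spread over 100 machines still take 5 minutes.
machine_minutes_per_widget = Fraction(5 * 5, 5)
minutes_for_100 = machine_minutes_per_widget * 100 / 100

# Question 3: the patch doubles daily, so it covers half the lake
# exactly one day before it covers all of it.
days_to_half = 48 - 1

print(float(ball), float(minutes_for_100), days_to_half)
```

Each correct answer (5 cents, 5 minutes, 47 days) requires setting aside the first number that comes to mind, which is the "reflection" the test is named for.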

The deeper problem is that the study failed to control for socioeconomic differences between subjects. The high-performing subjects were taken from universities like Harvard, MIT, and Princeton; the lower-performing subjects were taken from universities like the University of Michigan and Bowling Green. People at the latter universities are likely to be in a far more precarious financial situation than those at the former. Why does this matter? One of the principal findings of Kahneman & Tversky's prospect theory is that as you have less money, you become more risk averse. Thus it seems entirely possible to me that the difference between subjects was because of differences in their financial situation, and had nothing to do with cognitive abilities at all (except possibly indirectly, as mediated through socioeconomic factors). I'd be interested in seeing if this finding still holds up even when SES is controlled for.

Posted by Amy Perfors at 6:00 AM

February 20, 2006

Applied Statistics - Rustam Ibragimov

This week, the Applied Statistics Workshop will present a talk by Rustam Ibragimov of the Harvard Department of Economics. Professor Ibragimov received a Ph.D. in mathematics from the Institute of Mathematics of the Uzbek Academy of Sciences in 1996 and a Ph.D. in economics from Yale University in 2005 before joining the Harvard faculty at the beginning of this academic year. Professor Ibragimov will present a talk entitled "A tale of two tails: peakedness properties in inheritance models of evolutionary theory." The presentation will be at noon on Wednesday, February 22 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The abstract of the paper follows on the jump:

In this paper, we study transmission of traits through generations in multifactorial inheritance models with sex- and time-dependent heritability. We further analyze the implications of these models under heavy-tailedness of traits' distributions. Among other results, we show that in the case of a trait (for instance, a medical or behavioral disorder or a phenotype with significant heritability affecting human capital in an economy) with not very thick-tailed initial density, the trait distribution becomes increasingly more peaked, that is, increasingly more concentrated and unequally spread, with time. But these patterns are reversed for traits with sufficiently heavy-tailed initial distributions (e.g., a medical or behavioral disorder for which there is no strongly expressed risk group or a relatively equally distributed ability with significant genetic influence). Such traits' distributions become less peaked over time and increasingly more spread in the population.

In addition, we study the intergenerational transmission of the sex ratio in models of threshold (e.g., polygenic or temperature-dependent) sex determination with long-tailed sex-determining traits. Among other results, we show that if the distribution of the sex determining trait is not very thick-tailed, then several properties of these models are the same as in the case of log-concave densities analyzed by Karlin (1984, 1992). In particular, the excess of males (females) among parents leads to the same pattern for the population of the offspring. Thus, the excess of one sex over the other one accumulates with time and the sex ratio in the total alive population cannot stabilize at the balanced sex ratio value of 1/2. We further show that the above properties are reversed for sufficiently heavy-tailed distributions of sex determining traits. In such settings, the sex ratio of the offspring oscillates around the balanced sex ratio value and an excess of males (females) in the initial period leads to an excess of females (males) offspring next period. Therefore, the sex ratio in the total living population can, in fact, stabilize at 1/2. Interestingly, these results are related, in particular, to the analysis of correlation between human sex ratios and socioeconomic status of parents as well as to the study of the variation of the sex ratio due to parental hormonal levels.

The proof of the results in the paper is based on the general results on majorization properties of heavy-tailed distributions obtained recently in Ibragimov (2004) and several of their extensions derived in this work.

Posted by Mike Kellermann at 12:54 PM

February 17, 2006

Do People Think like Stolper-Samuelson? Part III

Jens Hainmueller and Michael Hiscox

In two previous entries here and here we wrote about a recent paper that re-examines the available evidence for the prominent claim that public attitudes toward trade follow the Stolper-Samuelson theorem (SST). We presented evidence that is largely at odds with this hypothesis. In this posting, we take issue with the last specific finding in this literature that has been interpreted as strong support for the SST: the claim that the skill effect of trade preferences is proportional to a country’s factor endowment. What the heck does this mean?

Recall that, according to the SST, skilled individuals will gain in terms of real wages (and thus should be likely to favor trade openness) in countries that are abundantly endowed with skilled labor, but the size of those gains should be proportional to the degree of skill abundance in each country. Of course, in countries that are actually poorly endowed with skilled labor relative to potential trading partners, those gains should become losses.

The seminal paper on this topic, Rodrik and Mayda (2004), shows evidence supporting this idea that the skill effect (proxied by education) is proportional to a country’s factor endowment: they find the largest positive effects in the richest (i.e. most skill abundant) and smaller positive effects in the somewhat poorer (skill scarce) countries in their sample. For the only really poor country in their survey sample, the Philippines, they even find a (significant) negative effect (i.e. the more educated are less likely to support trade liberalization). This finding constitutes R&M's smoking gun evidence that preferences do indeed follow the SST - the finding very often cited in the literature.

The central problem with the R&M findings, which are mainly based on data from the International Social Survey Programme (ISSP), is the lack of skill scarce countries in their sample. Their data thus does not allow for a comprehensive test of the claim that the skill effect of trade preferences is proportional to a country’s factor endowment, simply because most countries in their sample are skill abundant, relatively rich economies. In the supplement to a recent paper we specifically reexamine the R&M claim, using data from the Global Attitudes Project survey administered by Pew in 2002. The Pew data has not been examined by scholars interested in attitudes toward trade, although it has some key advantages compared to the other datasets that have been used (ISSP, etc.). Most importantly, it covers a much broader range of economies that are very heterogeneous in terms of their levels of skill endowments. The Pew data covers not only the Philippines but also 43 additional countries, many of which are skill scarce.

This figure summarizes our results from the Pew data. It plots the estimated marginal effect of an additional year of schooling on the probability of favouring free trade (evaluated at the sample means, using country-specific ordered probit models) against skill endowment as measured by the log of GDP per capita in 2002 (PPP). The solid diamonds denote the point estimates and the dashed lines show the 90% confidence envelopes.
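For readers who want to see what the quantity on the vertical axis looks like in practice, here is a minimal Python sketch of a marginal effect evaluated at sample means. It is not our estimation code. It assumes, for illustration only, that "favouring free trade" corresponds to the highest category of an ordered probit, and every number in it (the coefficient, the cutpoint, the mean years of schooling) is hypothetical rather than taken from the Pew data.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_top_category(xb, top_cutpoint):
    """P(y in highest category) = 1 - Phi(cutpoint - x'beta)."""
    return 1.0 - norm_cdf(top_cutpoint - xb)


def marginal_effect_top(xb, top_cutpoint, beta_j):
    """Derivative of P(highest category) with respect to covariate j."""
    return norm_pdf(top_cutpoint - xb) * beta_j


# HYPOTHETICAL values for one country, chosen only to make the sketch run.
beta_schooling = 0.05   # ordered probit coefficient on years of schooling
other_terms = 0.2       # contribution of all other covariates at their means
mean_schooling = 10.0   # sample mean of years of schooling
top_cut = 1.0           # highest estimated cutpoint

xb = other_terms + beta_schooling * mean_schooling
me = marginal_effect_top(xb, top_cut, beta_schooling)
print("marginal effect of one more year of schooling:", me)
```

Repeating this calculation country by country, each with its own estimated coefficients, would give one point per country of the kind plotted in the figure.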

Two main findings emerge here: First, there is no clear relationship between the marginal effect of education on support for trade among respondents and their countries’ skill endowments. The pattern resembles a painting by abstract expressionist Jackson Pollock more than a clear upward-sloping line (which is what one would predict from a simple application of Stolper-Samuelson). Second, in all countries increased schooling has either a positive or zero effect on the probability of supporting free trade. This includes the Philippines, which is the only case of a country abundant in low skills for which Mayda and Rodrik found a negative relationship. Moreover, most of the point estimates themselves are positive; the exceptions are Canada, Ivory Coast, Mali, and Nigeria, which is not quite a cluster of countries with common skill endowments!

Overall these results strongly suggest that the impact of education levels on support for trade among individuals is not driven by differences in skill endowments across countries (and individual concerns about wage levels) as suggested by a simple application of the Stolper-Samuelson theorem.

Posted by Jens Hainmueller at 6:00 AM

February 16, 2006

Two Objections to the Potential Outcomes Framework of Causality

Felix Elwert

Agreement with the Potential Outcomes Framework of Causality (counterfactual approach, Rubin model) is spreading like wildfire, but is still far from unanimous. Over the past few years I’ve had several conversations with friends in sociology, economics, statistics, and epidemiology who expressed considerable unease with the notion of potential outcomes, or even causality itself.

Two problems keep coming up.

The first is more of a public relations issue than an intellectual problem: Counterfactualists – I at any rate – apparently come on a bit strong at times. I’ve heard the term “counterfascism” (and left the room). I am told that this has to do with offering a simple operational definition for a notion – causality – that has defied a concise discourse for a few centuries too many. How can humble statistics propose a cure where respectable philosophy rails in confusion?

The second, more serious, issue relates to how far we want to go in dealing with the unobservable. The potential outcomes framework clearly and avowedly locates causal effects in the difference between potential outcomes, at least one of which remains unobservable (the “counterfactual” outcome). Direct observation of causal effects thus is impossible, although estimation is possible under certain well-defined circumstances. The exchange between A.P. Dawid (“Causal Inference without Counterfactuals”), Don Rubin, Jamie Robins, Judea Pearl, and others in JASA 1999 considers the problem at its most sophisticated. My conversations, shall we say, rarely reach such heights. But it’s eminently clear that many researchers are troubled to various degrees by admitting unobservable quantities into “science.” Positions here range from moderate empiricism to Vienna style positivism: “you either observe directly or you lie.”
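To make the distinction between observing and estimating concrete, here is a small Python sketch under assumptions of my own choosing. The outcome distributions and the average effect of 2.0 are invented, and random assignment stands in for the "well-defined circumstances" mentioned above. Inside the simulation we can see both potential outcomes for every unit; the hypothetical analyst sees only one.

```python
import random
import statistics


def simulate(n=20_000, seed=1):
    """Return (true average effect, difference-in-means estimate)."""
    rng = random.Random(seed)

    # Both potential outcomes exist for every unit (HYPOTHETICAL distributions).
    y0 = [rng.gauss(0.0, 1.0) for _ in range(n)]      # outcome without treatment
    y1 = [a + rng.gauss(2.0, 0.5) for a in y0]        # outcome with treatment

    # The unit-level effects y1 - y0 are never observable in real data.
    true_ate = statistics.fmean(b - a for a, b in zip(y0, y1))

    # Random assignment reveals exactly one potential outcome per unit.
    treated = [rng.random() < 0.5 for _ in range(n)]
    observed = [b if t else a for a, b, t in zip(y0, y1, treated)]

    mean_treated = statistics.fmean(y for y, t in zip(observed, treated) if t)
    mean_control = statistics.fmean(y for y, t in zip(observed, treated) if not t)
    return true_ate, mean_treated - mean_control


true_ate, estimate = simulate()
print("true average effect (unobservable in practice):", true_ate)
print("difference in means from observed data:        ", estimate)
```

No individual effect is ever read off the observed data, yet under randomization the difference in observed means lands close to the average of those unseen effects.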

I’m in no place to offer solutions. But I do offer this complaint whenever the two issues are combined into a single charge--that counterfactualist potential outcomers are arrogant because they fancy themselves scientists when they deal in unobservable quantities. I’d say that the opposite is true: the potential outcomes framework of causality offers a cutting lesson in humility because it demonstrates the necessity of relying on unobservable (but not necessarily unestimable) quantities, not to mention strong prior theory, for a great many tasks dear to the scientific enterprise.

Posted by Felix Elwert at 6:00 AM

February 15, 2006

Simulated Goats?

Sebastian Bauhoff

In this week's Gov 2001 class, Gary was showing how to get around difficult statistical problems by simulation rather than using complex analytics. That got me thinking about the trade-offs between the two approaches.

One class example was the Monty Hall game that you can probably recite backwards in your sleep: a contestant is asked to choose between 3 doors, 1 of which has a car behind it. Once the choice is made, the game show host opens one of the remaining doors that only has a goat. The contestant is then offered the chance to switch from her initial choice to the remaining door, and the question is whether that's a good strategy.

One can solve this analytically by thinking hard about the problem. Alternatively one can simulate the conditional probabilities of getting the prize given switching or not switching, and use this to get the intuition for the result.

During the debate in class I was wondering whether simulations are really such a good thing. Sure, they solve the particular problem at hand and they may be the only way to handle very complex problems fast. But they don't contribute to solving even closely related problems, whereas one could glean insights from the analytic approach.

Maybe the simulation is still useful since writing code structures one's thoughts. But it also seems like it might depreciate critical skills. (Apart from the very real possibility that one makes a mistake in the code and tries to convince oneself of the wrong result.) Imagine you show up at Monty's show and they changed the game without telling you. Knowing how to implement a new simulation won't help if you can't actually run it. Having solid practice in the analytical approach might be more useful.

I don't want to suggest that simulations are generally evil, but maybe they come at a cost. Oh, and the answer is yes, switch.
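For the curious, the simulation route can be sketched in a few lines of Python. This is my own minimal version, not the code shown in class; the function names and the default of 100,000 rounds are arbitrary choices.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the original pick nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, n_rounds=100_000, seed=0):
    """Share of simulated rounds won under the given strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_rounds)) / n_rounds


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

The two win rates should come out near one third and two thirds. Changing the rules of the game means editing `play`, which is exactly the step you cannot do on stage.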

Posted by Sebastian Bauhoff at 6:00 AM

February 14, 2006

Is Military Spending Justified by Security Threats?

You, Jong-Sung

At the recent ASSA meeting in Boston, Linda Bilmes, a Kennedy School lecturer, and Joseph Stiglitz, Columbia professor and a Nobel prize-winning economist, presented an interesting paper, “The Economic Costs of the Iraq War.” They estimated that the total economic costs of the war, including direct costs and macroeconomic costs, lie between $1 and $2 trillion. Interestingly, the “$2 trillion” figure was already projected by William Nordhaus, Yale professor of economics, even before the war. In his paper, “The Economic Consequences of a War With Iraq” (2002), he predicted that the costs of the Iraq war would range from $99 billion, if the war is short and favorable, to $1,924 billion, if the war is protracted and unfavorable.

At the same ASSA meeting, Nordhaus raised important questions about excessive military spending in his paper entitled “The Problem of Excessive Military Spending in the United States.” I am providing some excerpts from the paper below.

Nordhaus notes, “The U.S. has approximately half of total national security spending for the entire world. Total outlays for ‘defense’ as defined by the Congressional Budget Office were $493 billion for FY2005, while the national accounts concept of national defense totaled around $590 billion for 2005. It constitutes about $5000 per family. By comparison, the Federal government current expenditures in 2004 were $14 billion for energy, $4.7 billion for recreation and culture, and $1.8 billion for transit and railroads.” The question is whether the US is earning a good return on its national-security ‘investment,’ for it is clearly an investment in peace and safety. The bottom line, he argues, is probably not.

Nordhaus asks whether it is plausible that the United States faces a variety and severity of objective security threats that are equal to the rest of the world put together. He then points to the following facts. “Unlike Israel, no serious country wishes to wipe the U.S. off the face of the earth. Unlike Russia, India, China, and much of Europe, no one has invaded the U.S. since the nineteenth century. We have common borders with two friendly democratic countries with which we have fought no wars for more than a century.”

He raises the issue of strategic and budgetary inertia. “Many costly programs are still in place a decade and a half after the end of the cold war. The U.S. has around 6000 deployed nuclear weapons, and Russia has around 4000 weapons. There can be little doubt that the world and the U.S. are more vulnerable rather than less vulnerable with such a large stock of weapons, yet they survive in the military budget. There is a kind of security Laffer curve in nuclear material, where more is less in the sense that the more nuclear material floating around the more difficult it is to control it and the more likely it is that it can be stolen.” He argues that today’s slow decline in spending on obsolete systems arises largely because there are such weak budgetary and virtually non-existent political pressures on military spending – the ‘loose budget constraints.’

He suggests that an excessive military budget is not just economic waste but also causes problems rather than solving them by tempting leaders to use an existing military capability. “Countries without military capability cannot easily undertake ‘wars of choice’ or wars whose purposes evolve, as in Iraq, from dismantling weapons of mass destruction to promoting democracy. To the extent that Vietnam and Iraq prove to be miscalculations and strategic blunders, the ability to conduct them is clearly a cost of having a large military budget.”

A final concern he raises is that the large national-security budget leads to loose budget constraints and poor control over spending and programs. “Congress exercises no visible oversight on defense spending and a substantial part is secret. Some of the abuses in recent military activities arise because Congress cannot possibly effectively oversee such a large operation where programs involving $24 billion are enacted as a single line item. Even worse, how can citizens or ordinary members of Congress understand the activities of an agency like the National Security Agency, whose spending level and justification are actually classified?”

Posted by Jong-sung You at 6:00 AM

February 13, 2006

Applied Statistics - Mike Kellermann

This week, I will be giving the talk at the Applied Statistics Workshop; as they say, turnabout is fair play. The talk is entitled "Estimating Ideal Points in the British House of Commons." I've blogged a bit about this project here. An abstract of the talk appears on the jump:

Estimating the policy preferences of individual legislators is important for many studies of legislative and partisan politics. Unfortunately, existing ideal point methods do not perform well when applied to legislatures characterized by strong party discipline and oppositional politics, such as the British House of Commons. This project develops a new approach for estimating the preferences of British legislators, using Early Day Motions as an alternative data source. Early Day Motions are petitions that allow MPs to express their opinions without being bound by party whips. Unlike voting data, however, EDMs do not allow legislators to express opposition to a particular policy. To deal with the differences between voting data and EDMs, I adapt existing Bayesian ideal point models to allow for the possibility (supported in the data) that some Members of Parliament are more likely to sign EDMs than others, regardless of policy content. The estimates obtained have much greater face validity than previous attempts to estimate ideal points in the House of Commons, and have the usual benefits associated with Bayesian ideal point models, including natural estimates of uncertainty and the ability to calculate auxiliary quantities of interest directly from the posterior distribution.

Posted by Mike Kellermann at 9:37 PM

Experimental prudence in political science (Part I)

Mike Kellermann

We've talked a fair bit on the blog about the use of experimental data to make causal inferences. While the inferential benefits of experimental research are clear, experiments raise prudential questions that we rarely face in observational research; they require "manipulation" in more than one sense of that word. As someone who is an interested observer of the experimental literature rather than an active participant, I wonder how well the institutional mechanisms for oversight have adapted to field experimentation in the social sciences in general (and political science in particular). In medical experiments, the ability in principle to obtain informed consent from subjects is critical in determining what is ethically acceptable, but this is often not possible in a political context; external validity may depend on concealing the experimental nature of the manipulation from the "subjects." Moreover, the effects of the manipulation may be large enough to change large-scale political outcomes, thus affecting individuals outside of the nominal pool of subjects.

As an example, consider the turnout experiments I discussed here and here. The large-scale phone experiments in Iowa and Michigan are typical in that they involve non-partisan GOTV (get out the vote) efforts. Treated voters are contacted by phone (or by mail, or in person) and urged to vote, while control voters are not contacted; neither group, as far as I can tell, knows that they are experimental subjects. Such a design is possible because the act of voting is a matter of public record, and thus the cooperation of the subjects is not required to obtain the relevant data.

While the effects of such manipulations may provide some insight for political scientists as to the causes of voter turnout, their practical significance is a bit hard to measure; there are not that many genuinely non-partisan groups out there with both the means and the motivation to conduct large-scale voter mobilization efforts. There have been some recent efforts to study partisan voter mobilization strategies using field experiments. David Nickerson, Ryan Friedrichs, and David King have a forthcoming article reporting on an experiment in the 2002 Michigan gubernatorial campaign, in which a youth organization of the Michigan Democratic Party agreed to randomize their partisan GOTV efforts aimed at voters believed to be Democrats or independents. The authors find positive effects for all three of the common GOTV manipulations (direct literature, phone calls, and face-to-face canvassing). In the abstract, obtaining data from manipulations that are clearly relevant in the real world is good for the discipline. I have no doubt that both party activists and party scholars would love to do more such research, but it all makes me slightly uncomfortable. As researchers, should we be in a position where we are (potentially) influencing political outcomes not only through arguments based on the evidence that we collect, but through the process of collecting evidence as well?

Posted by Mike Kellermann at 6:00 AM

February 10, 2006

What’s an Effect?

Felix Elwert

Though it hardly comports with my own views, there are plenty of people in the social sciences and economics that are troubled by the potential outcomes framework of causality. What intrigues me about this opposition is that most of those who object to the notion of causality appear comfortable with talk about regression “effects.”

If you object to talk about causality, what do you mean by “effect”?

By way of preemptive self-defense, this question isn’t about my inability to understand that regression coefficients provide a neat summary of the sample data in a purely descriptive sense (I do get that). But if the goal is getting descriptives, why call regression coefficients “effects”? Doesn’t “effect” imply agency? Sure, the predicted Y might increase by b units if we change X by one unit (agency! ha!) but then that’s really the analyst’s doing (we shift X by one unit) - and didn’t we want the analysis to speak to what’s happening in the world outside of that scatter plot print out?

Here’s the task: Can anybody provide an interpretation of the word “effect” that (a) doesn’t just refer to what the analyst can do with that scatter plot on the desk, and that (b) does not take recourse to a manipulability (counterfactualist or potential outcomes) account of causality?

What’s your preferred non-causal explanation for why one might call regression coefficients “effects”?

Posted by Felix Elwert at 6:00 AM

February 9, 2006

Implicit learning and race

Since Martin Luther King Day was somewhat recent (okay - a month ago; still...), I thought I'd blog about human statistical learning and its possible implications for racism. Some of this is a bit speculative (and I'm no sociologist) but it's a fascinating exploration of how cutting-edge research in cognitive science has implications for deep real-world problems.

In today's society racism is rarely so blatant as it was 50 or 100 years ago. More often it refers to subtle but ubiquitous inconsistencies in how minorities are treated (or, sometimes, perceive themselves to be treated). Different situations are probably different mixtures of the two. Racism might often be small effects that the person doing the treating might not even notice -- down to slight differences in body language and tone of voice -- that could nevertheless have large impacts on the outcome of a job interview or the likelihood of being suspected of a crime.

One of the things studying statistical learning teaches us is that almost everyone has subtly different, usually more negative, attitudes to minorities than to whites - even minorities themselves. Don't believe me? Check out the online Implicit Association Test, which measures the amount of subconscious connection you make between different races and concepts. The premise is simple and has been validated over and over in psychology: if two concepts are strongly linked in our minds, we will be faster to say so than if they are only weakly associated. For instance, you're faster to say that "nurse" and "female" are similar than "nurse" and "male", even though men can be nurses, too. I'm oversimplifying here, but in the IAT you essentially are called upon to link pictures of people of different races with descriptors like good/bad, dangerous/nice, etc. Horrifyingly, even knowing what the experiment measures, even taking it over and over again, most people are faster to link white faces with "good" words, black with bad.

Malcolm Gladwell's book "Blink" has an excellent chapter describing this, and it's worth quoting one of his paragraphs in detail: "The disturbing thing about this test is that it shows that our unconscious attitudes may be utterly incompatible with our stated values. As it turns out, for example, of the fifty thousand African Americans who have taken the Race IAT so far, about half of them, like me, have stronger associations with whites than with blacks. How could we not? We live in North America, where we are surrounded every day by cultural messages linking white with good." (85)

I think this is yet another example of where learning mechanisms that are usually helpful -- it makes sense to be sensitive to the statistical correlations in the environment, after all -- can go devastatingly awry in today's world. Because the media and gossip and stories are a very skewed reflection of "the real world", our perceptions formed by those sources (our culture, in other words) are also skewed.

What can we do? Two things, I think. #1: Constant vigilance! Our associations may be unconscious, but our actions aren't. If we know about our unconscious associations, we're more likely to watch ourselves vigilantly to make sure they don't come out in our actions; as enough people do that, slowly, the stereotypes and associations themselves may change. #2: This is the speculation part, but it may be possible to actually change our unconscious associations: not consciously or through sheer willpower, but by changing the input our brain receives. The best way to do that, I would guess, is to get to know people of the minority group in question. Suddenly your brain is receiving lots of very salient information about specific individuals with wholly different associations than the stereotypes: enough of this and your stereotype itself might change, or at least grow weaker. I would love to see this tested, or if someone has done so, what the results were.

Posted by Amy Perfors at 6:00 AM

February 8, 2006

New Authors' Committee Chair

I'd like to announce a change today in our Blog Authors' Committee Chair from Jim Greiner to Amy Perfors. Amy was a Stanford undergrad and is now a 3rd year graduate student at MIT. She is interested in using Bayesian technology to model how humans think, evolutionary linguistics, how humans learn, and a variety of other interesting topics. See her web site for lots more info. She has written some of our most interesting blog entries, and I especially recommend this great picture of her winning a line out in rugby!

Jim, the first chair of our Authors' Committee, led this group from a pretty good idea to, in my view and judging from our large and fast growing readership, an enormously successful and informative blog. He will continue on as a member of our Authors' Committee, but he's busy this semester running his innovative class in the Law School and Statistics Department, Quantitative Social Science, Law, Expert Witnesses, and Litigation.

Jim Greiner graduated with a B.A. in Government from the University of Virginia and received a J.D. from the University of Michigan Law School in 1995. He clerked for Judge Patrick Higginbotham on the U.S. Court of Appeals and was a practicing lawyer in the Justice Department and private practice before joining the Harvard Statistics Department.

Posted by Gary King at 6:00 AM

February 7, 2006

Do People Think like Stolper-Samuelson? Part II

Jens Hainmueller and Michael Hiscox

Last week, we introduced the question of whether the Stolper-Samuelson theorem, i.e., that more educated people favour trade because it will increase their factor returns, accurately reflects the way people think. We also introduced our recent paper on this subject, “Learning to Love Globalization: Education and Individual Attitudes Toward International Trade”, in which we examine the alternative theory that more educated respondents tend to be more exposed to economic ideas about the overall efficiency gains for the national economy associated with greater trade openness, and tend to be less prone to nationalist and anti-foreigner sentiments often linked with protectionism.

Which of the very different interpretations of the education-pro trade link is more correct? We re-examine the available survey data on individual attitudes toward trade, conducting a simple test of the effects of education on support for trade that distinguishes clearly between the Stolper-Samuelson interpretation of this relationship and alternative ideational and cultural accounts. We find that the impact of education on attitudes toward trade is almost identical among respondents currently in the active labor force and among those who are not (even those who are retired). That the effects of education on trade policy preferences are not mediated by whether individuals are actually being paid for the employment of their skills strongly suggests that it is not primarily a product of distributional concerns.

The analysis also reveals clear non-linearities in the relationship between education and trade preferences: while individuals who have been exposed to college or university education are far more likely to favor trade openness than those who have not, other types of educational attainment have no significant effects on attitudes and some even reduce the likelihood that individuals support trade even though they clearly contribute to skill acquisition. These findings indicate that the particular ideational and/or cultural effects associated with college education, and not the gradual accumulation of skills, are critical in shaping individual attitudes toward trade.

We conclude that the impact of education on how voters think about trade and globalization has more to do with exposure to economic ideas, and information about the aggregate and varied effects of these economic phenomena, than it does with individual calculations about how trade affects personal income or job security. This is not to say that the latter types of calculations are not important in shaping individuals’ views of trade – just that they are not manifest in the simple association between education and support for trade openness. As we discuss in the concluding section, we think it is likely that concerns about the effects of trade on personal income and job security might actually hinge on the particular impact of trade openness in specific industries. One of the key implications of our findings is that future empirical tests of the determinants of individual trade preferences need to be substantially refined to identify the impact of distributional concerns on attitudes towards trade and globalization and distinguish these from the impact of ideational and cultural factors.

Posted by James Greiner at 6:00 AM

February 6, 2006

Applied Statistics - Alexis Diamond

This week, the Applied Statistics Workshop will present a talk by Alexis Diamond, a Ph.D. candidate in Political Economy and Government. The talk is entitled "The Effect of UN Intervention after Civil War." An abstract of the talk appears on the jump:

A basic goal of political science is to understand the effects of political institutions on war and peace. Yet the impact of United Nations peacebuilding following civil war remains very much in doubt following King and Zeng (2006), which found that prior conclusions about these causal effects (Doyle and Sambanis 2000) had been based more on indefensible modeling assumptions than evidence. This paper revisits the Doyle and Sambanis causal questions and answers them using new matching-based methods that address issues raised by King and Zeng. The methods are validated for the Doyle and Sambanis data via their application to a dataset with similar features for which the correct answer is known. These new methods do not require assumptions that plagued prior work and are broadly applicable to important inferential problems in political science and beyond. When the methods are applied to the Doyle and Sambanis data, there is a preponderance of evidence to suggest that UN peacebuilding has a positive effect on peace and democracy in the aftermath of civil war.

Posted by Mike Kellermann at 11:41 AM

Another paradox of turnout? (Part II)

Mike Kellermann

Last week I highlighted a new article by Arceneaux, Gerber, and Green that suggests that matching methods have difficulty in replicating the experimentally estimated causal effect of a phone-based voter mobilization effort, given a relatively rich set of covariates and a large control pool from which to draw matches. Matching methods have been touted as producing experiment-like estimates from observational data, so this result is kind of disheartening. How might advocates of matching methods respond to this claim?

Let's assume that the results in the paper hold up to further scrutiny (someone should - and I have no doubt will - put this data through the wringer, although hopefully it won't suffer the fate of the NSW dataset). Why should turnout be problematic? Explaining voter turnout has presented quandaries and paradoxes in other branches of political science, so it is hardly surprising that it mucks up the works here. Turnout has been called "the paradox that ate rational choice," due to the great difficulty in finding a plausible model that can justify turnout on instrumental terms. To my mind, the most reasonable (and least interesting) rational choice models of turnout resort to the psychic benefits of voting or "civic duty" - the infamous "D" term - to account for the fairly solid empirical generalization that some people do, in fact, vote. What, exactly, the "D" term represents is something of a mystery, but it seems reasonable that people who feel a duty to go to the polls are also more likely to listen to a phone call urging them to vote, even conditional on things like age, gender, and voting behavior in the previous two elections.

The authors are somewhat pessimistic about the possibility of detecting such problems when researchers do not have an experimental estimate to benchmark their results (and, hence, when matching or some other technique is actually needed). They ask, "How does one know whether matched observations are balanced in terms of the unobserved causes of the dependent variable?" That is indeed the question, but I think that they may be a little too skeptical about the ability to ferret out such problems, especially in this particular context. If the matched data is truly balanced on both the observed and unobserved causes of the outcome, then there should be no difference in the expected value of some auxiliary variable (excluded from the matching process) that was observed before the treatment was applied, unless we want to start thinking in terms of reverse temporal causation. The authors could have dropped, say, turnout in 2000 from their matching procedure, matched on the other covariates, and then checked for a difference in 2000 turnout between the 2002 treatment and control groups. My guess is that they would find a pretty big difference. Of course, since these matches are not the same as those used in the analysis, any problems that result could be "fixed" by the inclusion of 2000 voter turnout in the matching procedure, but that is putting a lot of weight on one variable.
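To make the placebo check concrete, here is a minimal simulation sketch. It is not the authors' data or procedure: the `duty` variable, the three age groups, and every probability below are assumptions chosen purely for illustration. Voters are exact-matched on an observed covariate, and the held-out 2000 turnout is then compared between those who took the call and their matched controls.

```python
import random

def simulate_voters(n=20000, seed=0):
    """Simulate voters whose unobserved sense of duty drives both contact and past turnout."""
    rng = random.Random(seed)
    voters = []
    for _ in range(n):
        age_group = rng.randrange(3)                        # observed covariate used for matching
        duty = rng.random() < 0.5                           # unobserved confounder (assumed)
        contacted = rng.random() < (0.6 if duty else 0.2)   # "treatment": took the phone call
        p_vote_2000 = (0.8 if duty else 0.3) + 0.05 * age_group
        voted_2000 = rng.random() < p_vote_2000             # pre-treatment placebo outcome
        voters.append((age_group, contacted, voted_2000))
    return voters

def placebo_gap(voters):
    """Exact-match on age_group, then compare held-out 2000 turnout across groups."""
    total_treated = sum(1 for _, c, _ in voters if c)
    gap = 0.0
    for g in sorted({a for a, _, _ in voters}):
        treated = [v for a, c, v in voters if a == g and c]
        control = [v for a, c, v in voters if a == g and not c]
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        gap += diff * len(treated) / total_treated          # weight strata by treated share
    return gap

gap = placebo_gap(simulate_voters())
print(f"placebo gap in 2000 turnout: {gap:.3f}")
```

Because the call in 2002 cannot have caused turnout in 2000, a gap well away from zero signals that the matched groups still differ on something unobserved. In this toy setup the gap is large by construction.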

Even if the prospects for identifying bias due to unobserved covariates are better than Arceneaux, Gerber, and Green suggest, it is not at all apparent that we can do anything about it. In this case, if we knew what "duty" was, we might be able to find covariates that would allow us to satisfy the unconfoundedness constraint. On the other hand, it is not obvious how we would identify those variables from observational studies, since we would likely have similar problems with confoundedness. No one said this was supposed to be easy.

Posted by Mike Kellermann at 6:00 AM

February 3, 2006

To Your Health

Sebastian Bauhoff

A common excuse for wine lovers is that "a few glasses of wine are good for the heart". Well, maybe for warming your heart, but possibly not for preventing heart attacks.

A recent note in The Lancet (Vol 366, December 3, 2005, pages 1911-1912) suggests that earlier reports that light to moderate alcohol consumption can lower the risk of ischaemic heart disease were severely affected by confounders in non-randomized trials.

Some people believed that the early results were due to misclassification of former drinkers with cardiovascular disease ("CVD") as never-drinkers, which raised the CVD rate in the non-drinking group. Another possible story is that the studies didn't properly control for confounders -- apparently some risk factors for CVD are more prevalent among non-drinkers, and the non-randomized studies didn't control well enough for those. But as the note points out, confounding could bias results either in favor of or against a protective effect. For instance, heavy drinking might offer really good protection, but if heavy drinkers don't live healthy lives, the health benefits would be obscured.

But don't fear, the British Heart Foundation says that low to moderate alcohol consumption probably doesn't do your heart any harm. For protection against CVD you should really quit smoking, do sports, and eat a balanced diet. Not quite as appealing as a good glass of wine, of course.

In any case, food for thought and a great 2-page piece for your next causal inference class. Cheers to that.

Posted by Sebastian Bauhoff at 6:00 AM

February 2, 2006

Bayesian vs. frequentist in cogsci

Bayesian vs. frequentist - it's an old debate. The Bayesian approach views probabilities as degrees of belief in a proposition, while the frequentist says that a probability refers to a set of events, i.e., is derived from observed or imaginary frequency distributions. In order to avoid the well-trod ground comparing these two approaches in pure statistics, I'll consider instead how the debate changes when applied to cognitive science.

One of the main arguments made against using Bayesian probability in statistics is that it's ill-grounded and subjective. If probability is just "degree of belief", then even a question like "what is the probability of heads or tails" can change depending on who is asking the question and what their prior beliefs about coins are. Suddenly there is no "objective standard", and that's nerve-wracking. For this reason, most statistical tests in most disciplines rely on frequentist notions like confidence intervals rather than Bayesian notions like the relative probability of two hypotheses. However, there are drawbacks to doing this, even in non-cogsci areas. To begin with, many things we want to express statistical knowledge about don't make sense in terms of reference sets, e.g., the probability that it will rain tomorrow (since it will only rain once). For another, some argue that the seeming objectivity of the frequentist approach is illusory, since we can't ever be sure that our sampling process hasn't biased or distorted the data. At least with a Bayesian approach, we can explicitly deal with and/or try to correct that.

But it's in trying to model the mind that we can really see the power of Bayesian probability. Unlike for other social scientists, this sort of subjectivity isn't a problem for us: we cognitive scientists are interested in degree of belief. In a sense, we study subjectivity. In making models of human reasoning, then, an approach that incorporates subjectivity is a benefit, not a problem.

Furthermore, (unlike many statistical models) the brain generally doesn't just want to correctly capture the statistical properties of the world. Actually, its main goal is generalization -- prediction, not just estimation, in other words -- and one of the things people excel at is generalization based on very little data. Incorporating the Bayesian notion of prior beliefs, which act to constrain generalization in ways that go beyond the actual data, allows us to formally study this in ways that we couldn't if we just stuck to frequentist ideas of probability.
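As a small illustration of how priors constrain generalization from very little data, consider two learners who each see a coin land heads three times in three flips. The sketch below uses a Beta prior over the coin's bias; the specific priors (a flat Beta(1, 1) and a strong "coins are fair" Beta(50, 50)) are assumptions for illustration, not anything taken from the cognitive science literature.

```python
def posterior_mean(prior_heads, prior_tails, heads, flips):
    """Predicted probability of heads on the next flip under a Beta(prior_heads, prior_tails) prior."""
    return (prior_heads + heads) / (prior_heads + prior_tails + flips)

heads, flips = 3, 3

frequency_estimate = heads / flips                         # raw observed frequency
flat_prior = posterior_mean(1, 1, heads, flips)            # learner with no strong expectations
fair_coin_prior = posterior_mean(50, 50, heads, flips)     # learner who "knows" coins are fair

print(f"observed frequency:       {frequency_estimate:.3f}")
print(f"flat prior prediction:    {flat_prior:.3f}")
print(f"fair-coin prior forecast: {fair_coin_prior:.3f}")
```

The same three flips lead to very different predictions depending on what the learner believed beforehand, which is exactly the kind of subjectivity a model of human reasoning needs to represent.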

Posted by Amy Perfors at 6:00 AM

February 1, 2006

Do People Think like Stolper-Samuelson? Part I

Jens Hainmueller and Michael Hiscox

In the face of the fierce political disagreements over free trade taking place in the US and elsewhere, it's critical we try to understand how people think about trade policies. A growing body of scholarly research has examined survey data on attitudes toward trade among voters, focusing on individual determinants of protectionist sentiments. These studies have converged upon one central finding: fears about the distributional effects of trade openness among less-educated, blue-collar workers lie at the heart of much of the backlash against globalization in the United States and other advanced economies. Support for new trade restrictions is highest among respondents with the lowest levels of education (e.g., Scheve and Slaughter 2001a, 2001b; Mayda and Rodrik 2005; O’Rourke and Sinnott 2002). These findings are interpreted as strong support for the Stolper-Samuelson theorem, a classic economic treatment of the income effects of trade. It predicts that trade openness benefits those owning factors of production with which their economy is relatively well endowed (those with high skill levels in the advanced economies) while hurting others (low skilled and unskilled workers).

But is it really true that people think like Stolper-Samuelson (i.e. that more educated people favour trade because it will increase their factor returns)? The positive relationship between education and support for trade liberalization might also – and perhaps primarily – reflect the fact that more educated respondents tend to be more exposed to economic ideas about the overall efficiency gains for the national economy associated with greater trade openness, and tend to be less prone to nationalist and anti-foreigner sentiments often linked with protectionism. In our recent paper “Learning to Love Globalization: Education and Individual Attitudes Toward International Trade” we try to shed light on this issue. More on this in a subsequent post tomorrow.

Posted by Jens Hainmueller at 6:00 AM

January 31, 2006

Another paradox of turnout? (Part I)

Mike Kellermann

Those of you who have followed this blog know that making reasonable causal inferences from observational data usually presents a huge challenge. Using experimental data where we "know" the right answer, in the spirit of Lalonde (1986), provides one way for researchers to evaluate the performance of their estimators. Last month, Jens posed the question (here and here) "What did (and do we still) learn from the Lalonde dataset?" My own view is that we have beaten the NSW data to death, buried it, dug it back up, and whacked it around like a piñata. While I'm sure that others would disagree, I think that we would all like to see other experiment-based datasets with which to evaluate various methods.

In that light, it is worth mentioning "Comparing experimental and matching methods using a large-scale voter mobilization experiment" by Kevin Arceneaux, Alan Gerber, and Donald Green, which appears in the new issue of Political Analysis. Much in the spirit of Lalonde's original paper, they base their analysis on a voter turnout experiment in which households were randomly selected to receive non-partisan phone calls encouraging them to vote in the 2002 mid-term elections. This type of mobilization experiment suffers from a classic compliance problem; some voters either don't have phones or refuse to take unsolicited calls. As a result, in order to determine the average causal effect of the treatment on those who would receive it, they need to find a method to compare the compliers who received treatment to compliers in the control group. Since treatment was randomly assigned, they use assignment as an instrument in the spirit of Angrist, Imbens, and Rubin (1996). Using a 2SLS regression with assignment in the first stage, their estimates of the ATT are close to zero and statistically insignificant. While one might quibble with various choices (why not a Bayesian estimator instead of 2SLS?), it is not obvious that there is a problem with their experimental estimate, which in the spirit of this literature we might call the "truth".
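The logic of using random assignment as an instrument can be sketched with a toy simulation. This is not the authors' data or code; the `duty` confounder and all the probabilities are assumptions for illustration. With a single binary instrument and no covariates, the 2SLS estimate reduces to the Wald ratio: the difference in turnout by assignment divided by the share of assigned voters who were actually contacted.

```python
import random

def simulate_experiment(n=100000, seed=1):
    """Random assignment with one-sided noncompliance; the call has zero true effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        assigned = rng.random() < 0.5                          # randomized assignment (instrument)
        duty = rng.random() < 0.5                              # unobserved confounder (assumed)
        would_answer = rng.random() < (0.6 if duty else 0.2)   # compliance depends on duty
        contacted = assigned and would_answer                  # controls are never contacted
        voted = rng.random() < (0.7 if duty else 0.3)          # turnout ignores the call entirely
        rows.append((assigned, contacted, voted))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def wald_estimate(rows):
    """Intent-to-treat effect scaled by the contact rate among the assigned."""
    itt = mean(v for z, _, v in rows if z) - mean(v for z, _, v in rows if not z)
    contact_rate = mean(d for z, d, _ in rows if z)
    return itt / contact_rate

def naive_estimate(rows):
    """Compare contacted voters with everyone who was not contacted."""
    return mean(v for _, d, v in rows if d) - mean(v for _, d, v in rows if not d)

rows = simulate_experiment()
wald = wald_estimate(rows)
naive = naive_estimate(rows)
print(f"Wald (IV) estimate: {wald:.3f}")
print(f"naive comparison:   {naive:.3f}")
```

In this setup the naive comparison picks up the difference in duty between people who answer the phone and people who don't, while the instrument-based estimate stays near the true effect of zero.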

The authors then attempt to replicate their experimental results using both OLS and various matching techniques. In this context, the goal of the matching process is to pick out people who would have listened to the phone call had they been contacted. The authors have a set of covariates on which to match, including age, gender, household size, geographic location, whether the voter was newly registered, and whether the voter turned out in each of the two previous elections. Because the control sample that they have to draw from is very large (almost two million voters), they don't have much difficulty in finding close matches for the treated group based on the covariates in their data. Unfortunately, the matching estimates don't turn out to be very close to the experimental baseline, and in fact are much closer to the plain-vanilla OLS estimates. Their conclusion from this result is that the assumptions necessary for causal inferences under matching (namely, unconfoundedness conditional on the covariates) are not met in this situation, and (at least by my reading) they seem to suggest that it would be difficult to find a dataset that was rich enough in covariates that the assumption would be met.

As a political scientist, I have to say that I like this dataset, because (a) it is not the NSW dataset and (b) it is not derived from a labor market experiment. What do these results mean for matching methods in political science? I'll have some thoughts on that tomorrow.

Posted by Mike Kellermann at 6:00 AM

January 30, 2006

Applied Statistics - Jim Greiner

This week, the Applied Statistics Workshop resumes for the spring term with a talk by Jim Greiner, a Ph.D. candidate in the Statistics Department. The talk is entitled "Ecological Inference in Larger Tables: Bounds, Correlations, Individual-Level Stories, and a More Flexible Model," and is based on joint work with Kevin Quinn from the Government Department. Jim graduated with a B.A. in Government from the University of Virginia in 1991 and then received a J.D. from the University of Michigan Law School in 1995. He clerked for Judge Patrick Higginbotham on the U.S. Court of Appeals for the Fifth Circuit and was a practicing lawyer in the Justice Department and private practice before joining the Statistics Department here at Harvard. As chair of the Authors' Committee, he is a familiar figure to readers of this blog.

As a reminder, the Applied Statistics Workshop meets in Room N354 in the CGIS Knafel Building (next to the Design School) at 12:00 on Wednesdays during the academic term. Everyone is welcome, and lunch is provided. We hope to see you there!

Posted by Mike Kellermann at 12:30 PM

Ecological Inference in the Law, Part III

Jim Greiner

In two previous posts here and here, I discussed the ecological inference problem as it relates to the legal question of racially polarized voting in litigation under Section 2 of the Voting Rights Act. In the latter of these two posts, I suggested that this field needed greater research into the case of R x C, as opposed to 2 x 2, tables.

Here's another suggestion from the courtroom: we need an individual level story.

The fundamental problem of ecological inference is that we do not observe data at the individual level; instead, we observe row and column totals for a set of aggregate units (precincts, in the voting context). This fact has led to some debate about whether a model or a story or an explanation about individual level behavior is necessary to make ecological inferences reliable, or at least as reliable as they can be. On the one hand, Achen & Shively, in their book Cross-Level Inference, have argued that an individual level story is always necessary to assure the coherence of the aggregate model and to assess its implications. On the other hand, Gary King, in his book A Solution to the Ecological Inference Problem, has argued that because we never observe the process by which ecological data are aggregated from individual to group counts, we need not consider individual level processes, so long as the row counts (or percentages) are uncorrelated with model parameters.
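To see what the row and column totals alone can and cannot tell us, it helps to compute the deterministic bounds they imply for a single precinct (the classic method of bounds, usually credited to Duncan and Davis). The sketch below is only an illustration of that arithmetic, not of either book's model, and the precinct counts are made up.

```python
def group_vote_bounds(group_size, candidate_votes, total_voters):
    """Smallest and largest number of group members who could have voted for the candidate.

    Only the precinct margins are used: the group's size, the candidate's
    vote total, and the total number of voters.
    """
    others = total_voters - group_size
    lower = max(0, candidate_votes - others)    # even if every non-member backed the candidate
    upper = min(group_size, candidate_votes)    # even if no non-member backed the candidate
    return lower, upper

# Hypothetical precinct: 1000 voters, 600 in the group of interest, 500 votes for the candidate.
lower, upper = group_vote_bounds(group_size=600, candidate_votes=500, total_voters=1000)
print(f"group support is between {lower}/600 and {upper}/600")
```

The bounds are guaranteed to hold whatever individuals actually did, but they are often wide, which is why any sharper estimate has to rest on further assumptions of the kind the two books disagree about.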

From a social science point of view, this question is debatable. From a legal point of view, we need an individual level story, regardless of whether such a story produces better statistical results. When judges and litigators encounter statistical methods in a litigation setting, they need to understand (or, at least, to feel that they understand) something about those methods. They know they will not comprehend everything, or perhaps even most things, and they have no interest in the gritty details. But they will not credit an expert witness who says, in effect, "I ran some numbers. Trust me." What can quantitative expert witnesses offer in an ecological inference setting? The easiest and best thing is some kind of individual level story that leads to the ecological model being used.

Posted by James Greiner at 6:01 AM

January 27, 2006

Why does repeated lying work?

It's a common truism, familiar to most people by now thanks to advertising and politics, that repeating things makes them more believable -- regardless of whether they're true or not. In fact, even if they know at the time that the information is false, people will still be more likely to believe something the more they hear it. This phenomenon, sometimes called the reiteration effect, is well-studied and well-documented. Nevertheless, from a statistical learning point of view, it is extremely counter-intuitive: shouldn't halfway decent learners learn to discount information they know is false, not "learn" from it?

One of the explanations for the repetition effect is related to source confusion -- the fact that, after a long enough delay, people are generally much better at remembering what they learned rather than where they learned it. Since a great deal of knowing that something is false means knowing that its source is unreliable, forgetting the source often means forgetting that it's not true.

Repetition increases the impact of source confusion for two reasons. First, the more often you hear something, the more sources there are to remember, and the more likely you are to forget at least some of them. I've had this experience myself - trying to judge the truth of some tidbit of information, actually remembering that I first read it somewhere that I didn't trust, knowing that I've read it somewhere else (but not remembering the details) and concluding that since there was some chance that this somewhere else was trustworthy, it might be true.

The second reason is that the more sources there are the more unlikely it seems that all of them believe it if it's false. This strategy makes some evolutionary and statistical sense. Hearing (or experiencing) something from two independent sources (or two independent events) makes it more likely that you can generalize on them than if you only experienced it once. This idea is the basis of getting large sample sizes: as long as the samples are independent, more samples means more evidence. Unfortunately, in the mass media today few sources of information are independent. Most media outlets get things from AP wire services and most people get their information from the same media outlets, so even if you hear item X in 20 completely different contexts, chances are that all 20 of them stem from the same one or two original reports. If you've ever been the source of national press yourself, you will have experienced this firsthand.
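The difference between independent and recycled reports is easy to put in numbers. In the sketch below, each truly independent report is assumed to be twice as likely to appear if the claim is true as if it is false; that likelihood ratio and the 50/50 prior are arbitrary assumptions for illustration.

```python
def posterior_probability(prior, likelihood_ratio, independent_reports):
    """Posterior probability that a claim is true after some number of independent reports."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio ** independent_reports
    return posterior_odds / (1 + posterior_odds)

# Twenty contexts that really are independent sources.
twenty_independent = posterior_probability(prior=0.5, likelihood_ratio=2.0, independent_reports=20)

# Twenty contexts that all trace back to one wire report count as one source.
one_wire_story = posterior_probability(prior=0.5, likelihood_ratio=2.0, independent_reports=1)

print(f"20 independent sources:  {twenty_independent:.6f}")
print(f"20 copies of one report: {one_wire_story:.3f}")
```

A learner who treats the twenty copies as independent ends up nearly certain, when the evidence only justifies the much weaker second figure.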

I tried to think of a way to end this entry on a positive note, but I'm having a hard time here. It's a largely unconscious byproduct of how our implicit statistical learning mechanisms operate, so even being aware of this effect is only somewhat useful: we know consciously not to trust things simply because we've heard them often, but so much of this is unconscious it's hard to fight. Education about it is therefore worthwhile, but better still would be solutions encouraging a more heterogeneous media with more truly independent sources.

Posted by Amy Perfors at 6:00 AM

January 26, 2006

Stats Games

Jens Hainmueller

January is exam period at Harvard. Since exams are usually pretty boring, I sometimes get distracted from learning by online games. Recently, I found a game that may even be (partly) useful for exam preparation, at least for an intro stats class. Yes, here it is: a computer game about statistics, StatsGames. StatsGames is a collection of 20 games or challenges designed to playfully test and refine your statistical thinking. As the author of StatsGames, economist Gary Smith, admits: "These games are not a substitute for a statistics course, but they may give you an enjoyable opportunity to develop your statistical reasoning." The program is free and runs on all platforms. Although the graphical makeup somewhat reminds me of the days when games were still played on Atari computers, most of the games in the collection are really fun. My favorites include the Batting Practice (a game to teach students to use the binomial distribution to test the hypothesis that you are equally likely to swing late or early) and the Stroop effect (a game featuring a simple cognitive science type experiment which is then evaluated using the F-test). I also enjoyed the simulation of Galton's Apparatus. Go check it out! But don't waste too much exam preparation time of course - and good luck if you have any exams soon! I also wonder whether there are other computer games about statistics out there. Any ideas?
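For readers who want to check their Batting Practice results by hand, here is a minimal sketch of an exact two-sided binomial test of the hypothesis that late and early swings are equally likely. It is not taken from StatsGames; the function name and the example counts (14 late swings out of 20) are made up for illustration.

```python
from math import comb

def binomial_two_sided_p(successes, trials):
    """Exact two-sided p-value for H0: success probability is 0.5.

    Under p = 0.5 the distribution is symmetric, so the p-value is twice
    the probability of the smaller tail, capped at 1.
    """
    tail = min(successes, trials - successes)
    tail_count = sum(comb(trials, k) for k in range(tail + 1))
    return min(1.0, 2 * tail_count / 2 ** trials)

p_value = binomial_two_sided_p(successes=14, trials=20)
print(f"p-value for 14 late swings out of 20: {p_value:.3f}")
```

With these counts the p-value comes out at about 0.115, so 14 late swings in 20 would not be enough to reject the hypothesis at the usual 5% level.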

Posted by Jens Hainmueller at 6:00 AM

January 25, 2006

Is This a First?

Jim Greiner

This Spring, Harvard will be the site of something that has never been attempted before . . . I think. Matthew Stephenson of the Harvard Law School, Don Rubin of the Harvard Department of Statistics, and I will teach a seminar entitled Quantitative Social Science, Law, Expert Witnesses, and Litigation. The course will be offered jointly in the Law School and the Statistics Department and will, we hope, include students from both places as well as other Departments in the Graduate School of Arts & Sciences (Government, Sociology, Economics, etc.).

In the course, the quantitatively trained students will act as expert witnesses by analyzing datasets relating to a given fact scenario. The experts will draft expert reports and testify at depositions, which will be taken by the law students acting as (what else?) lawyers. The lawyers will then use the transcripts and expert reports to draft cross motions for summary judgment and responses to those motions. By the way: A very big thanks to New England Court Reporting Institute for agreeing to provide court reporters free of charge to assist the course!

Our hope is that by forcing law students and quantitatively trained students to communicate effectively under the pressure-cooker conditions of pre-trial litigation, we can teach them something about the critical process of communicating with one another generally. In my view, this communication process is underemphasized in both law schools and quantitative departments around the nation. For example, how often does the average law student have to communicate with a person with greater knowledge of another field (anything from construction to exporting fruit)? How often are students trained in quantitative fields required to explain methods and conclusions to those not so trained?

When I began putting together this course a year ago, I searched for analogs in academic websites around the country but found none. My question: are there other for-credit classes like this one out there? By "like this one" I mean courses in which quantitative and law students are in the same classroom, forced to work with each other effectively?

Either way, I'll be sharing some of the lessons learned from this effort throughout the upcoming semester.

Posted by James Greiner at 6:00 AM

January 24, 2006

Methods Classes in Spring 06

Sebastian Bauhoff

With the end of the Fall semester comes the happy time of shopping for (applied) quantitative methods courses for the Spring. Here's a partial list of currently planned offerings around Cambridge, with their descriptions.

Bio 503 Introduction to Programming and Statistical Modeling in R (Harezlak, Paciorek and Houseman)

An introduction to R in five 3-hour sessions combining demonstration, lecture, and laboratory components. It will be graded pass/fail on the basis of homework assignments. Taught in the Winter session at HSPH.

Gov 2001 Advanced Quantitative Research Methodology (King)

Introduces theories of inference underlying most statistical methods and how new approaches are developed. Examples include discrete choice, event counts, durations, missing data, ecological inference, time-series cross sectional analysis, compositional data, causal inference, and others. Main assignment is a research paper to be written alongside the class.

Econ 2120 Introduction to Applied Econometrics (Jorgenson)

Introduction to methods employed in applied econometrics, including linear regression, instrumental variables, panel data techniques, generalized method of moments, and maximum likelihood. Includes detailed discussion of papers in applied econometrics and computer exercises using standard econometric packages. Note: Enrollment limited to certain PhD candidates, check the website.

MIT 14.387 Topics in Applied Econometrics (Angrist and Chernozhukov)
Click here for 2004 website

Covers topics in econometrics and empirical modeling that are likely to be useful to applied researchers working on cross-section and panel data applications.
[It's not clear whether this class will be offered in Spring 06. Check the MIT class pages for updates.]

KSG API-208 Program Evaluation: Estimating Program Effectiveness with Empirical Analysis (Abadie)
Accessible from here (click on Spring Schedule)

Deals with a variety of evaluation designs (from random assignment to quasi-experimental evaluation methods) and teaches analysis of data from actual evaluations, such as the national Job Training Partnership Act Study. The course evaluates the strengths and weaknesses of alternative evaluation methods.

KSG PED-327 The Economic Analysis of Poverty in Poor Countries (Jensen)
Accessible from here (click on Spring Schedule)

Emphasizes modeling behavior, testing economic theories, and evaluating the success of policy. Topic areas include: conceptualizing and measuring poverty, inequality, and well-being; models of the household and intra-household allocation; risk, savings, credit, and insurance; gender and gender inequality; fertility; health and nutrition; and education and child labor.

Stat 221 Statistical Computing Methods (Goswami)

Advanced methods of fitting frequentist and Bayesian models. Generation of random numbers, Monte Carlo methods, optimization methods, numerical integration, and advanced Bayesian computational tools such as the Gibbs sampler, Metropolis Hastings, the method of auxiliary variables, marginal and conditional data augmentation, slice sampling, exact sampling, and reversible jump MCMC.

Stat 232 Incomplete Multivariate Data (Rubin)

Methods for handling incomplete data sets with general patterns of missing data, emphasizing the likelihood-based and Bayesian approaches. Focus on the application and theory of iterative maximization methods, iterative simulation methods, and multiple imputation.

Stat 245 Quantitative Social Science, Law, Expert Witnesses, and Litigation (Stephenson and Rubin)

Explores the relationship between quantitative methods and the law via simulation of litigation and a short joint (law student and quantitative student) research project. Cross-listed with Harvard Law School.

Stat 249 Generalized Linear Models (Izem)

Methods for analyzing categorical data. Visualizing categorical data, analysis of contingency tables, odds ratios, log-linear models, generalized linear models, logistic regression, and model diagnostics.

Posted by Sebastian Bauhoff at 6:00 AM

January 20, 2006

Questionnaire Design: The Weak Link?

Sebastian Bauhoff

In a 3-day conference at IQSS, Jon Krosnick is currently presenting chapters of a forthcoming 'Handbook of Questionnaire Design: Insights from Social and Cognitive Psychology'. Applied social scientists have put a lot of effort into improving research methods once the data is collected. However, some of the evidence that Krosnick discusses shows that those efforts may be frustrated: getting the data may be a rather weak link in the chain of research.

Anyone who has collected data themselves will know about these issues. The Handbook might be a good way to get a structured review and to facilitate more thorough thinking.

PS: The conference is this year's Eric M. Mindich 'Encounters with Authors' symposium. An outline is here.

Posted by Sebastian Bauhoff at 6:00 AM

January 19, 2006

Born to be Bayesian

Sebastian Bauhoff

The Economist recently featured an interesting article on forthcoming research by Griffiths and Tenenbaum on how the brain works ("Bayes Rules", January 7, 2006).

Their research reportedly analyses how the brain makes judgements by using prior distributions. Griffiths and Tenenbaum gave individuals a piece of information and asked them to draw general conclusions. Apparently the answers to most questions correspond well to a Bayesian approach to reasoning. People generally make accurate predictions, and pick the right probability distribution. And it seems that if you don't know the distribution, you can just run experiments and find out.

The interesting question of course is, where does the brain get this information from? Trial and error experience? Learning from your parents or others?

At any rate the results suggest what many readers of this blog already know: real humans are Bayesians. Tell a frequentist next time you meet one.

PS: Andrew Gelman also posted about this article on his blog. See here.

Posted by Sebastian Bauhoff at 4:12 AM

January 18, 2006

Social Science as Consulting

Mike Kellermann

Regular visitors to this blog have read (here, here, and here) about the recent field research conducted by Mike Hiscox and Nick Smyth of the Government Department on consumer demand for labor standards. After they described the difficulties that they faced in convincing retailers to participate in their experiments, several workshop participants remarked that the retailers should be paying them for the market research done on their behalf. Indeed, bringing rigorous experimental design to bear in such cases should be worth at least as much to corporations as the advice that they receive from consulting firms - and all we want is their data, not their money!

This discussion reminded me of an Applied Statistics talk last year given by Sendhil Mullainathan of the Harvard Economics Department on randomization in the field. He argues that there are many more opportunities for field experiments than we typically assume in the social sciences. One of the projects that he described during the talk was a field experiment in South Africa, in which a lender (unidentified for reasons that should become clear) agreed to several manipulations of its standard letter offering credit to relatively low-income, high-risk consumers. These manipulations included both economic (varying the interest rate offered) and psychological (altering the presentation of the interest rate through frames and cues of various sorts) treatments. Among the remarkable things about this experiment is the sheer number of subjects - over 50,000 individuals (all of whom had previously borrowed from the lender) received letters. It is hard to imagine a field experiment of this magnitude funded by an academic institution. Of course, the motives of the lender in this case had little to do with scientific progress; it hoped that the lessons learned from the experiment would help the bottom line. The results from the experiment suggest that relatively minor changes in presentation dramatically affected the take-up rate of loans. As one example, displaying a single example loan amount (instead of several possible amounts) increased demand by nine percent.

So, the question is why don't we do more of these kinds of experiments? One answer is obvious; social science is not consulting. The whole project of social science depends on our ability to share results with other researchers, something unlikely to please companies that would otherwise love to have the information. Unfortunately, in many cases, paying social scientists in data is probably more expensive than paying consultants in dollars.

Posted by Mike Kellermann at 12:44 AM

January 17, 2006

Network Analysis and Detection of Health Care Fraud

You, Jong-Sung

In my earlier entries on “Statistics and Detection of Corruption” and “Missing Women and Sex-Selective Abortion,” I demonstrated that examination of statistical anomaly can be a useful tool for detection of crime and corruption. In these cases, binomial probability distribution was a very useful tool.

Professor Malcolm Sparrow at the Kennedy School of Government shows how network analysis can be used to detect health care fraud in his book, License to Steal: How Fraud Bleeds America's Health Care System (2000). He gives an example of network analysis performed within Blue Cross/Blue Shield of Florida in 1993.

An analyst explored the network of patient-provider relationships with twenty-one months of Medicare data, treating a patient as linked to a provider if the patient had received services during the twenty-one-month period. The resulting patient-provider network had 188,403 links within it. The analyst then looked for unnaturally dense cliques within that structure. He found a massive one. “At its densest core, the cluster consisted of a specific set of 122 providers, linked to a specific set of 181 beneficiaries. The (symmetric) density criteria between these sets were as follows:
A. Any one of these 122 providers was linked with (i.e., had billed for services for) a minimum of 47 of these 181 patients.
B. Any one of these 181 patients was linked with (i.e., had been “serviced” by) a minimum of 47, and an average of about 80, of these providers.”
After the analyst found this unnaturally dense clique, field investigations confirmed a variety of illegal practices. “Some providers were indeed using the lists of patients for billing purposes without seeing the patients. Other patients were being paid cash to ride a bus from clinic to clinic and receive unnecessary tests, all of which were then billed to Medicare.”
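
To make the two density criteria concrete, here is a minimal sketch in Python that checks whether a candidate set of providers and patients satisfies them. It only verifies a given candidate; it does not search for dense clusters, which is the harder problem the analyst solved. The function name, the data layout (a list of provider-patient pairs), and the toy data are all assumptions made for illustration, not anything taken from Sparrow's book.

```python
from collections import defaultdict

def meets_density_criteria(links, providers, patients, min_links):
    """Check the symmetric criteria: every provider is linked to at least
    `min_links` of the patients, and every patient to at least `min_links`
    of the providers. `links` is an iterable of (provider, patient) pairs."""
    provider_set, patient_set = set(providers), set(patients)
    by_provider = defaultdict(set)
    by_patient = defaultdict(set)
    for provider, patient in links:
        # Only links inside the candidate cluster count toward its density.
        if provider in provider_set and patient in patient_set:
            by_provider[provider].add(patient)
            by_patient[patient].add(provider)
    providers_ok = all(len(by_provider[p]) >= min_links for p in provider_set)
    patients_ok = all(len(by_patient[b]) >= min_links for b in patient_set)
    return providers_ok and patients_ok

# Hypothetical toy network: P1 and P2 both billed for b1 and b2; P3 only for b1.
links = [("P1", "b1"), ("P1", "b2"), ("P2", "b1"), ("P2", "b2"), ("P3", "b1")]
dense = meets_density_criteria(links, ["P1", "P2"], ["b1", "b2"], min_links=2)
sparse = meets_density_criteria(links, ["P1", "P2", "P3"], ["b1", "b2"], min_links=2)
print(dense, sparse)
```

In the Florida example the threshold would be 47 on both sides, applied to the 122 providers and 181 beneficiaries.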

Professor Sparrow suggests that many ideas and concepts from network analysis can be useful in developing fraud-detection tools, in particular for monitoring organized and collusive multiparty frauds and conspiracies.

Posted by Jong-sung You at 2:36 AM

January 16, 2006

Against All Odds-Ratios

Jens Hainmueller

I've decided this blog needs more humor: Check out some Statistics Movies that Never Made it to the Big Screen. Or even better: What's the question the Cauchy distribution hates the most? "Got a moment?"

Posted by Jens Hainmueller at 1:25 AM

January 13, 2006

Ecological Inference in the Law, Part II

Jim Greiner

In a previous post, I introduced a definition of the ecological inference problem as applied to the legal difficulty of drawing inferences about racial voting patterns from precinct-level data on candidate support and racial makeup of the voting-age-population. As I mentioned in that post, very few lawyers and judges have ever contributed to the expansive literature on this question, despite the fact that ecological inference models are often used in high-profile courtroom cases.

Here's an initial contribution from the courtroom: forget about two by two tables.

The overwhelming majority of publications on the ecological inference problem concern methods for sets of two by two contingency tables. In the Voting Rights Act context, a two by two table problem might correspond to a jurisdiction in which almost every potential voter is African-American or Caucasian, and all we care about is who votes, not who the voters supported. In that case, the rows of each table are black and white, while the columns are vote and no-vote. For each precinct, we need only predict one internal cell count, and the others are determined.

This two by two case is of almost no interest in the law. The reason is that in jurisdictions in this country, the voters have three options in any electoral contest of interest: Democrat, Republican, and not voting. That means we have a minimum of three columns. In most jurisdictions of interest these days, we also have more than two rows. Hispanics constitute an increasingly important set of voters in the United States, and their voting patterns are rarely similar enough to those of African-Americans or Caucasians to allow an expert witness to combine Hispanics with one of these other groups.

Thus far, scant research exists into the R x C problem. Before a few years ago, one had two options: (i) run a set of C-1 linear models, a solution that often led to logically inconsistent predictions (such as 115 percent of Hispanic voters supported the Democrat), or (ii) pick a two by two model that includes information from the precinct-level bounds, and also available statistical information, and apply it in some way to the problem set of R x C tables at hand, perhaps by collapsing cell counts down to a two by two shape, perhaps by applying the two by two method repeatedly to draw inferences about the R x C problem at hand. Neither approach is very appealing.
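
To see how a linear model can produce a logically impossible answer, here is a small sketch in the spirit of what is usually called Goodman's ecological regression: regress each precinct's Democratic vote share on its Hispanic population share, then read off the fitted value at a share of 1. Whether this is exactly the kind of linear model referred to above is an assumption on my part, and the four precincts are invented for illustration.

```python
def ols_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical precinct-level data (fractions, not real election results).
hispanic_share = [0.10, 0.20, 0.30, 0.40]
democratic_share = [0.40, 0.50, 0.58, 0.68]

intercept, slope = ols_line(hispanic_share, democratic_share)
non_hispanic_support = intercept       # fitted value at a Hispanic share of 0
hispanic_support = intercept + slope   # fitted value at a Hispanic share of 1
print(f"non-Hispanic support: {non_hispanic_support:.2f}")
print(f"Hispanic support:     {hispanic_support:.2f}")
```

Nothing in the regression constrains the extrapolated value to lie between 0 and 1, so with these numbers the estimated Hispanic support for the Democrat comes out above 100 percent.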

A few years ago, Rosen et al. proposed a variant of a Dirichlet-Multinomial model, a serious improvement in this area. This model was and is a large step forward in the analysis of R x C ecological inference tables. Nevertheless, there is always room for improvement. The model does not respect the bounds deterministically, and it does not allow a great deal of flexibility in modeling intra-row and inter-row correlations. On the latter point, an example may clarify: Suppose we are analyzing a primary in which four candidates are running, two African-American and two Caucasian. Would we expect, among (say) black voters, for the vote counts or fractions (by precinct) for the two African-American candidates to be positively correlated?

I look forward to contributing to this research soon.

Posted by James Greiner at 5:57 AM

January 12, 2006

Language And Rule-learning

Amy Perfors

Two of the most enduring debates in cognitive science can be summarized baldly as the "rules vs statistics" debate and the "language: innate or not?" debate. (I think these simple dichotomies are not only too simple to capture the state of the field and current thought, but also actively harmful in some ways; nevertheless, they are still a good first approximation for blogging purposes). One of the talks at the BUCLD conference, by Gary Marcus at NYU, leapt squarely into both debates by examining simple rule-learning in seven-month old babies and arguing that the subjects could only do this type of learning when the input was linguistic.

Marcus built on some earlier studies of his (e.g., pdf here) in which he familiarized seven-month-old infants with a list of nonsense "words" like latala or gofifi. Importantly, all of the words heard by any one infant had the same structure, such as A-B-B ("gofifi") or A-B-A ("latala"). The infants heard two minutes of these types of words, and then were presented with a new set of words using different syllables, half of which followed the same pattern as before, half of which followed a new pattern. Marcus found that infants listened longer and paid more attention to the words with the unfamiliar structure, which they could have done only if they had successfully abstracted that structure (not just the statistical relationships between particular syllables). Thus, for instance, an infant who heard many examples of words like "gofifi" and "bupapa" would be more surprised to hear "wofewo" than "wofefe"; they have abstracted the underlying rule. (The question of how and to what extent they abstracted the rule is rather debated, and I'm not going to address it here).
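
For readers who want to see what "abstracting the structure" means operationally, here is a tiny sketch: it maps a sequence of syllables to its pattern of repetitions, ignoring which syllables are used. This is only an illustration of the A-B-B versus A-B-A distinction, not a model of what infants do, and the function name and the pre-split syllables are my own assumptions.

```python
def abstract_pattern(syllables):
    """Return the repetition pattern of a syllable sequence, e.g. A-B-B."""
    labels = {}
    pattern = []
    for syllable in syllables:
        if syllable not in labels:
            # The first new syllable is labelled A, the next new one B, and so on.
            labels[syllable] = chr(ord("A") + len(labels))
        pattern.append(labels[syllable])
    return "-".join(pattern)

familiar = abstract_pattern(["go", "fi", "fi"])      # training word "gofifi"
same_rule = abstract_pattern(["wo", "fe", "fe"])     # test word "wofefe"
new_rule = abstract_pattern(["wo", "fe", "wo"])      # test word "wofewo"
print(familiar, same_rule, new_rule)
```

The two test words share no syllables with the training word, so only the abstract pattern distinguishes them.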

The BUCLD talk focused instead on another question: did it matter at all that the stimuli they heard were linguistic rather than, say, tone sequences? To answer this question, Marcus did the same experiment with sequences of various kinds of different tones and timbres in the place of syllables (e.g. "blatt blatt honk" instead of "gofifi"). His finding? Infants did not respond differently in testing to the structures they had heard - that is, they didn't seem to be abstracting the underlying rule this time.

There is an unfortunately large confounding factor, however: infants have a great deal more practice and exposure to language than they do to random tones. Perhaps the failure was rather one of discrimination: they didn't actually perceive different tones to be that different, and therefore of course could not abstract the rule. To test this, Marcus trained infants on syllables but tested them on tones. His reasoning was that if it was a complete failure of discrimination, they shouldn't be able to perceive the pattern in tones when presented in testing any more than they could when presented in training. To his surprise, they did respond differently to the tones in testing, as long as they were trained on syllables. His conclusion? Not only can infants do cross-modal rule transfer, but they can only learn rules when they are presented linguistically, though they can then apply them to other domains. Marcus argued that this was probably due to an innate tendency in language, not a learnt effect.

It's fascinating work, though rather counterintuitive. And, quite honestly, I remain unconvinced (at least about the innate tendency part). Research on analogical mapping has shown that people who have a hard time perceiving underlying structure in one domain can nevertheless succeed in perceiving it if they learn about the same structure in another and map it over by analogy. (This is not news to good teachers!) It's entirely possible - and indeed a much simpler hypothesis - that babies trained on tones lack the experience they have with language and hence find it more difficult to pick up on the differences between the tones and therefore the structural rule they embody. But when first trained on language - which they do have plenty of practice hearing - they can learn the structure more easily; and then when hearing the tones, they know "what to listen for" and can thus pick out the structure there, too. It's still rule learning, and even still biased to be easier for linguistically presented things; but that bias is due to practice rather than some innate tendency.

Posted by Amy Perfors at 2:13 AM

January 11, 2006

Applying Spatial Statistics to Social Science Research

Drew Thomas

Spatial Statistical methodology is beginning to gain popularity as a methodological tool in the natural and social sciences. At Harvard, Prof. Rima Izem is leading the way towards the use of these techniques across many disciplines. This semester, Prof. Izem debuted her Spatial Statistics seminar, which met Wednesday afternoons in the Statistics Department.

Of those topics discussed in the seminar, lattice data analysis proves to be invaluable to the analysis of well-defined electoral districts. The principle of lattice data is that our land area can be divided into mutually exclusive, complete and contiguous divisions; interactions between the divisions can then be analyzed through various covariance methods.

A full understanding of spatial interaction may prove to be valuable to electoral analysis. Determining the interdependence of districts through means other than traditional covariates may suggest the presence of a true "neighbor effect." How one determines the covariance of districts may prove to be more art than science, but the depth of work yet to be done in this field should give many opportunities for meaningful investigation.

Posted by Andrew C. Thomas at 1:41 AM

January 10, 2006

Statistics and Detection of Corruption

You, Jong-Sung

Duggan and Levitt's (2002) article on "corruption in sumo wrestling" demonstrates how statistical analysis may be used to detect crime and corruption. Sumo wrestling is a national sport of Japan. A sumo tournament involves 66 wrestlers participating in 15 bouts each. A wrestler with a winning record rises up the official ranking, while a wrestler with a losing record falls in the rankings. An interesting feature of sumo wrestling is the existence of a sharp nonlinearity in the payoff function. There is a large gap in the payoffs for seven wins and eight wins. The critical eighth win garners a wrestler roughly four times the value of the typical victory.

Duggan and Levitt uncover striking evidence that match rigging takes place in the final days of sumo tournaments. They find that the wrestler who is on the margin for an eighth win is victorious with an unusually high frequency, but the next time those same two wrestlers face each other, it is the opponent who has a very high win percentage. This suggests that part of the currency used in match rigging is promise of throwing future matches in return for taking a fall today. They present a histogram of final wins for the 60,000 wrestler-tournament observations between 1989 and 2000, in which a wrestler completes exactly 15 matches. Approximately 26.0 percent of all wrestlers finish with eight wins, compared to only 12.2 percent with seven wins. The binomial distribution predicts that these two outcomes should occur with an equal frequency of 19.6 percent. The null hypothesis that the probability of seven and eight wins is equal can be rejected at resounding levels of statistical significance. They report that two former sumo wrestlers have made public the names of 29 wrestlers who they allege to be corrupt and 14 wrestlers who they claim refuse to rig matches. Interestingly, they find that wrestlers identified as "not corrupt" do no better in matches on the bubble than in typical matches, whereas those accused of being corrupt are extremely successful on the bubbles.
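
The 19.6 percent benchmark is easy to reproduce. The sketch below computes the binomial probabilities of seven and eight wins in 15 bouts, under the assumption (mine, for illustration) that every bout is an independent coin flip with a win probability of one half.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k wins in n independent bouts with win probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p7 = binom_pmf(7, 15)
p8 = binom_pmf(8, 15)
print(f"P(7 wins) = {p7:.3f}, P(8 wins) = {p8:.3f}")
```

With p = 0.5 the distribution is symmetric, so seven and eight wins are equally likely. That symmetry is what makes the observed gap between 12.2 and 26.0 percent so striking.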

A similar kind of empirical study of corruption dates to 1846, when Quetelet documented that the height distribution among French men based on measurements taken at conscription was normally distributed except for a puzzling shortage of men measuring 1.57–1.597 meters (roughly 5 feet 2 inches to 5 feet 3 inches) and an excess number of men below 1.57 meters. Not coincidentally, the minimum height for conscription into the Imperial army was 1.57 meters (as cited in Duggan and Levitt 2002). These examples show that detection of statistical anomaly can give compelling evidence of corruption.

Corruption in conscription has been a big political issue in South Korea. Examination of anomaly in the distributions of height, weight, eyesight at each physical examination site for conscription may provide evidence of cheating and/or corruption. This kind of statistical evidence will fall short of proving crime or corruption, but will make a sufficient case for thorough investigation.

Posted by Jong-sung You at 6:39 AM

January 9, 2006

Tools for Research (A Biased Review)

Sebastian Bauhoff and Jens Hainmueller

A perfect method for adding drama to life is to wait until a paper deadline looms large. So you're finding yourself on the eve of a deadline, "about to finish" for the last 4 hours, and not having formatted any of the tables yet? Still copying STATA tables into MS Word? Or just received the departmental brainwash regarding statistical software and research best practice? Here are some interesting tools you could use to make life easier and your research more effective. On the Big Picture level, which tools to use is as much a question of philosophy as of your needs: open-source or commercial package? At Harvard, students often use one of the two combos: MS Word and Stata (low tech) or LaTeX and R (high tech). What type are you?

If you're doing a lot of data-based research, need to type formulas and often change your tables, you might want to consider learning LaTeX. Basically, LaTeX is a highly versatile typesetting environment for producing technical and scientific documents with the highest standards of typesetting quality. It's free, and LaTeX implementations are available for all platforms (Linux, Mac, Windows, etc). Bibliographies are easily managed with BibTeX. And you can also produce cool slides using ppower4. At the Government Department, LaTeX is taught to all incoming graduate students and many of them hate it at the beginning (it's a bit tricky to learn), but after a while many of them become true LaTeX fetishists (in the metaphorical sense, of course).

Ever wondered why some papers look nicer than Word files? They're done in LaTeX. A drawback is that they all look the same, of course. But then, some say having your papers in LaTeX-look is a signal that you're part of the academic community...

LaTeX goes well with R, an open-source statistical package modeled on S. R is both a language and an environment for statistical computing. It's very powerful and flexible; some say the graphical capabilities are unparalleled. The nice thing is that R can output LaTeX tables which you can paste directly into your document. There are many ways to do this; one easy way is to use the "LaTeX" function in the design library. A mouse-click later, your paper shines in pdf format, all tables looking professional. As with LaTeX, many incoming graduate students at the Government Department suffer learning it, but eventually most of them never go back to their previous statistical software.

But you are actually looking for a more user-friendly modus vivendi? Don't feel like wasting your nights writing code and chasing bugs like a stats addict? Rather, you like canned functions and an easy-to-use working environment? Then consider the MS Word and STATA combo. Getting STATA output to look nice in Word is rather painful unless you use a little tool called outreg or alternatively estout (the latter also produces copy and paste-able LaTeX tables). Outreg is an ado-file that produces a table in Word format, and you can simply apply the normal formatting functions in Word. The problem is that outreg outputs only some of the tables that STATA produces, and so you're stuck having to format at least some. But of course there are many formatting tools available in Word.

So you make your choice depending on how user-friendly and/or flexible you like it. But whether you're using Word/STATA or LaTeX/R, one tool comes in handy anyway: WinEdt is shareware that can be used to write plain text, html, LaTeX etc. (WinEdt automatically comes with a LaTeX engine, so you won't need to install that.) The software can also serve as do-file editor for STATA and R. You can download configuration files that will highlight your commands in WinEdt, do auto-saves whenever you like (ever lost your STATA do-file??) and send your code to STATA or R just like the built-in editors would do. Alternatives are other powerful text editors like Emacs, etc.

Confused? Can't decide? Well, you're certainly not the only one. On the web, people fight fervent LaTeX vs Word wars (google it!). We (the authors) recommend using LaTeX and R. This is the way we work, because, as Gary likes to say, "if we knew a better way of working we would use it" -- is that what's called a tautology?! :-).

Posted by Sebastian Bauhoff at 2:44 AM

January 6, 2006

Ecological Inference in the Law, Part I

Jim Greiner

Alchemists' gold. The perpetual motion machine. One might also think of cold fusion and warm superconductors. These are some of the great mythical aims of the so-called "hard" sciences. A few of these concepts have also been compared to attempts at ecological inference, the search for accurate predictions about the internal cell counts of a set of contingency tables (such as one for each precinct) when only the row and column totals of each table are observed. The fundamental problem of ecological inference is, of course, that radically different internal cell counts can lead to identical row and column totals, and because we only get to see the row and column totals, we cannot distinguish among these different sets of counts. Another way of saying this is that the problem is impossible to solve deterministically (since the relationship between the cell entries and row and column marginals is not one-to-one), causing some to label ecological inference an "ill-posed inverse problem". In fact, without making some statistical assumptions, the estimation problem would not be identified, although it would be bounded because some values for the cell entries are ruled out for each precinct's contingency table by the observed column and row totals (these are called "the bounds").
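
For the simplest case, a two by two table, the bounds can be computed directly from the margins. The sketch below does this for a single cell; the function name, argument names, and the example precinct are assumptions chosen for illustration.

```python
def cell_bounds(row_total, other_row_total, col_total):
    """Bounds on the count in one cell of a 2 x 2 table, given its row total,
    the other row's total, and its column total."""
    # The other row can absorb at most other_row_total of the column total,
    # so anything beyond that must fall in this cell.
    lower = max(0, col_total - other_row_total)
    # The cell cannot exceed either of its own margins.
    upper = min(row_total, col_total)
    return lower, upper

# Hypothetical precinct: 300 black and 700 white voting-age residents, 800 votes cast.
lower, upper = cell_bounds(row_total=300, other_row_total=700, col_total=800)
print(lower, upper)
```

Here the number of black residents who voted must lie between 100 and 300. Every value in that range is consistent with the observed margins, which is the identification problem in miniature.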

Ecological inference arises in the legal setting in cases litigated under the Voting Rights Act. Section 2 of the VRA prohibits a state or municipality from depriving a citizen, on account of race or ethnicity, of an equal opportunity to participate in the political process and to elect candidates of his/her choice. The Delphic statute has been interpreted to disallow districting schemes that have the effect of diluting minority voting strength. In practice, to succeed in a vote dilution claim, a plaintiff must almost always prove that voting in the relevant jurisdiction is racially polarized, meaning that whites vote differently from blacks who vote differently from Hispanics. Because the secret ballot prevents direct observation of voting patterns, expert witnesses are forced to attempt the dangerous task of drawing inferences about racial voting patterns from precinct-level candidate support counts (column totals) and precinct-level racial voting-age-populations (row totals).

A large literature exists on the ecological inference problem. Bizarrely, one constituency has rarely if ever contributed to this debate: the lawyers and judges who consume a great deal of what the literature produces. I'll be attempting to start to fill this gap in subsequent entries.

Posted by James Greiner at 5:56 AM

January 5, 2006

Election Noise

Drew Thomas

My home country is in chaos - of a sort. With the dissolution of Parliament on November 29, Canada is heading into a federal election.

Because Canada is a multiparty parliamentary democracy, predicting political outcomes there isn't simply a matter of reading a thermometer. Of course, it isn't even that simple in a two-party system, but it gets me thinking about prediction methods.

I've been working with Gary on JudgeIt, a program used to evaluate voting districts for a variety of conditions, designed for a two-party system. With an emphasis on Bayesian simulation, its methods make use of uniform partisan swing -- a shift in the percentage of voters moving from one party to the other, and in the same proportion in each district -- to determine the likely outcomes given a set of covariates and a history of behaviour in the particular system.
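
To illustrate only the idea of uniform partisan swing, here is a deterministic toy version: add the same shift to one party's vote share in every district and recount the seats. This is not JudgeIt's code or its model, which is stochastic and uses covariates; the function names and district numbers are assumptions for the sake of the example.

```python
def apply_uniform_swing(vote_shares, swing):
    """Shift a party's two-party vote share by the same amount in every district,
    clipping the result to the interval [0, 1]."""
    return [min(1.0, max(0.0, share + swing)) for share in vote_shares]

def seats_won(vote_shares):
    """Count districts in which the party holds a majority of the two-party vote."""
    return sum(1 for share in vote_shares if share > 0.5)

# Hypothetical two-party vote shares for one party across five districts.
shares = [0.42, 0.48, 0.51, 0.55, 0.63]
before = seats_won(shares)
after = seats_won(apply_uniform_swing(shares, -0.04))
print(before, after)
```

A four-point swing against the party flips the district it held with 51 percent, taking it from three seats to two.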

What caught my attention was a series of election prediction websites, making use only of previous election information, which allow the user to input what they expect to be either the vote shares or swings in support. This by itself is mathematically unremarkable, but may keep political junkies up for hours.

The real question of interest remains: by what process can a system predict who will gain whose votes in a shift in support? In most Canadian ridings (districts), seats are contested by three parties: from left to right, the socialist New Democrats, the incumbent Liberals and the opposition Conservatives. For the most part, votes lost by an outer party would naturally flow to the Liberals. In this election, however, a scandal which led to the election call may prove to cost the Liberals a good deal of support.

Since geography -- and hence, demography -- dictate much of the Canadian political climate, I have no doubt that the appropriate covariates are out there, waiting to be measured and/or analyzed. In the meantime, I'm keeping my head away from election speculation and looking to see if this problem has already been solved. Anyone out there have any suggestions?

Posted by Andrew C. Thomas at 3:57 AM

January 4, 2006

Experts and Trials IV: Why?

John Friedman

In my previous posts on this subject (see here for the most recent), I have explored our legal system's reliance on expert witnesses from game-theoretic and personal perspectives. In this post, I take an entirely different approach, and ask the question: why is our system structured so?

The first question by many might be: what are the alternatives? The traditional example is the French system, known as the Civil Law system (as opposed to the British-based Common Law system). In France, a government judge acts as would the lawyers, judge, and jury in the American system. This judge calls witnesses suggested by the parties (plus some others of his choosing), questions them himself, and then decides upon the proper course of action. Trials often finish in one day; justice is summarily, if crudely, dispensed.

So why did these two systems develop differently, separated by less than 100 miles of the English Channel? Though many answers surely exist in the historical literature, I offer one theory presented by Edward Glaeser and Andrei Shleifer, both in the Harvard Economics Department. They place the roots of the two legal systems in the political circumstances in England and France in the 12th and 13th centuries, when the first characteristics of these procedures emerged.

The key element of a legal system, argue Glaeser and Shleifer, is its ability to limit the influence of corruption and coercion. Viewed from this perspective, the strengths and weaknesses of juries versus government (then royal) judges become clear. Juries, composed mostly of local commoners, would be subject to much coercion by local feudal lords. Royal magistrates, on the other hand, would be far less susceptible to such forceful persuasion, but would be far more easily bribed by the king. A country's choice between these two systems should depend on which problem is more dire: The threat of regional "bullies" or of royal domination.

Glaeser and Shleifer survey the historical record to argue that exactly this difference existed between England and France in the late middle ages. England, recently conquered by and still under the rule of the Normans, had a much stronger monarchy, which imposed order on the countryside. The smaller lords, with whom King John negotiated the Magna Carta, feared royal domination far more than they feared each other, and were willing to accept the possibility of local bias in juries so that the king would not interfere. France, on the other hand, was far more violent, torn between many competing barons. These dukes feared each other most of all, and knew that any jury would quickly fall under the sway of the local ruler; thus, they were willing to cede control of the legal system to the king.

I am not an historian, and so I cannot know whether these arguments accurately reflect the genesis of our legal system. But even if the true explanation lies elsewhere, surely it will have the same historical feel. These institutions have great inertia, and so it does not surprise me that factors so long ago have explanatory power. Nonetheless, is this the best we can do? Does our legal system reduce to an historical anachronism?

Posted by James Greiner at 3:07 AM

January 3, 2006

Sampling naturalistic data: how much is enough?

Amy Perfors

An issue inherent in studying language acquisition is the sheer difficulty of acquiring enough accurate naturalistic data. In particular, since many questions hinge on what language input kids hear - and what language mistakes and capabilities kids show - it's important to have an accurate way of measuring both of these things. Unfortunately, short of following a child around all day with a tape recorder (which people have done!), it's hard to get enough data to have an accurate record of low-frequency items and productions; it's also hard to know what would be enough. Typically, researchers will record a child for a few hours at a time for a few weeks and then hope that this represents a good "sample" of their linguistic knowledge.

A paper by Caroline Rowland at the University of Liverpool, presented at the BUCLD conference in early November, attempts to assess the reliability of this sort of naturalistic data by comparing it to diary data. Diary data is obtained by having the caregiver write down every single utterance produced by the child over a period of time; as you can imagine, this is difficult to persuade someone to do! There are clear drawbacks to diary data, of course, not least of which is that as the child speaks more and more it becomes less and less accurate. But because it has a much better likelihood of incorporating low-frequency utterances, it provides a good baseline comparison in that respect to naturalistic, tape-recorded data.

What Rowland and her coauthor found is perfectly in line with what is known about statistical sampling. As the subsets of tape-recorded conversations got smaller, estimates of low-frequency terms became increasingly unreliable, and single segments less than three hours were nearly completely useless (as they said in the talk, they were "rubbish." Oh how I love British English!) It is also more accurate to use, say, four one-hour chunks from different conversations rather than one four-hour segment, as the former avoids "burstiness effects" that come from conversations and settings predisposing to certain topics.
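
A back-of-the-envelope calculation shows why short recordings do so poorly on rare items. The sketch below assumes, purely for illustration, that a given construction occurs independently at a constant average rate (a Poisson process); the rate of 0.2 per hour is made up. Real speech is bursty, as noted above, so this simple model says nothing about the advantage of spreading the hours across conversations.

```python
from math import exp

def prob_observed(rate_per_hour, hours):
    """Probability of hearing an item at least once in `hours` of recording,
    assuming occurrences follow a Poisson process with the given hourly rate."""
    return 1.0 - exp(-rate_per_hour * hours)

# Hypothetical rare construction produced about once every five hours.
probs = {hours: prob_observed(0.2, hours) for hours in (1, 3, 10)}
for hours, p in probs.items():
    print(f"{hours:>2} hours: {p:.2f}")
```

Under these assumptions a three-hour sample misses the item more often than it catches it, let alone yields a stable frequency estimate.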

Though this result isn't a surprise from a statistical sampling point of view, it is nice for the field to have some estimates of how little is "too little" (though of course how little depends somewhat on what you are looking for). And the paper highlights important methodological issues for those of us who can't trail after small children with our notebooks 24 hours a day.

Posted by Amy Perfors at 2:19 AM

December 21, 2005

End-of-Year Hiatus

Jim Greiner

With universities out of session and many students away from their offices, the Social Science Statistics Blog will reduce the frequency of its postings. We will resume our at-least-one-per-day schedule in early January. Until then, check back periodically for the occasional entry.

Happy New Year!

Posted by James Greiner at 4:36 AM

December 20, 2005

BUCLD: Statistical Learning in Language Development

Amy Perfors

The annual Boston University Conference on Language Development (BUCLD), held this year on November 4-6, consistently offers a glimpse into the state of the art in language development. The highlight this year for me was a lunchtime symposium titled "Statistical learning in language development: what is it, what is its potential, and what are its limitations?" It featured a dialogue between three of the biggest names in this area: Jeff Elman at UCSD, who studies connectionist models of many aspects of language development; Mark Johnson at Brown, a computational linguist who applies insights from machine learning and Bayesian reasoning to study human language understanding; and Lou-Ann Gerken at the University of Arizona, who studies infants' sensitivity to statistical aspects of linguistic structure.

I was most interested in the dialogue between Elman and Johnson. Elman focused on a number of phenomena in language acquisition that connectionist models capture. One of them is "the importance of starting small," an argument that says essentially that beginning with limited capacities of memory and perception might actually be a helpful way of learning ultimately very complex things because it "forces" the learning mechanism to notice only the broad, consistent generalizations first and not to be led astray by local ambiguities and complications too soon. Johnson seconded that argument, and pointed out that models learning using Expectation Maximization embody this just as well as neural networks do. Another key insight of Johnson's was that statistical models implicitly extract more information from input than purely logical or rule-based models. This is because statistical models generally assume some underlying distributional form, and therefore when you don't see data from that distribution, that is a valuable form of negative evidence. Because there are a number of areas in which people appear not to receive much negative evidence, they must either incorporate or use statistical assumptions or be innately biased toward the "right" answer.

The most valuable aspect of the symposium, however, was its clarification of many of the issues, in statistical learning and in cognitive science generally, that statistical learning can help to answer. Some of these important questions: In any given problem, what are the units of generalization that human learners (and hence our models) should and do use (e.g., sentences, bigram frequencies, words, part-of-speech frequencies, phoneme transitions, etc.)? What is the range of computations that the human brain is capable of (possibly changing at different stages of development)? What statistical and computational models capture these? What is the nature of the input (the data) that human learners see; to what extent does this depend on factors external to them (the world) and to what extent is it due to internal factors (attentional biases, mental capacities, etc.)?

If we can answer these questions, we will have answered a great many of the difficult questions in cognitive science. If we can't, I'd be very surprised if we make much real progress on them.

Posted by Amy Perfors at 2:09 AM

December 19, 2005

Beyond Standard Errors, Part II: What Makes an Inference Prone to Survive Rosenbaum-Type Sensitivity Tests?

Jens Hainmueller

To continue from my previous post on this subject: sensitivity tests are still used rather rarely (though increasingly) in applied research. This is unfortunate, I think, because, at least according to my own tests on several datasets, observational studies vary considerably in their sensitivity to hidden bias. Some results go away once you allow for only a tiny amount of hidden bias; others are rock solid, weathering even the strongest hidden bias. One should always give the reader this information, I think.
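For readers who have not seen one of these tests, here is a minimal sketch of the simplest case: Rosenbaum-style bounds for the sign test on matched pairs. It assumes no tied pairs, and the study in the example (100 pairs, 70 favoring treatment) is made up for illustration. Gamma is the largest factor by which an unobserved covariate is allowed to change the odds of treatment within a pair.

```python
from math import comb

def sign_test_upper_pvalue(n_pairs: int, n_positive: int, gamma: float) -> float:
    """Upper bound on the one-sided sign-test p-value when hidden bias may
    multiply the odds of treatment within a matched pair by at most gamma."""
    p_plus = gamma / (1.0 + gamma)  # worst-case chance that a pair favors treatment
    return sum(
        comb(n_pairs, k) * p_plus**k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_positive, n_pairs + 1)
    )

# Hypothetical study: 100 matched pairs; the treated unit does better in 70.
for gamma in (1.0, 1.5, 2.0, 3.0):
    bound = sign_test_upper_pvalue(100, 70, gamma)
    print(f"Gamma = {gamma}: p-value at most {bound:.4f}")
```

With Gamma = 1 the bound is just the ordinary sign-test p-value; the Gamma at which the bound crosses the chosen significance level is the amount of hidden bias the result can tolerate.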

One (and maybe not the most important) reason why these tests are infrequently used is that they take time and effort to compute. So I was thinking, instead of computing the sensitivity tests each time, maybe it would be good to have some quick rules of thumb to judge whether a study is insensitive to hidden bias.

Imagine you have two studies with identical estimated effect sizes and standard errors. Now, which one would you trust more regarding insensitivity to hidden bias? In other words, are there particular features of the data that make an inference drawn from them excel on Rosenbaum-type sensitivity tests? The literature I have read thus far provides little guidance on this issue.

We have a few ideas about this (which are still underdeveloped). For example, ceteris paribus, one could think that it’s better to have a rather imbalanced vector of treatment assignments (like only a few treated or only a few control). Another idea is that, ceteris paribus, inferences obtained from a smaller (matched) dataset should be less prone to get knocked over by hidden bias tests. Or, in the case of propensity score methods, one would like covariates that strongly predict treatment assignment so that an omitted variable cannot tweak the results much.

This is very much still work in progress; comments and feedback are highly appreciated.

Posted by James Greiner at 6:06 AM

December 16, 2005

Redistricting and Electoral Competition: Part II

John Friedman and guest blogger Richard Holden

Yesterday, we blogged about whether gerrymandering or something else is a principal cause of low turnover in the House of Representatives and other elected bodies. We continue that discussion today.

How can we determine whether gerrymandering is the culprit, given that any number of reasons could account for the increase in the incumbent reelection rate? The key is that redistricting usually happens only once each decade (at least until the recent controversies in Texas). Other factors, such as money or electoral polarization, tend to change more smoothly over time. One can tease these factors apart with a "regression discontinuity" approach, separating the time series into 1) a smooth function and 2) jumps at the time of gerrymandering.

In a recent paper (available here), we find that redistricting has actually slightly reduced incumbent reelection rates over time. We also look to see if there are systematic differences between "bipartisan" gerrymanders, designed to protect incumbents from both parties, and "partisan" gerrymanders, in which one party attempts to leverage its support into more representation in the state's Congressional delegation. There is no evidence that the incumbent reelection rate responds differently after any of these forms of redistricting.
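The idea of splitting the series into a smooth part and jumps can be sketched in a few lines. This is not the specification from our paper: the data are synthetic, the quadratic trend is an arbitrary stand-in for "a smooth function," and the jump is modeled (as an assumption) by an indicator for the first election after each decennial redistricting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): biennial elections, 1950-2004.
years = np.arange(1950, 2006, 2)
t = (years - years.mean()) / 10.0            # centered time, in decades
# Indicator for the first election after each decennial redistricting.
post_redistricting = (years % 10 == 2).astype(float)

true_jump = -1.0                              # made-up effect, in percentage points
reelection_rate = 90 + 1.5 * t - 0.2 * t**2 + true_jump * post_redistricting
reelection_rate += rng.normal(0, 0.5, size=years.size)

# Design matrix: quadratic trend (the smooth part) plus the jump indicator.
X = np.column_stack([np.ones_like(t), t, t**2, post_redistricting])
coef, *_ = np.linalg.lstsq(X, reelection_rate, rcond=None)
print(f"estimated jump at redistricting: {coef[3]:.2f} (true value {true_jump})")
```

Slow-moving factors such as money or polarization are absorbed by the trend terms, so the coefficient on the indicator picks up only what changes abruptly in redistricting years.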

This research suggests that factors other than redistricting are the more important culprits in today's lack of electoral competition. In some sense, this isn't all that surprising. While the technology available has become more advanced, so have the constraints on gerrymanderers. Supreme Court decisions interpreting the 14th amendment and the Voting Rights Act have consistently narrowed the bounds within which redistricting must occur.

There may, of course, be other reasons to support independent commissions. For instance, they tend to create more geographically compact districts. Neutral bodies also help to avoid the most extreme cases of partisan gerrymandering, in which the neighborhood of an incumbent is grouped with distant voters in a tortuously shaped district. Perhaps most importantly, independent commissions may be able to ensure minority representation - though the Voting Rights Act also plays a fundamental role in this area.

The basic premise of supporters of non-partisan commissions - that political competition is important - is a sound one. But the evidence suggests that these advocates are focused in the wrong place. The redistricting process is far from the only cause of limited competition.

To increase competition in elections for Congress and state legislatures, we must pay more attention to other potential causes of the increase in the incumbent reelection rate. We must better understand how factors such as money, television, and candidate quality impact elections. But if we can direct towards these aspects of democracy the same spirit of reform that now supports the drive towards independent redistricting commissions, new and more promising solutions can't be far away.

Posted by James Greiner at 2:53 AM

December 15, 2005

Redistricting and Electoral Competition: Part I

John Friedman and guest blogger Richard Holden

On Election Day, 2005, more than 48 million people in three states voted on whether non-partisan commissions, rather than elected state politicians, should conduct legislative redistricting. Though these initiatives were defeated, the popular movement towards non-partisan redistricting is gaining strength. Activists point to the systems in Iowa and Arizona - currently the only states without serious legislative involvement - as the way of future redistricting in this country.

The non-partisan commission ballot initiatives - Proposition 77 in California and Issue Four in Ohio - were major policy items in the respective states. (The initiative in Florida was citizen-sponsored and attracted less attention.) California Governor Arnold Schwarzenegger placed the issue at the heart of his plan of reform, commenting that "Nothing, absolutely nothing, is more important than the principle of 'one person - one vote.'" These measures have also received broad bipartisan support from politicians, organized interest groups, and grassroots organizations. Though partisan political interest has played a role, many of these groups support the proposed move to independent commissions out of a sincere desire to increase competitiveness in the political system. Unfortunately, the latest academic research suggests that this well-intentioned effort is misplaced: Gerrymandering has not caused the increasing trend of low legislative turnover in the Congress.

Proponents of independent commissions argue that redistricting by politicians has led to a vast rise in incumbent reelection rates. For instance, in the US House of Representatives, members are reelected at a staggering 98% rate. Prior to World War II, that rate hovered around 85%. Many in favor of independent commissions argue that new technologies available to redistricters, such as sophisticated map drawing software, have allowed bi-partisan gerrymandering. Incumbents band together to protect each other's electoral prospects, creating impregnable districts packed with supporters. As Bob Stern of the non-partisan Center for Governmental Studies has said, "This lack of competition is due significantly to the legislature's decision to redraw electoral districts to protect party boundaries."

There are, however, a number of other factors that might explain the increase in incumbent reelection rates. For instance, there is a lot more money in politics than in the past. Incumbents, who usually have greater fund raising ability, raise large war chests for their campaigns. A more polarized electorate can also increase incumbent reelection rates because there are fewer swing voters for a potential challenger to persuade. Growing media penetration in the post-war period provides incumbents with free advertising, further increasing their prospects. All of these effects are magnified when more qualified challengers choose not to run against incumbents benefiting from these factors.

Tomorrow, we'll continue with our discussion of alternative explanations of low electoral turnover, plus a little about what we might do about it.

Posted by John N. Friedman at 2:25 AM

December 14, 2005

Consumer Demand for Labor Standards, Part III

Michael Hiscox and Nicholas Smyth, guest bloggers

Continuing our discussion begun yesterday and the day before on labor standards labeling, perhaps the most important comments we received at the workshop had to do with how we might design our next set of experiments. It is very difficult to do anything fancy when it comes to in-store experiments. It could never be practical (and ABC would never give permission) for us to randomize treatments to individual items or brands on, say, a daily basis. The manner in which products are displayed (grouped by brand), the costs associated with altering labels and prices, and the potential problems for sales staff (and the potential complaints from frequent customers) impose severe constraints. Several workshop participants suggested that we conduct the next set of experiments through an online retailer. That way we might be able to randomly assign labels (and prices) to customers when they view product information on a website and decide whether or not to make a purchase. There would still be plenty of difficulties to iron out, as was quickly noted (e.g., making allowances for customers who attempt to return to the same product page at a later point in time, and for customers who "comparative shop" for products at multiple retailers). But this seems like the way to proceed in the future.
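One way an online retailer could randomize while handling returning customers is to derive the treatment from a hash of the customer and product identifiers. The sketch below is only an illustration: the arm names, the identifiers, and the salt are all hypothetical, and it assumes the retailer has some stable customer ID (an account or cookie).

```python
import hashlib

# Hypothetical treatment arms for a labeling experiment.
ARMS = ("no_label", "label", "label_plus_10pct_price")

def assign_arm(customer_id: str, product_id: str, salt: str = "experiment-1") -> str:
    """Map a (customer, product) pair to a treatment arm.

    Hashing makes the draw effectively random across customers but stable
    for any one customer, so a repeat visit shows the same label and price.
    """
    key = f"{salt}:{customer_id}:{product_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return ARMS[int.from_bytes(digest[:8], "big") % len(ARMS)]

first_visit = assign_arm("customer-42", "towel-blue")
return_visit = assign_arm("customer-42", "towel-blue")
print(first_visit, first_visit == return_visit)
```

This does nothing about customers who comparison shop at other retailers; that problem would need a different fix.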

On a related theme, we noted that Ezaria, an online retailer run by Harvard students, is already planning to track a variety of economic data on its customers. Ezaria has a mission which involves providing markets for independent artisans from the developing world and donating 25% of profits to charity. At a minimum, looking at data on whether a customer is more likely to make a purchase after being shown the company's "mission" page (that explains their policies) would provide some measure of consumer demand for companies that source from high-standard producers. Perhaps we can persuade Ezaria to cooperate with us in a future experimental project. Or perhaps we can arrange the experiment with an even larger online retailer, with customers who are not so obviously self-selected as socially conscious.

Posted by James Greiner at 3:49 AM

December 13, 2005

Consumer Demand for Labor Standards, Part II

Michael Hiscox and Nicholas Smyth, guest bloggers

We continue yesterday's entry discussing questions that arose during our recent presentation of our paper on consumer demand and labor standards labeling.

Another excellent question that was raised in the discussions concerned the evidence that sales of our labeled items actually rose (relative to sales of unlabeled control products) when their prices were raised. We have been interpreting this as evidence that consumers regarded the label as more credible when the product was more expensive relative to alternatives, since they expect to pay more for higher labor standards. One question was whether relative sales would have risen with price increases for any good (labeled or unlabeled) just because higher prices can signal better quality. Since we did not raise the price of unlabeled items, we cannot address this concern directly. It is not critical to one of our main findings: sales of labeled items increased markedly relative to sales of unlabeled alternatives when the labels were put in place (before prices were adjusted). But we will try to track down the research on the price-quality issue in the literature on consumer psychology. Our basic assumption is that the existing (equilibrium) product prices and sales levels at ABC (in the "baseline" period) accurately reflected the relative quality of treatment and control products.

Other questions raised concerned the evidence we discussed in the paper about the marked increase in sales of Fair Trade Certified coffee. It was pointed out that, to the extent that retailers like Starbucks are marketing only fair trade coffee as the brewed "coffee of the day" this seems more like a general CSR strategy by the firm and not a sign of demand for improved standards. We were really talking about sales of certified coffee beans, rather than brewed coffee. The labeled beans are sold in direct competition with similar (unlabeled) beans at both Starbucks and Peets. But it is important that we check the data and see if we can discriminate clearly between sales in different categories.

In general, we felt we have to do better in accounting for seasonal patterns in demand for home furnishings at ABC and how they might bear on our findings. This is obviously not a problem for our core results that hinge on the ratio of sales of labeled brands to unlabeled brands during each phase of the experiment. But for measuring price elasticities using changes in absolute sales of labeled items over time we would like to allow for the fact that sales of home furnishings were expected to dip during the summer months. To do this, we will probably need to estimate weekly sales for each brand using all the data we have from ABC prior to the start of our experiment (covering sales in 2004 and the first half of 2005). The relevant covariates would probably include recorded levels of total foot traffic in the store, total sales of other store products, some national or regional measures of economic activity and consumer confidence, variables accounting for any special sales and promotional campaigns, and seasonal dummies. We can then compare actual (absolute) sales of labeled brands with out-of-sample predictions based upon the estimations and thereby gauge the impact of our experimental treatments.
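The out-of-sample comparison described above can be sketched as follows. Everything here is synthetic: the numbers are invented, only two of the covariates we listed (foot traffic and a summer dummy) are included to keep it short, and the "treatment effect" of +6 units is made up so that the sketch has something to recover.

```python
import numpy as np

rng = np.random.default_rng(1)

def design(foot_traffic, summer):
    """Intercept, store foot traffic, and a summer dummy (a subset of the
    covariates listed above, chosen only to keep the sketch short)."""
    return np.column_stack([np.ones(len(foot_traffic)), foot_traffic, summer])

# Synthetic baseline period (hypothetical numbers): 78 weeks before the experiment.
weeks = np.arange(78)
summer_base = ((weeks % 52 >= 22) & (weeks % 52 < 35)).astype(float)
traffic_base = rng.normal(1000, 100, size=78)
sales_base = 5 + 0.02 * traffic_base - 4 * summer_base + rng.normal(0, 1, size=78)

# Fit the sales model on pre-experiment data only.
beta, *_ = np.linalg.lstsq(design(traffic_base, summer_base), sales_base, rcond=None)

# Synthetic experimental period: 8 summer weeks with a made-up effect of +6.
traffic_exp = rng.normal(1000, 100, size=8)
summer_exp = np.ones(8)
sales_exp = 5 + 0.02 * traffic_exp - 4 * summer_exp + 6 + rng.normal(0, 1, size=8)

# Compare actual sales with what the baseline model predicts out of sample.
predicted = design(traffic_exp, summer_exp) @ beta
print(f"mean actual minus predicted sales: {np.mean(sales_exp - predicted):.2f}")
```

Because the summer dip is estimated from the baseline period, the gap between actual and predicted sales is not contaminated by the seasonal decline.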

We will conclude our discussion in tomorrow's post.

Posted by James Greiner at 4:46 AM

Applied Statistics - No Meeting

There will be no session of the Applied Statistics workshop on Wednesday, December 14; the talk originally scheduled for this date will be rescheduled for next semester. Our next session will be held on Wednesday, February 1. We hope to see you then!

Posted by Mike Kellermann at 12:00 AM

December 12, 2005

Consumer Demand for Labor Standards, Part I

Michael Hiscox and Nicholas Smyth, guest bloggers

We are very grateful to all the members of the Applied Statistics Workshop for inviting us to present our paper (abstract here) in the workshop this week. Thanks, especially, to Mike Kellermann for organizing everything and playing host. This was the first time we have presented the results from our experiments, and we received some very valuable feedback and suggestions for future work on this topic. One important question that was raised was why we do not simply assume that firms already know how much consumer demand there is for good labor standards. That is, if firms could make a buck doing this sort of thing, why not assume they would already be doing it? We think there are probably a couple of answers to this question. As we noted at the workshop (and in the paper), credible labeling would require cooperation from, and coordination with, independent non-profit organizations that could certify labor standards in factories abroad. So part of the issue here for firms is the uncertainty surrounding whether such organizations would be willing and able to take on such a role. The uncertainty about establishing a credible labeling scheme with cooperation from independent groups, on top of the uncertainty about consumer demand itself, may explain why firms are not doing as much research in this area as (we think) is warranted.

The other answer, or part of the answer, is that many firms may consider it too risky to do market research on labor standards labeling. We talked a little about how many firms refused to participate in our labeling experiments because they could not vouch for labor standards in all the factories from which they source and they were anxious about negative publicity if consumers or activist groups became curious about unlabeled items in their stores. Note that this is not evidence that labeling strategies must also be too risky for firms to ever contemplate. The risks of doing research on this issue are not identical to the risks attached to actually adopting a labeling strategy (which depend on what the research can tell us about consumer demand, and on whether a firm decides to switch to selling only labeled products or some combination of labeled and unlabeled products, etc.).

More on our paper and the questions that arose in the presentation tomorrow.

Posted by James Greiner at 2:38 AM

December 9, 2005

What Did (and Do We Still) Learn from the La Londe Dataset (Part II)?

Jens Hainmueller

I ended yesterday's post about the famous La Londe dataset with the following two questions: (1) What have we learned from the La Londe debate? (2) Does it make sense to beat this dataset any further, or have we essentially exhausted the information that can be extracted from this data and need to move on to new datasets?

On the first point, VERY bluntly summarized, the comic strip history goes somewhat like this. First, La Londe showed that regression and IV do not get it right. Next, Heckman's research group released a string of papers in the late 80s and 90s trying to defend conventional regression and selection-based methods. Enter Dehejia and Wahba (1999). They showed that, apparently, propensity score methods (sub-classification and matching) get it right if one controls for more than one year of pre-intervention earnings. Smith and Todd (2002, 2004) are next in line, claiming that propensity score methods do not get it right: once one slightly tweaks the propensity score specification, the results are again all over the place. The ensuing debate spawned more than five papers as Rajeev Dehejia replied to the Smith and Todd findings (all papers of this debate can be found here). Then, last but not least, Diamond and Sekhon (2005) argue that matching does get it right if it’s done properly, namely if one achieves a really high standard of balance (we’ve already had quite a controversy about balance on this very blog; see for example here).

So what does this leave applied researchers with? What do we take away from the La Londe debate? Does anyone still think that regression (or maximum likelihood methods more generally) and/or 2-stage least squares IV produce reliable causal inferences in real world observational studies? In all seriousness, where is the validation? This is the million-dollar question, because MLE and IV methods represent the great majority of what is taught and published across the social sciences. Also, can we trust propensity score methods? How about other matching methods? Or is there little hope for causal inference from observational data in any case (in which case I fear we are all out of a job, and the philosophers get the last laugh)? This is not necessarily my personal opinion, but I would be interested to hear people’s opinions. [The evidence is of course not limited to La Londe; there is ample evidence from other studies with similar findings. For example see Friedlander and Robins (1995), Fraker and Maynard (1987), Agodini and Dynarski (2004), Wilde and Hollister (2002) and various Rubin papers to name just a few].

On the second point, let me play the devil’s advocate again and ask: What can we still learn from the La Londe data? After all it’s just one single dataset, the standard errors even for the experimental dataset are large, and once we match in the observational data, why would we even expect to get it right? There is obviously a strong case to be made for selection on unobservables in the case of the job training experiment. So even if we manage to adjust observed differences, why in the world should we get the estimate right? [Again, this is not my personal opinion, but I have heard a similar contention both at a recent conference and in Stat 214.] Maybe instead of a job training experiment, we should first use experimental and observational data on something like plants or frogs, where hidden bias may (!) be less of a problem (given this is actually the case)? Finally, what alternatives do we have—how would we know what the right answer was if we were not working with a La Londe-esque framework? Again, I would be interested in everybody’s opinion on this point.

Posted by James Greiner at 6:14 AM

December 8, 2005

What Did (and Do We Still) Learn from the La Londe Dataset (Part I)?

Jens Hainmueller

In a pioneering paper, Bob La Londe (1986) used experimental data from the National Supported Work Demonstration Program (NSW) as well as observational data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID) to evaluate the reliability with which conventional estimators recover the experimental target estimate. He utilized the experimental data to establish a target estimate of the average treatment effect, then replaced the experimental controls with several control groups built from the general population surveys. Then he re-estimated the effects using conventional estimators. His crucial finding was that conventional regression as well as tweaks such as instrumental variables etc. get it wrong, i.e. they do not reliably recover the causal effects estimated in the experimental data. This is troubling, of course, because usually we do not know what the correct answer is, so we simply accept the estimates that our conventional estimators spit out, not knowing how wrong we may be.

This finding (and others) sparked a fierce debate in both econometrics and applied statistics. Several authors have used the same data to evaluate other estimators, such as several matching estimators and related techniques. In fact, today the La Londe data is THE canonical dataset in the causal inference literature. It has not only been used for many articles, it has also been widely distributed as a teaching tool. I think it’s about time we stand back for a second and ask two essential questions: (1) What have we learned from the La Londe debate? (2) Does it make sense to beat this dataset any further, or have we essentially exhausted the information that can be extracted from this data and need to move on to new datasets? I wholeheartedly invite everybody to join the discussion. I will provide some suggestions in a subsequent post tomorrow.

Posted by Jens Hainmueller at 4:33 AM

December 7, 2005

Applied Statistics - Michael Hiscox and Nicholas Smyth

Today, the Applied Statistics Workshop will present a talk by Michael Hiscox and Nicholas Smyth of the Harvard Government Department. Professor Hiscox received his Ph.D from Harvard in 1997 and taught at the University of California at San Diego before returning to Harvard in 2001. His research interests focus on political economy and international trade, and his first book, International Trade and Political Conflict, won the Riker Prize for the best book in political economy in 2001. Nicholas Smyth is a senior in Harvard College concentrating in Government. He is an Undergraduate Scholar in the Institute for Quantitative Social Science. Hiscox and Smyth will present a paper entitled "Is There Consumer Demand for Improved Labor Standards? Evidence from Field Experiments in Social Labeling," based on joint research conducted this summer with the support of IQSS. The presentation will be at noon on Wednesday, December 7 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The abstract of the paper follows on the jump:

A majority of surveyed consumers say they would be willing to pay extra for products made under good working conditions abroad rather than in sweatshops. But as yet there is no clear evidence that enough consumers would actually behave in this fashion, and pay a high enough premium, to make “social labeling” profitable for firms. Without clear evidence along these lines, firms and other actors (including independent groups that monitor and certify standards) may be unwilling to take a risk and invest in labeling. We provide new evidence on consumer behavior from experiments conducted in a major retail store in New York City. Sales rose dramatically for items labeled as being made under good labor standards, and demand for these products was very inelastic for price increases of up to 20% above baseline (unlabeled) levels. Estimated elasticities of demand for labeled towels, for example, ranged between -0.36 and -1.78. Given the observed demand for labor standards, it appears that many retailers could raise their profits by switching to labeled goods. If adopted by a large number of firms, this type of labeling strategy has the potential to markedly improve working conditions in developing nations without slowing trade, investment, and growth.
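For readers less familiar with elasticities, the sketch below shows how one such number is computed and read. It uses the midpoint (arc) formula, which is an assumption made here for illustration and not necessarily the estimator used in the paper; the prices and quantities are invented.

```python
def arc_elasticity(q0: float, q1: float, p0: float, p1: float) -> float:
    """Midpoint (arc) price elasticity of demand between two price points."""
    pct_quantity = (q1 - q0) / ((q0 + q1) / 2.0)
    pct_price = (p1 - p0) / ((p0 + p1) / 2.0)
    return pct_quantity / pct_price

# Hypothetical: a 20% price rise from $10 to $12; weekly sales fall from 100 to 93.
e = arc_elasticity(q0=100, q1=93, p0=10.0, p1=12.0)
print(f"elasticity = {e:.2f}; demand is {'inelastic' if abs(e) < 1 else 'elastic'}")
```

Demand is called inelastic when the elasticity is smaller than 1 in absolute value, which is the range in which a price increase raises revenue.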

Posted by Mike Kellermann at 10:27 AM

Fun with R²

Mike Kellermann

This semester, I have been one of the TFs for Gov 2000 (the introductory statistics course for Ph.D. students in the Government Department). It is the first time that I've been on the teaching staff for a course, and it has been quite an experience so far. We've spent the past month or so introducing the basic linear model. Along the way, Ryan Moore (the other TF) and I have had some fun sharing the best quotes that we've come across about everyone's favorite regression output, R²:

Nothing in the CR model requires that R² be high. Hence a high R² is not evidence in favor of the model, and a low R² is not evidence against it. Nevertheless, in empirical research reports, one often reads statements to the effect that "I have a high R², so my theory is good," or "My R² is higher than yours, so my theory is better than yours." (Arthur Goldberger, A Course in Econometrics, 1991)
Thus R² measures directly neither causal strength nor goodness of fit. It is instead a Mulligan Stew composed of each of them plus the variance of the independent variable. Its use is best restricted to description of the shape of the point cloud with causal strength measured by the slopes and goodness of fit captured by the standard error of the regression. (Chris Achen, Interpreting and Using Regression, 1982)
Q: But do you really want me to stop using R²? After all, my R² is higher than all of my friends and higher than those in all the articles in the last issue of APSR!
A: If your goal is to get a big R², then your goal is not the same as that for which regression analysis was designed. The purpose of regression analysis and all of parametric statistical analyses is to estimate interesting population parameters....
If the goal is just to get a big R², then even though that is unlikely to be relevant to any political science research question, here is some "advice": Include independent variables that are very similar to the dependent variable. The "best" choice is the dependent variable; your R² will be 1.0. (Gary King, "How not to lie with statistics," AJPS, 1986).

So this is old news, right? Maybe not. Quite possibly the thing that has surprised me the most so far is just how much students want R² to tell them how good their model is. You could almost see the anguish in their faces as we read these quotes to them, particularly among those who have taken some statistics in the past. The question I want to throw out is, why is R² such an attractive number? Why do we want to believe it? Maybe our cognitive science colleagues have some insight....
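For anyone who wants to see the point of the Achen and King quotes in action, here is a small simulation sketch. The data are synthetic and the numbers (slope of 2, unit error variance, the two spreads of x) are arbitrary choices for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared from an OLS fit of y on X (X already includes an intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n = 500
noise = rng.normal(0, 1, size=n)

# Same slope (2) and same errors in both runs; only the spread of x differs.
r2_by_spread = {}
for spread in (0.5, 5.0):
    x = rng.normal(0, spread, size=n)
    y = 1 + 2 * x + noise
    X = np.column_stack([np.ones(n), x])
    r2_by_spread[spread] = r_squared(X, y)
    print(f"sd(x) = {spread}: R^2 = {r2_by_spread[spread]:.2f}")

# King's tongue-in-cheek advice: use the dependent variable as a regressor.
X_self = np.column_stack([np.ones(n), y])
print(f"y regressed on y: R^2 = {r_squared(X_self, y):.2f}")
```

The causal strength (the slope) and the fit (the error variance) are identical in the two runs, yet R-squared differs sharply because the variance of the independent variable differs.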

Posted by Mike Kellermann at 5:00 AM

December 6, 2005

The BLOG inference engine

Amy Perfors

There are two ways of thinking about almost anything. Consider family and kinship. On the one hand, we all know certain rules about how people can be related to each other -- that your father's brother is your uncle, that your mother cannot be younger than you. But you can also do probabilistic reasoning about families -- for instance, that grandfathers tend to have white hair, that it is extremely unlikely (but possible) for your mother to also be your aunt, or that people are usually younger than their uncles (but not always). These aren't logical inferences; they are statistical generalizations based on the attributes of families you have experienced in the world.

Though the statistics-rule dichotomy still persists in a diluted form, today many cognitive scientists are not only recognizing that people can do both types of reasoning much of the time but also beginning to develop behavioral methods and statistical and computational models that can clarify exactly how they do it and what that means. The BLOG inference engine, whose prototype was released very recently by Stuart Russell's computer science group at Berkeley, is one of the more promising computational developments for this goal.

BLOG (which stands for Bayesian LOGic, alas, not our kind of blog!) is a logical language for generating objects and structures, then doing probabilistic inference over those structures. So for instance, you could specify objects, such as people, with rules for how those objects could be generated (perhaps a new person (a child) is generated with some probability from two opposite-gender parents), as well as how attributes of these objects vary. For example, you could specify that certain attributes of people depend probabilistically on family structure - if you have a parent with that attribute, you're more likely to have that attribute yourself. Other attributes might also be probabilistically distributed, but not based on family structure: we know that 50% of people are male and 50% are female regardless of the nature of their parents.

The power of BLOG is that it allows you both to specify quite complex generative models and interesting logical rules and to do probabilistic inference given the rules you've set up. Using BLOG, for instance, you could ask things such as the following. If I find a person with Bill's eyes, what is the probability that this person is Bill's child? Is it possible for Bill's son to also be his daughter?

Though a few things are unexpectedly difficult in BLOG - reasoning about symmetric relations like "friend," for instance - I think it promises to be a tremendously valuable tool for anyone interested in how people do probabilistic reasoning over structures/rules, or in doing it themselves.

Posted by Amy Perfors at 3:01 AM

December 5, 2005

Anchoring Vignettes (II)

Sebastian Bauhoff

In my last post I mentioned how differences in expectations and norms could affect self-rated responses in surveys. One fix is to use anchoring vignettes that let the interviewer control the context against which ratings are made.

For example, in a 2002 paper on the use of vignettes in health research, Salomon, Tandon and Murray ask respondents to rank their own difficulty in mobility on a scale from 'no difficulty' to 'extreme difficulty'. Then they let respondents apply the same scale to some hypothetical persons using descriptions like these:

"Paul is an active athlete who runs long distances of 20km twice a week and plays soccer with no problems."

"Mary has no problems walking, running or using her hands, arms, and legs. She jogs 4km twice a week."

Using the difference in how people assess these controlled scenarios, one can adjust the rating of people's own health. Doing this across or within various populations then allows one to examine systematic differences across groups. These vignettes have been used in recent World Health Surveys in a number of countries.
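
One simple way to see how vignette ratings can rescale a self-report is to recode each self-rating by where it falls relative to that respondent's own vignette ratings. The sketch below is only an illustration in that spirit, not the estimator from any of the papers mentioned here. The function name `relative_rank`, the 1-to-5 difficulty scale, and the example ratings are all made up for the example, and it assumes each respondent rates the vignettes in a strictly increasing order.

```python
def relative_rank(self_rating, vignette_ratings):
    """Place a self-rating relative to one respondent's vignette ratings.

    vignette_ratings must be strictly increasing (hypothetical convention:
    vignettes sorted from least to most difficulty). With J vignettes the
    result runs from 1 (below every vignette) to 2J + 1 (above every vignette).
    """
    if any(a >= b for a, b in zip(vignette_ratings, vignette_ratings[1:])):
        raise ValueError("vignette ratings must be strictly increasing")
    rank = 1
    for z in vignette_ratings:
        if self_rating < z:
            return rank        # strictly below this vignette
        if self_rating == z:
            return rank + 1    # tied with this vignette
        rank += 2              # strictly above it; move to the next one
    return rank


# Made-up data: both respondents report "3" on a 1-5 difficulty scale,
# but they use the scale differently when rating the same two vignettes.
respondent_a = relative_rank(3, [2, 4])  # falls between the two vignettes
respondent_b = relative_rank(3, [4, 5])  # falls below both vignettes
print(respondent_a, respondent_b)
```

The two identical raw answers end up in different positions once each is read against the respondent's own use of the scale, which is the kind of incomparability the vignettes are meant to expose.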

King, Murray, Salomon and Tandon introduced the vignettes approach and used the measured differences to correct responses to self-rated questions on political efficacy. The idea is that applying the vignettes to a sub-sample is cheap and sufficient to understand systematic differences in self-reports. Their methods are laid out in the paper, but the results show how much difference the vignettes method can make: instead of suggesting that there is a higher level of political efficacy in China than in Mexico (as self-reports would indicate), the vignette method shows the exact opposite because the Chinese have lower standards for efficacy and thus understand the scale differently.

Intuitively that's what we do all the time: once you have talked to enough Europeans and Americans about their (and other people's) well-being, you use your mental model to adjust responses and stop taking the European's minor complaints too seriously. Using this insight in survey-based research can make a huge difference too.

Posted by James Greiner at 6:41 AM

December 2, 2005

Questions about Free Software

Jim Greiner

This past spring at Harvard, a group of students from a variety of academic disciplines agitated for a course in C, C++, and R focusing on implementing iterative statistical algorithms such as EM, Gibbs sampling, and Metropolis-Hastings. The result was an informal summer class sponsored by IQSS and taught by recent Department of Statistics graduate Gopi Goswami. Professor Goswami created (from scratch) class notes, problem sets, and sample programs as well as compiling lists of web links and other useful materials. Course participants came from, among other places, Statistics, Biostatistics, Government, Japanese Studies, the Medical School, the Kennedy School, and Health Policy. For those interested in the lecture slides and other materials Professor Goswami compiled, the link is here. Principal among the subjects taught in the course was how to marry R's data-processing and display capabilities to an iterative inferential engine (try saying that phrase quickly three times) such as an EM or a Gibbs, with the latter written in C or C++ so as to increase (vastly) the speed of runs. In other words, we learned how to have R do the front end (data manipulation, data formatting) and back end (analysis of results, graphics) of an analysis while letting a faster language do the hard work in the middle.

The course both demonstrates and facilitates a growing trend in the quantitative social sciences toward making open-source software stemming from scholarly publications freely available to the academic community. Two examples from the ever-expanding field of ecological inference are Gary King's EI program, based on a truncated bivariate normal model and implemented in GAUSS, and Kosuke Imai and Ying Lu's Dirichlet-process-based model, implemented with an R-C interface.

The trend toward freely available, model-specific software has obvious potential upsides. Previously written code can save the time of a user interested in applying the model. Moreover, if the code is used often enough and potential bugs are reported and fixed, the software may become better than what a potential user could write on his or her own. After all, few of us interested in answers to real-world issues want to spend the rest of our lives coding in C.

Nevertheless, I confess to a certain amount of apprehension. For me at least, freely available, model-specific software provides a temptation to use models I do not fully understand. Relatedly, I often think that I do understand a model fully, that I grasp all of its strengths and weaknesses, only to discover otherwise when I sit down to program it. Finally, oversight, hubris, or a desire to make accompanying documentation readable may cause the author of the software not to describe fully details of implementation or compromises made therein. Thus, while I am excited by the possibilities freely available social science software holds, I worry about the potential for misuse as well.

Posted by James Greiner at 6:00 AM

December 1, 2005

Anchors Down (I)

Sebastian Bauhoff

"How's it going?" If you have ever tried to compare the answer to this question between the average American ("great") and European ("so-so" followed by a list of minor complaints), you have hit directly on a big problem in measuring self-reported variables.

Essentially the responses to questions on self-reported health, political voice and so on are determined not only by differences in actual experience, but also by differences in expectations and norms. For a European "so-so" is a rather acceptable status of wellbeing whereas for Americans it might generate serious worries. Similarly people's expectations about health may change with age and responses can thus be incomparable within a population (see this hilarious video on Gary King's website for an example).

A way to address this problem in surveys is to use "anchoring vignettes" that let people compare themselves on some scale, and then also ask them to assess hypothetical people on the same scale. The idea is that ratings of the hypothetical persons reflect the respondents' norms and expectations similarly to the rating of their own situation. Since the hypothetical scenarios are fixed across the respondents any difference in response for the vignettes is due to the interpersonal incomparability.

Using vignettes is better than asking people to rank themselves on a scale from "best" to "worst" health because it makes the context explicit and puts it in control of the experimenter. Gary and colleagues have done work on this issue which shows that using vignettes can lead to very different results than self-reports (check out their site). I will write more on this in the next entry.

Posted by Sebastian Bauhoff at 2:21 AM

November 30, 2005

Missing Women and Sex-Selective Abortion

You, Jong-Sung

The problem of “missing women” in many developing countries reflects not just gender inequality but a serious violation of human rights, as Amartya Sen reported in his book Development as Freedom (1999). It refers to the phenomenon of excess mortality and artificially lower survival rates of women. Particularly disturbing is the practice of sex-selective abortion, which has become quite widespread in China and South Korea.

Statistical analysis, in particular examination of anomalies in a distribution of interest, can give compelling evidence of crime or corruption. If nine out of ten babies delivered at a hospital are boys, we must have a strong suspicion that the doctor(s) in the hospital conduct(s) sex-selective abortion. It may not be evidence sufficient for a conviction, but it probably is sufficient grounds for investigation. What, then, would be a good guide for deciding when to investigate? Applying a threshold of a certain percentage will not be a good idea, because the probability of 6 or more boys out of 10 babies is much larger than the probability of 600 or more boys out of 1000 babies. So, an appropriate guide may be the binomial probability distribution.

Suppose the probability of producing boy or girl is exactly 50 percent. Then, the probability of producing six or more boys out of ten babies is 37.7 percent, while the probability of producing 60 or more boys out of 100 babies is only 2.8 percent in the absence of some explanatory factor (probably sex-selective abortion). The probability of producing 55 or more boys out of 100 babies is 18.4 percent, but the probability of producing 550 or more boys out of 1000 babies is only 0.09 percent in the absence of some explanatory factor (again, probably sex-selective abortion). If the police decide to investigate the hospitals whose share of boy births exceeds a certain percentage, say 60 percent, then many honest small hospitals will be investigated, while large hospitals that really engage in sex-selective abortion may avoid investigation.
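
For readers who want to check these tail probabilities, here is a minimal sketch in Python. It assumes, as above, that each birth is a boy with probability exactly one half and that births are independent; the function name `prob_at_least` is just a label chosen for this example.

```python
from math import comb


def prob_at_least(k_min, n):
    """P(at least k_min boys among n births) when P(boy) = 1/2 for every birth."""
    favorable = sum(comb(n, k) for k in range(k_min, n + 1))
    return favorable / 2 ** n  # exact integer counts, one final division


for k_min, n in [(6, 10), (60, 100), (55, 100), (550, 1000)]:
    print(f"{k_min}+ boys out of {n}: {100 * prob_at_least(k_min, n):.2f} percent")
```

The same function could be used to set a hospital-specific threshold: for a given number of births, pick the smallest count of boys whose tail probability falls below whatever level is judged to warrant a closer look.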

Posted by Jong-sung You at 5:59 AM

November 29, 2005

Beyond Standard Errors, Part I: What Makes an Inference Prone to Survive Rosenbaum-Type Sensitivity Tests?

Jens Hainmueller

Stimulated by the lectures in Statistics 214 (Causal Inference in the Biomedical and Social Sciences), Holger Kern and I have been thinking about Rosenbaum-type tests for sensitivity to hidden bias. Hidden bias is pervasive in observational settings and these sensitivity tests are a tool to deal with it. When done with your inference, it seems constructive to replace the usual qualitative statement that hidden bias “may be a problem” with a precise quantitative statement like “in order to account for my estimated effect, a hidden bias has to be of magnitude X.” No?

Imagine you are (once again) estimating the causal effect of smoking on cancer and you have successfully adjusted for differences in observed covariates. Then you estimate the “causal” effect of smoking and you’re done. But wait a minute. Maybe subjects who appear similar in terms of their observed covariates actually differ in terms of important unmeasured covariates. Maybe there exists a smoking gene that causes cancer and makes people smoke. Did you achieve balance on the smoking gene? You have no clue. Are your results sensitive to this hidden bias? How big must the hidden bias be to account for your findings? Again, you have no clue (and so neither does the reader of your article).

Enter Rosenbaum-type sensitivity tests. These come in different forms but the basic idea is similar in all of them. We have a measure, call it (for lack of latex in the blog) Gamma, which gives the degree to which your particular smoking study may deviate from a study that’s free of hidden bias. You assume that two subjects with the same X may nonetheless differ in terms of some unobserved covariates, so that one subject has odds of receiving the treatment that are up to Gamma ≥ 1 times greater than the odds for another subject. So, for example, Gamma=1 would mean your study is indeed free of hidden bias (like a big randomized experiment), and Gamma=4 means that two subjects who are similar on their observed X can differ on unobservables such that one could have four times the odds of receiving treatment as the other.

The key idea of the sensitivity test is to specify different values of Gamma and check if the inferences change. If your results break down at Gamma values just above 1 already, this is bad news. We probably should not trust your findings, because the difference in outcome data you found may not be caused by your treatment but may instead be due to an unobserved covariate that you did not adjust for. But if the inferences hold at big values of Gamma, let’s say 7, then your results seem very robust to hidden bias. (That’s what happened in the smoking on cancer case btw). Sensitivity tests allow you to shift the burden of proof back to the critic who laments about hidden bias: Please, Mr. Knows it all, go and find me this “magic omitted variable” which is so extremely imbalanced but strongly related to treatment assignment that it is driving my results.
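
To make the procedure concrete, here is a rough sketch for the simplest case I know of: matched pairs analyzed with a sign test. If hidden bias is at most Gamma and the treatment has no effect, the chance that the treated subject in a pair is the one with the worse outcome is at most Gamma/(1+Gamma), so a binomial tail with that probability bounds the p-value from above. The study size and counts below are invented, the function name `worst_case_p_value` is hypothetical, and the sketch assumes there are no tied pairs.

```python
from math import comb


def worst_case_p_value(n_pairs, n_treated_worse, gamma):
    """Upper bound on the one-sided sign-test p-value when hidden bias is at most gamma."""
    p_max = gamma / (1.0 + gamma)  # largest chance the treated subject looks worse
    return sum(
        comb(n_pairs, k) * p_max ** k * (1.0 - p_max) ** (n_pairs - k)
        for k in range(n_treated_worse, n_pairs + 1)
    )


# Hypothetical study: 100 matched pairs without ties, and in 70 of them
# the treated subject has the worse outcome.
for gamma in [1.0, 1.5, 2.0, 3.0]:
    bound = worst_case_p_value(100, 70, gamma)
    print(f"Gamma = {gamma}: p-value at most {bound:.4f}")
```

At Gamma = 1 this is the ordinary sign test. The value of Gamma at which the bound crosses your chosen significance level is one way to report how large a hidden bias would have to be to explain away the result.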

More on this subject in a subsequent post.

Posted by Jens Hainmueller at 4:24 AM

November 28, 2005

Bayesian Models of Human Learning and Reasoning: A Recap

Drew Thomas

An MIT tag team of Prof. Josh Tenenbaum and his graduate student Charles Kemp presented their research to the IQSS Research Workshop on Wednesday, October 19. The overarching topic of Prof. Tenenbaum's research is machine learning; one major aspect of this is their method of categorizing the structure of the field to be learned.

For example, it has made sense for hundreds of years that forms of life could be taxonomically identified according to a tree structure so as to compare the closeness of two species, and it also makes some sense to rank them on an ordered scale by some other characteristic (one example presented was how jaw strength could be used to generalize to total strength.) The presenters then showed how Bayesian inference could be used to determine what organizational structures are best suited to which systems, based on a set of covariates corresponding to certain observable features, which could then be used to make other comparisons that might not be as evident, such as immune system behaviour.

What confused me for much of the time was their insistence that they could use the data to decide on a prior distribution, an idea that set some alarms off. I have been under the strongest of directives from professors to keep the prior distribution limited to prior knowledge. My current understanding is that the following method is used:

1. Choose a family to examine, such as the tree, ring or clique structure (all of which, notably, can be learned by kindergarteners rather quickly.)

2. Conduct an analysis where the prior distribution is an equal likelihood of structure corresponding to all possible formations of this type.

3. Repeat this with the other relevant families. Those analyses with the most favorable results would then correspond to the most likely structure.

4. Conduct further research on the system with the knowledge that one structure family is superior for this description.

While I'm not as comfortable with their use of a data-driven prior distribution as they'd like, it seems that the authors are sensitive enough to this concern to keep actual structures separate and to use the data only to confirm their heuristic interpretations of the structures at hand, which sets me more at ease.

Now, the key to this research is that this is a model for human learning - and wouldn't you know it, we're better at it than computers. But I'm still very encouraged at the direction in which this is heading, and am looking forward to later reports from the Tenenbaum group.

Posted by Andrew C. Thomas at 2:23 AM

November 23, 2005

AIDS And African Economies

Eric Werker (guest author)

I enjoyed the chance to present a work in progress that attempts to measure the impact of AIDS on the economies and populations in Africa at the Applied Statistics Workshop on Wednesday, November 9. Given the possibility for some omitted variable to influence both the national AIDS rate and economic performance or some other outcome variable, I chose to pursue an instrumental variable strategy using variations in the male circumcision rate (which the bulk of the medical literature on this subject believes to have a causal impact on the spread of HIV/AIDS). Comments from the audience were useful and illuminating, and the debate was most interesting around potential violations of the exclusion restriction as well as the use of 2SLS in a small sample setting.

(Blogger's note: For more on this talk, see here and here.)

Posted by James Greiner at 5:47 AM

"Harvard for Less"

This recent article in the New York Times talks about the growing number of traditional college-aged students pursuing degrees through continuing education programs, including the Harvard Extension School. This year, for the first time some of the Gov Department methods courses are being offered through DCE. I can't speak for the rest of the university, but at least in our little corner of it, there is no difference between the distance and traditional versions of the course that I TF (Gov 2000). In the spring term, Gov 2001 (Advanced Quantitative Research Methodology) will be offered through DCE as well.

Posted by Mike Kellermann at 5:00 AM

November 22, 2005

Experts and Trials III: More Noise

John Friedman

In my previous two posts here and here, I discussed some of the game-theoretic reasons why lawyers' choice of experts in cases might only add noise to the process. In this post, I will draw on my own experience on a jury, evaluating expert witnesses, to speak to further pitfalls in our system.

First, some background on my case: I was on a jury for a medical malpractice trial, essentially deciding whether a tumor, which later killed the patient, should have been spotted on an earlier X-ray. The "standard of care" to which we were to hold the doctors in question was a completely relative metric: Did the doctors provide the level of care "expected" from the "ordinary" practicing radiologist? Predictably, radiologists testified for both the plaintiff and the defense, each claiming that it was obvious that the defendants violated/met the relevant standard of care.

My position, as might be expected given my earlier posts, was that these two experts, on net, provided very little information on the culpability of the defendants. For all I knew, 99% of qualified doctors could have believed these doctors were negligent, or not negligent - how would I ever know? Since my prior was uninformative in this case, I had no choice but to find for the defendants for lack of evidence in either direction.

My fellow jurors, however, had far stronger opinions. Many tended to believe or disbelieve an expert witness for irrelevant reasons. For instance, the physical attractiveness, speech pattern, and general "likeability" played a great role. Furthermore, the experts usually made or lost ground on their ability to explain the basics of the science underlying the issue at hand - the mechanics of an X-ray, for instance - to the jury. Of course, these basics were not in dispute by any party in the case. And, as any student at Harvard University knows, a witness's ability to clearly and succinctly explain the basics need not be related at all to her expertise in the field! That these facts influence juries should be of no surprise to anyone familiar with trials; the existence of an entire industry of "jury consultants," the legal equivalent of marketing professionals, should be evidence enough that these issues of presentation matter a great deal.

Finally, even after the experts presented their cases, the priors of some jurors seemed to greatly affect their opinions of the case. Though jurors are screened for such biases, the test cannot be perfect. I often found jurors relating personal experiences with radiologists as evidence for one side or another. Given my arguments above about the lack of information from experts, perhaps it is not surprising that priors mattered as they did, but this seemed to further add noise into the process.

In the end, I supported my jury's decision in this case. But I could not help feeling that it was simply by random chance, by a peculiar confluence of misinterpretation and biases, that we had reached the right decision.

Posted by James Greiner at 4:03 AM

November 21, 2005

Occam's Razor And Thinking about Evolution

Amy Perfors

I'm fascinated by the ongoing evolution controversy in America. Part of this is because as a scientist I realize how important it is to defend rational, scientific thinking -- meaning reliance on evidence, reasoning based on logic rather than emotion, and creating falsifiable hypotheses. I also recognize how deeply important it is that our students are not crippled educationally by not being taught how to think this way.

But from the cognitive science perspective, it's also interesting to try to understand why evolution is so unbelievable and creationism so logical and reasonable to many fairly intelligent laypeople. (I doubt it's just ignorance or mendacity!) What cognitive heuristics and ways of thinking cause this widespread misunderstanding?

There are probably a number of things. Two I'm not going to talk about include emotional reasons for wanting not to believe in evolution as well as the tendency for people who don't know much about either side of an issue to think the fair thing to do is "split the middle" and "teach both sides." The thing I do want to talk about today -- the one that's relevant to a statistical social science blog -- concerns people's notions of simplicity and complexity. My hypothesis is that laypeople and scientists probably apply Occam's Razor to the question of evolution in very different ways, which is part of what leads to such divergent views.

[Caveat: this is speculation; I don't study this myself. Second caveat: I am neither saying that it's scientifically okay to believe in creationism, nor that people who do are stupid; this post is about explaining, not justifying, the cognitive heuristics we use that make evolution so difficult to intuitively grasp].


Occam's Razor is a reasoning heuristic that says, roughly, that if two hypotheses both explain the data fairly well, the simpler is likely to be better. Simpler hypotheses, generally formalized as those with fewer free parameters, don't "overfit" the data too much and thus generalize to new data better. Simpler models are also better because they make strong predictions. Such models are therefore falsifiable (one can easily find something they don't predict, and see if it is true) and, in probabilistic terms, put a lot of the "probability mass" or "likelihood" on a few specific phenomena. Thus, when such a specific phenomenon does occur, simpler models explain it better than a more complex theory, which spreads the probability mass over more possibilities. In other words, a model with many free parameters -- a complicated one -- will be compatible with many different types of data if you just tweak the parameters. This is bad because it then doesn't "explain" much of anything, since anything is consistent with it.
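
A toy example may make the "probability mass" point concrete. The sketch below is my own illustration, not anything drawn from the evolution debate: it compares a simple model (a fair coin, no free parameters) with a flexible one (a coin whose bias is a free parameter, averaged over a uniform prior). The function names and the choice of 10 flips are arbitrary assumptions.

```python
from math import comb


def fair_coin(heads, flips):
    """P(exactly `heads` heads) under a fair coin: mass concentrated near flips / 2."""
    return comb(flips, heads) / 2 ** flips


def free_bias(heads, flips):
    """P(exactly `heads` heads) when the bias is free and uniform on [0, 1].

    Averaging the binomial probability over the uniform prior gives
    1 / (flips + 1): every possible count is equally likely.
    """
    return 1.0 / (flips + 1)


for heads in [5, 9]:
    simple, flexible = fair_coin(heads, 10), free_bias(heads, 10)
    print(f"{heads} heads in 10 flips: fair coin {simple:.3f}, free bias {flexible:.3f}")
```

The flexible model is never badly wrong, because it spreads its mass evenly over every outcome, but for the same reason it never predicts any particular outcome strongly. The fair coin stakes most of its mass on a few counts and is rewarded when one of them occurs.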

When it comes to evolution and creationism, I think that scientists and laypeople often make exactly the opposite judgments about which hypothesis is simple and which is complex; therefore their invocation of Occam's Razor results in opposite conclusions. For the scientist, the "God" hypothesis (um, I mean, "Intelligent Designer") is almost the prototypical example of a hypothesis that is so complex it's worthless scientifically. You can literally explain anything by invoking God (and if you can't, you just say "God works in mysterious ways" and feel like you've explained it), and thus God scientifically explains nothing. [I feel constrained to point out that God is perfectly fine in a religious or spiritual context where you're not seeking to explain the world scientifically!] This is why ID is not approved by scientists; not because it's wrong, but because it's not falsifiable -- the hypothesis of an Intelligent Designer is consistent with any data whatsoever, and thus as theories go ... well, it isn't one, really.

But if you look at "simplicity" in terms of something like number of free parameters, you can see why a naive view would favor ID over evolution. On a superficial inspection, the ID hypothesis seems like it really has only one free parameter (God/ID exists, or not); this is the essence of a simple hypothesis. By contrast, evolution is complicated - though the basic idea of natural selection is fairly straightforward, even that is more complicated than a binary choice, and there are many interesting and complicated phenomena arising in the application of basic evolutionary theory (sympatric vs. allopatric speciation, the role of migration and bottlenecks, asexual vs sexual reproduction, different mating styles, recessive genes, junk DNA, environmental and hormonal effects on genes, accumulated effects over time, group selection, canalization, etc). The layperson either vaguely knows about all of this or else tries to imagine how you could get something as complicated as a human out of "random accidents" and concludes that you could only do so if the world was just one specific way (i.e. if you set many free parameters just exactly one way). Thus they conclude that it's therefore an exceedingly complex hypothesis, and by Occam's Razor one should favor the "simpler" ID hypothesis. And then when they hear scientists not only believe this apparently unbelievable thing, but refuse to consider ID as a scientific alternative, they logically conclude that it's all just competing dogma and you might as well teach both.

This is a logical train of reasoning on the layperson's part. (Doesn't mean it's true, but it's logical given what they know). The reason it doesn't work is twofold: (a) a misunderstanding of evolution as "randomness"; seeing it as a search over the space of possible organisms is both more accurate and more illuminating, I think; and (b) misunderstanding the "God" hypothesis as the simple one.

If I'm right that these are among the fundamental errors the layperson makes in reasoning about evolution, then the best way to reach the non-mendacious, intelligent creationist is by pointing out these flaws. I don't know if anybody has studied whether this hunch is correct, but it sure would be fascinating to find out what sorts of arguments work best, not just because it would help us argue effectively on a national level, but also because it would reveal interesting things about how people tend to use Occam's Razor in real-life problems.

Posted by Amy Perfors at 4:04 AM

November 18, 2005

British Ideal Points

Mike Kellermann

We have talked a bit on the blog (here and here) about estimating the ideal points of legislators in different political systems. I've been doing some work on this problem in the United Kingdom, adapting an existing Bayesian ideal point model in an attempt to obtain plausible estimates of the preferences of British legislators.

The basic Bayesian ideal point model assumes that politicians have quadratic preferences over policy outcomes; this implies that they will support a proposal if it implements a policy closer to their ideal point than the status quo. Let q_i be the ideal point of legislator i, m_j be the location of proposal j, and s_j be the location of the status quo that proposal j seeks to overturn. The (random) utility for legislator i of voting for proposal j can thus be written as:

s_j^2 - m_j^2 + 2 q_i (m_j - s_j) + e_{ij}

Or re-written as

a_j + b_j q_i + e_{ij}

where a_j = s_j^2 - m_j^2 and b_j = 2 (m_j - s_j).

With the appropriate assumptions on the stochastic component, this is just a probit model with missing data in which the legislator votes in favor of the proposal when the random utility is positive and against when the random utility is negative. Fitting a Bayesian model with this sampling density is pretty easy, given some restrictions on the priors.
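
As a quick illustration of the sampling density (not of the full Bayesian fit), here is a sketch of the implied vote probabilities. It assumes the error term is standard normal, which is what makes this a probit; the function names and the example locations are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def prob_yea(q, m, s):
    """P(a legislator with ideal point q votes for proposal m over status quo s)."""
    a = s ** 2 - m ** 2   # proposal-specific intercept
    b = 2.0 * (m - s)     # proposal-specific slope on the ideal point
    return normal_cdf(a + b * q)


# Hypothetical proposal moving policy to the left, from s = 0.5 to m = -0.5.
for q in [-1.0, 0.0, 1.0]:
    print(f"ideal point {q:+.1f}: P(yea) = {prob_yea(q, m=-0.5, s=0.5):.3f}")
```

A legislator exactly halfway between the proposal and the status quo is indifferent, so the vote comes down to the sign of the error term.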

Unfortunately, applying this model to voting data in the British House of Commons produces results that lack face validity. The estimates for MPs known to be radical left-wingers are located in the middle of the political spectrum. Party discipline is the problem; the influence of the party whips (which is missing from the model) overwhelms the policy utility.

I try to address this problem by moving to a different source of information about legislative preferences. Early Day Motions allow MPs to express their opinions without being subject to the whips. EDMs are not binding, and can be introduced by any legislator. Other legislators can sign the EDM to indicate their support. There are well over 1000 EDMs introduced every year, which greatly exceeds the number of votes in the Commons.

We can't just apply the standard ideal point model to EDM data, however, because there is no way for MPs to indicate opposition to the policy proposed in an EDM. Instead of 'yea' and 'nay', one observes 'yea' or nothing. In particular, it is clear that some Members of Parliament are less likely to sign EDMs, regardless of their policy content. I model this by adding a cost term c_i to legislator i's random utility:

s_j^2 - m_j^2 + 2 q_i (m_j - s_j) + c_i + e_{ij}

This is a more realistic model of the decision facing legislators in the House of Commons. In this model, the proposal parameters are unidentified; I restrict the posterior distribution for these parameters by assuming a prior distribution that assumes the sponsors of EDMs make proposals that are close to their ideal points.
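
To see what the cost term does, here is a small sketch of the implied signing probability, again assuming a standard normal error term. Because the cost term enters the utility additively, a more negative value makes an MP less likely to sign whatever the content of the motion. The function name and all numbers are made up for illustration.

```python
from math import erf, sqrt


def prob_sign(q, m, s, c):
    """P(an MP with ideal point q and cost term c signs an EDM proposing m over s)."""
    utility = s ** 2 - m ** 2 + 2.0 * (m - s) * q + c
    return 0.5 * (1.0 + erf(utility / sqrt(2.0)))  # standard normal CDF


# Two hypothetical MPs with the same ideal point but different cost terms.
keen = prob_sign(q=-1.0, m=-0.5, s=0.5, c=0.0)
reluctant = prob_sign(q=-1.0, m=-0.5, s=0.5, c=-1.5)
print(f"keen signer: {keen:.3f}, reluctant signer: {reluctant:.3f}")
```

The point of the extra term is that an unsigned EDM no longer has to be read as policy opposition; it may simply reflect an MP who rarely signs anything.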

I'm still finalizing the results using data from the 1997-2001 Parliament, but the results on a subset of the data seem promising; left-wingers are on the left, right-wingers are on the right, and the (supposed) centrists are in the center. These estimates have much greater face validity than those generated from voting data.

If you are interested in this topic, I am going to be presenting my preliminary results at the G1-G2 Political Economy Workshop today (Friday, November 18) at noon in room N401. By convention, it is grad students only, so I hope there are not too many disappointed faculty out there (sure...).

Posted by Mike Kellermann at 3:09 AM

November 17, 2005

Social Science and Litigation, Part IV

Jim Greiner

In previous blog entries here, here, and here, I discussed the fundamental questions about the objectivity of expert witnesses raised by Professor of History Morgan Kousser's article entitled "Are Expert Witnesses Whores?".

In my view, Professor Kousser's article suggests that expert witnesses are not fully aware of the threat to their objectivity that the litigation poses. For example, despite acknowledging that lawyers "peform[ed] most of the culling of primary sources" in the cases in which he offered testimony, Professor Kousser argues, for a number of reasons, that there was no threat to objectivity. Primary among these reasons was the adversarial process, which gave the other side an incentive to find adverse evidence and arguments, and thus an incentive for an expert's own attorneys to share such evidence and arguments.

Professor Kousser's reasoning dovetails with private conversations I've had with social scientists about litigation experiences, who also insisted that they retained their objectivity throughout. Invariably, they support this contention by describing critical moments during pre-trial preparation in which they refused requests from their attorneys to testify to something, saying that the requests pushed the data too far or contradicted their beliefs.

My response: think about what the attorneys had already done to your objectivity before you reached these critical moments. Might they even have pushed you into refusing so as to convince you of your own virtue?

Professor Kousser and other social scientists have misperceived the nature of the threat. Professor Kousser is correct when he suggests that lawyers, upon encountering a potentially damaging piece of source material or evidence within an expert's area, are unlikely to suppress it (in the hope that the other side is negligent). But we lawyers do accompany our transmission of the potentially damaging item with rhetoric about its lack of reliability, importance, or relevance. Similarly, when we prepare experts for deposition and trial, we do not avoid adverse arguments or potential weaknesses in reasoning. Instead, we raise them in a way that minimizes their impact. Often, we (casually) use carefully tailored, ready-made rhetorical phrases about the issue, hoping to hear those phrases again at trial. Before conducting pretrial meetings with important experts, we meet amongst ourselves to decide how best to ask questions and discuss issues to "prop up" experts' resolve.

Social scientists have long known that the way a questioner phrases an inquiry affects the answer received, that the way in which a conversational subject is raised affects the opinions discussants will form. Perhaps social scientists believe that their knowledge of these phenomena makes them immune to such effects. My experience in prepping social scientist expert witnesses suggests that such is not the case.

Posted by SSS Coauthors at 2:54 AM

November 16, 2005

Fun with bad graphs

Just read this nice entry on Andrew Gelman's blog about junk graphs. Somebody complemented the entry by posting a link to another site by Karl Broman in the Biostatistics department of Johns Hopkins. In case you missed this, please take a look. We all make these mistakes, but it's actually really funny...

Posted by Jens Hainmueller at 2:09 PM

Experts and Trials II: True Opinion & Slant

John Friedman

I ended my last post by showing, in the context of the brief model I sketched, what the optimal outcome would look like. In practice, though, the court suffers from two problems.

First, it cannot conduct a broad survey, but must instead rely on those testimonies presented in court. Each side will offer an expert whose "true opinion" is as supportive of their argument as possible, regardless of whether that expert is at all representative of commonly accepted views in the field. Second, the court cannot distinguish between an expert's true opinion and her "slant." Experts probably suffer some cost for slanting their views away from their true opinions, so one should not expect most slants to be large. But the legal parties will look to pick experts who suffer as little cost from slanting as possible, so that, in equilibrium, the slants could be quite large.

Given these strategies from the legal parties, what does the court see? Each side presents an expert (or slate of experts) with the most favorable combination of "true opinion" and "slant." Even if the court could disentangle the two components of testimony, it would only see the endpoints of the distribution of "true opinions" among the potential pool of experts. But since it cannot even distinguish the slant, the court actually sees only a noisy signal of the extremes of the distribution.

Finally, I have already argued that the experts chosen will be those most able (or willing) to slant their opinions, so that the ratio of signal to noise (that is, of "true opinion" to "slant") for the experts will be very low, in expectation. When the court performs the required signal extraction problem, very little signal remains. Because of the optimizing action of each party, the court will draw very little inference from any of the witnesses in many cases, ironically nullifying the effect of the efforts of the experts. No one deviates from this strategy, though; if one side presented a more representative expert, while the other played the old strategy, the evidence would appear lopsided.
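One way to see why little signal remains is a textbook normal signal extraction. This is only a sketch under assumptions not made in the post: "true opinion" and "slant" are treated as independent, mean-zero normal quantities, so the court's best guess of the true opinion is the testimony scaled by the share of variance due to the signal. Both function names are hypothetical.

```python
def signal_weight(var_true_opinion, var_slant):
    """Share of testimony variance that comes from the true opinion."""
    return var_true_opinion / (var_true_opinion + var_slant)

def court_inference(testimony, var_true_opinion, var_slant):
    """Court's expected true opinion given testimony = true opinion + slant.

    Assumption: both components are independent, mean-zero normals.
    """
    return signal_weight(var_true_opinion, var_slant) * testimony

# With no slant, testimony is taken at face value.
weight_no_slant = signal_weight(1.0, 0.0)
# When slant varies nine times as much as true opinion, 90% is discounted.
weight_heavy_slant = signal_weight(1.0, 9.0)
discounted = court_inference(5.0, 1.0, 9.0)
```

The more the parties select experts for their willingness to slant, the larger the slant variance and the smaller the weight the court can rationally place on any testimony.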

I noted in my last post that the "first-best," or socially optimal solution, would be for the court to collect a representative sample of the opinions of experts for its decision. Even when the parties present their own experts, each side would be better off if they could somehow commit not to use "slant" in their experts' opinions, since the decision in the case would be less noisy. But the structure of the problem makes such an agreement impossible.

Jim is correct when he remarks that, given the adversarial nature of the legal system, expert testimony could not happen any other way. We should not celebrate this fact, though; rather, we should mourn it. We are stuck in a terrible equilibrium.

Posted by James Greiner at 4:59 AM

November 15, 2005

Spatial Error

Sebastian Bauhoff

This entry follows up on earlier ones here and here on spatial statistics and spatial lag, and discusses another consequence of spatial dependence. Spatial error autocorrelation arises if error terms are correlated across observations, i.e., the error of an observation affects the errors of its neighbors. It is similar to serial correlation in time series analysis and leaves OLS coefficients unbiased but renders them inefficient. Because it's such a bothersome problem, spatial error is also called "nuisance dependence in the error."

There are a number of instances in which spatial error can arise. For example, similar to what can happen in time series, a source of correlation may come from unmeasured variables that are related through space. Correlation can also arise from aggregation of spatially correlated variables and systematic measurement error.

So what to do if there is good reason to believe that there is spatial error? Perhaps the most famous test is Moran's I, which is based on the regression residuals and is related to Moran's scatterplot of residuals, which can be used to spot the problem graphically. There are other statistics, such as Lagrange multiplier and likelihood ratio tests, each of which gets at the same problem in a different way. If there is good reason to believe that spatial error is a problem, then the way forward is either to model the error directly or to use autoregressive methods.
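For readers who want to see the statistic itself, here is a minimal sketch of Moran's I computed on a vector of residuals. The neighbour matrix, the toy residuals, and the function name are all illustrative assumptions; the sketch only computes the statistic and does not carry out the significance test.

```python
def morans_i(residuals, weights):
    """Moran's I for residuals under a spatial weights matrix.

    weights[i][j] is positive when units i and j are neighbours, else 0.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products between neighbouring residuals.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)

# Four units in a row; each is a neighbour of the units next to it.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# Similar residuals sit side by side: positive autocorrelation.
clustered = morans_i([1.0, 1.0, -1.0, -1.0], chain)
# Residuals flip sign between neighbours: negative autocorrelation.
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)
```

Values well above zero suggest that neighbouring residuals resemble each other, which is the pattern spatial error would produce.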

In any case, it's probably a good idea to assess whether spatial error might apply to your research problem. Because of its effect on OLS, there might be a better way to estimate the quantity you are interested in, and the results might improve quite a bit.

Posted by James Greiner at 3:54 AM

November 14, 2005

Applied Statistics - You, Jong-Sung

This week, the Applied Statistics Workshop will present a talk by You, Jong-Sung, a PhD candidate in Public Policy at the Kennedy School of Government. Jong-Sung’s dissertation on “corruption, inequality, and social trust” explores how corruption and inequality reinforce each other and erode social trust. His dissertation chapter on a cross-national study of the causal effect of inequality on corruption was published in ASR (February 2005) as an article with S. Khagram. His research interests include the comparative politics and political sociology of corruption and anti-corruption reform and the political economy of inequality and social policy. Before coming to Harvard, he worked for an NGO in Korea, “Citizens’ Coalition for Economic Justice”, as Director of Policy Research and General Secretary. He spent more than two years in prison because of his involvement in the democratization movement under military regimes. He has a BA in social welfare from Seoul National University, and an MPA from KSG. He is also one of the authors of this blog.

The talk is entitled “A Multilevel Analysis of Correlates of Social Trust: Fairness Matters More Than Similarity,” and draws on Jong-Sung’s dissertation research. The abstract follows on the jump:

I argue that the fairness of a society affects its level of social trust more than does its homogeneity. Societies with fair procedural rules (democracy), fair administration of rules (freedom from corruption), and fair (relatively equal and unskewed) income distribution produce incentives for trustworthy behavior, develop norms of trustworthiness, and enhance interpersonal trust. Based on a multi-level analysis using the World Values Surveys data that cover 80 countries, I find that (1) freedom from corruption, income equality, and mature democracy are positively associated with trust, while ethnic diversity loses significance once these factors are accounted for; (2) corruption and inequality have an adverse impact on norms and perceptions of trustworthiness; (3) the negative effect of inequality on trust is due to the skewness of income rather than its simple heterogeneity; and (4) the negative effect of minority status is greater in more unequal and undemocratic countries, consistent with the fairness explanation.

Posted by Mike Kellermann at 11:55 AM

Considering Spatial Dependence in Lattice Data: Two Views

Drew Thomas

Last year during Prof. Rima Izem's Spatial Statistics course, I started to wonder about different analytical techniques for comparing lattice data (say voting results, epidemiological information, or the prevalence of basketball courts) on a map with distinct spatial units such as counties.

A set of techniques had been demonstrated to determine spatial autocorrelation through the use of a fixed-value neighbour matrix, with one parameter determining the strength of the autocorrelation. The use of the fixed neighbour matrix perturbed me somewhat, since the practice of geostatistics uses a tool called the empirical variogram - a functional estimate of variance between sample sites through a regression, based on taking each possible pair of points and computing the squared difference between their values - which might give a more reasonable estimate of autocorrelation than a simpler model.
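As a rough illustration of the pairwise calculation behind an empirical variogram, the sketch below groups every pair of sites into distance bins and reports half the mean squared difference in each bin. The binning rule, the toy sites, and the function name are assumptions made for the example; fitting a variogram function to these points is a separate step not shown here.

```python
import math
from collections import defaultdict

def empirical_variogram(coords, values, bin_width):
    """Half the mean squared difference between site values, by distance bin.

    Returns a dict mapping bin index k (distances in [k*w, (k+1)*w))
    to the semivariance estimate for that bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(coords[i], coords[j])
            b = int(distance // bin_width)
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    return {b: sums[b] / (2.0 * counts[b]) for b in sorted(counts)}

# Three sites on a line whose values rise steadily with position.
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gamma = empirical_variogram(sites, [0.0, 1.0, 2.0], bin_width=1.0)
```

In this toy case the semivariance grows with distance, which is the signature of values that are more alike when sites are closer together.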

As it turned out, this same question was asked by Prof. Melanie Wall from the Biostatistics Department at the University of Minnesota about a year before I got around to it. In her paper "A close look at the spatial structure implied by the CAR and SAR models" (J. Stat. Planning and Inference, v121, no.2), Prof. Wall tests the idea of using a variogram approach to model spatial structure on SAT data against more common lattice models. And what do you know - the variogram approach holds up to scrutiny. In some cases it outperforms the lattice model, such as in the extreme case of Tennessee and Missouri, which have a bizarrely low correlation due to the fact that each state has eight neighbours.

As well as feeling relief that this difficulty with the model wasn't just in my imagination, I'm glad to see that this type of inference crosses so many borders.

Posted by Andrew C. Thomas at 3:37 AM

November 10, 2005

Experts and Trials I: Game Theory

John Friedman

No sooner had the recent posts on this blog by Jim Greiner about the use of statistics and expert witnesses in trials (see here and here, as well as yesterday's post) piqued my curiosity than I was empanelled on a jury for a 5-day medical malpractice trial. This gave me ample time to think through some of the issues of statistics and the law. I will spend my next posts discussing these issues from three different perspectives: the game-theoretic, the experiential, and the historical.

I first approach this problem from a game-theoretic framework. In Jim's second post, he spoke about how, in our adversarial legal system, an expert for one side tends to interpret the facts in the way most favorable for that side, without compromising her "academic integrity." He then listed several reasons why this might actually be best for the system. I tend to disagree on this final point; instead, I believe the adversarial nature of the system pushes us into a very bad situation.

To give my argument focus, we must first pin down the concept of "equilibrium." An equilibrium of a game is a strategy for each player such that, given the other players' strategies, the player is maximizing her return from the game. In this case, the game is relatively simple: Two parties to a lawsuit are the players, each with a set of expert testimonials interpreting the relevant statistics in the case (which makes up the strategy). We can represent the net message from the expert testimony for each side as a number on the real line: The more positive the number, the more pro-plaintiff the testimony.

We must make some simplifying assumptions to analyze this problem. Let us assume that the testimony for each side comprises two components: the "true opinion" and the "slant." When added together, "true opinion" + "slant" = testimony. (For simplicity, let us assume that these numbers are the actual impact of the testimony. Thus, if a testimony seems too biased and is discounted, the true number would not lie far from zero.) In an ideal world, the court (either judge or jury) would survey the "true opinions" of many experts in the field; if enough opinions were positive, the case would go for the plaintiff. Economists often refer to such a case as the "first-best," the socially optimal outcome.

Many games do not yield the socially optimal outcome, though. Both parties can even be worse off playing the equilibrium strategies than if each played some other strategy, despite the fact that each party maximizes her payoff given the other player's strategy. A classic example of such a situation is the "Prisoner's Dilemma." In my next post, I will explore how, in this legal setting, exactly this tragedy occurs.
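For readers who have not seen it worked through, here is a small sketch that checks every strategy pair in a Prisoner's Dilemma against the definition of equilibrium given above. The payoff numbers are the usual illustrative textbook values, not anything taken from the legal setting, and the function name is hypothetical.

```python
# Payoffs are (row player, column player); higher is better.
payoffs = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}
strategies = ["cooperate", "defect"]

def is_equilibrium(row_choice, col_choice):
    """True if neither player gains by changing strategy unilaterally."""
    row_payoff, col_payoff = payoffs[(row_choice, col_choice)]
    row_ok = all(payoffs[(alt, col_choice)][0] <= row_payoff
                 for alt in strategies)
    col_ok = all(payoffs[(row_choice, alt)][1] <= col_payoff
                 for alt in strategies)
    return row_ok and col_ok

equilibria = [(r, c) for r in strategies for c in strategies
              if is_equilibrium(r, c)]
```

With these payoffs the only equilibrium is mutual defection, even though both players would earn more under mutual cooperation.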

Posted by James Greiner at 5:50 AM

November 9, 2005

Social Science and Litigation, Part III

Jim Greiner

Continuing with the theme of quantitative social science expert witnesses in litigation introduced here and here, I shift gears to consider the experts' view of lawyers. Several expert witnesses with whom I have spoken confided that they often form low opinions of the lawyers who retain them. One common complaint is that the attorneys do not take the time to understand the guts of the issue experts were hired to examine. Another is that lawyers are uncommunicative and provide poor guidance as to their preferences for the testimony of experts.

Without question, poor lawyering is common, and some of what experts experience can be safely attributed to this source. But as was the case with lawyers' complaints about experts, experts' complaints about lawyers have their genesis partly in the structural rules that govern litigation. In most courts and jurisdictions, communications between a testifying expert and any other participant in the case (lawyer, fact witness, another expert) are discoverable. That means that, before trial, the other side is entitled to request, for example, copies of all email communications between lawyer and expert. In deposition, an expert may be questioned on telephone and other oral conversations with the retaining attorney. For this reason, good lawyers are careful about what they say to experts; they know that written or transcribed communications reach both parties to a case.

As is usually the situation, there are good reasons for this rule. An expert witness is one of the most dangerous creatures to enter a courtroom. By definition, he or she invariably knows more about the subject matter of the testimony than anyone else involved in the litigation, except perhaps the opposing expert. The judge and jury lack the knowledge and training to assess what the expert says. Thus, the law provides that experts must disclose anything that might form the basis of an expert's opinion, including communications with trial counsel (along with workpapers, references consulted, and other items).

Expert witness frustration aside, this discovery rule has other negative side effects; it affects not only how well lawyers prepare a case for trial, but also the treatment of the suit more generally. Parties and their attorneys need information to settle, and a lack of clear communication between lawyer and expert may cause the former to misjudge the settlement value of a case. Once again, we see how atypical Professor Kousser's experience as an expert was (see here), as lawsuits concerning the internal structure of a municipality or a state entity settle less often than, say, employment discrimination class actions.

In closing, a word to potential and actual social science expert witnesses: If you find yourself frustrated by a certain reticence or irrational exuberance on the part of the attorney retaining you, remember, there may be good reason for it.

Posted by SSS Coauthors at 2:50 AM

November 8, 2005

Creative Instruments

Sebastian Bauhoff

In a recent presentation at Harvard, Caroline Hoxby outlined a paper-in-process on estimating the causal impact of higher education on economic growth in the US states (Aghion, Boustan, Hoxby and Vandenbussche (ABHV) "Exploiting States' Mistakes to Identify the Causal Impact of Higher Education on Growth", draft paper August 6, 2005).

ABHV's paper is interesting for the model and results, and you should read it to get the full story. But the paper is also interesting because of the instrument used to get around the endogeneity of education spending (where rich states spend more on higher education).

The basic idea is as follows: politicians are motivated to channel pork to their constituents in return for support. They do so through appropriations committees that can disburse "earmarked" funds to research-type education. Observing that the membership of these committees is to a large extent random, ABHV have an instrument for research spending (and more instruments for spending on other types of education) and proceed to estimate the causal effect of education on growth. So this paper employs what could be called a political instrument. Of course there are plenty of other classes of IVs, such as natural events (rainfall or natural disasters). But an instrument is only partly distinguished by its ability to fulfill the formal requirements. There's also plenty of scope for creativity.

The IQSS Social Science Statistics blog is soliciting suggestions and requests for instruments: send your favorite IV and its application. Or tell our readers what you always wanted to get instrumented and see if someone comes up with a suggestion.

Posted by Sebastian Bauhoff at 1:08 AM

November 7, 2005

Evolutionary Thoughts on Evolutionary Monte Carlo

Gopi Goswami

Thanks a lot to Mike Kellermann for inviting me over for the talk on Oct 26, 05 at the IQSS (see here for details). I really enjoyed giving the talk and getting interesting comments and questions from the audience. In particular, Prof. Donald Rubin, Prof. Gary King and others made important contributions which I really appreciate. Prof. Kevin Quinn gave me some excellent suggestions on how to improve the structure of the talk, which I think will turn out to be very helpful in the near future when I prepare for the job market. In fact, along those lines, if anyone has any inputs/suggestions/comments on the presentation, please feel free to send them to me.

Here are some afterthoughts on the talk. The PBC (Population Based Clustering) moves I presented, namely, SCSC:TWO-NEW, SCSC:ONE-NEW and SCRC, are new and they are very specific to the sampling-based clustering problem (which lives on a discrete space). I haven't been successful in devising similar moves for the general sampling problem on a continuous space. In the Evolutionary Monte Carlo (EMC) literature these types of moves are also called "cross-over" moves because they take two chromosomes (or states of two chains), which are called two "parents," and implement some cross-over type operation with the parents to produce two chromosomes (or proposed states of two chains), which are called "children."

The main motivation behind devising the above-mentioned moves, as I mentioned in the talk, is that we were looking for moves which propose to update "more than one coordinate but not too many" at a time. The Gibbs sampler proposes updates one coordinate at a time. This is the main reason why Jain and Neal ("A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model," Journal of Computational and Graphical Statistics, vol. 13, no. 1, pp. 158-182, 2004) proposed their sampler, which updates more than one coordinate at a time but does so for too many of them. To counter this problem we proposed the above-mentioned PBC moves, which are a kind of middle ground between the Gibbs sampler and the Jain-Neal sampler.

The other main issue addressed by the two moves, namely, SCSC:ONE-NEW and SCRC, is that "they produce only one new child" after "cross-over." To expand on this, we note that since all the PBC moves, the mentioned ones included, are Metropolis-Hastings type moves, two "children" have to be produced to replace the parents so as to maintain reversibility or detailed balance. But the children produced by two good parents are usually not good enough, and one does not want to throw away some good parent by chance. Thus, it has long been desired to design some moves that both can take advantage of the "cross-over" strategy and can keep some good parent. Our new moves are the first such in the literature.

Lastly, some members of the audience were worried about the temperature placement problem in the parallel tempering setup. Prof. Jun Liu and I proposed a first-cut solution that works in two steps. First, we determine the highest temperature to be used in the ladder, namely, $t_1 = \tau_{max}$. Next, we look at the length and the structure of the ladder, i.e., the placement of the intermediate temperatures within the range $(\tau_{min}, \tau_{max})$. You can find the details in the paper on my website by clicking on "On Real-Parameter Evolutionary Monte Carlo Algorithm (ps file) [submitted]":

Posted by James Greiner at 5:42 AM

November 4, 2005

Measuring Social Trust

You, Jong-Sung

I had a very embarrassing experience when I presented my early draft paper on “Inequality and Corruption as Correlates of Social Trust” at a Work-in-Progress Seminar at the Kennedy School of Government last fall. Professor Edward Glaeser came to my talk, but I did not recognize him, although I had read his articles, including one about “measuring trust.” He asked a question about measurement of social trust without identifying himself. Since I had already talked about the problem of measurement (apparently he did not hear that because he was late) and was about to present my results, I did not want to spend much time on the measurement issue. He was not satisfied with my brief answer and repeated his questions and comments, saying that the typical trust question in surveys, “Do you agree that most people can be trusted or you can’t be too careful?”, may reflect trustworthiness rather than trust “according to a study.” Because I assumed that trust and trustworthiness reinforce one another, I did not think that was a great problem.

Our encounter was an unhappy one for us both. Probably he had an impression that I did not respect him and did not give adequate attention and appreciation to his questions and comments, and I was also kind of annoyed by his repeated intervention. One thing that made things even worse was that I am not a native English speaker; I have particular difficulty with husky voices like his, a difficulty that made the interaction even more problematic. After the talk, I asked him to give the reference for the study on measurement of trust he mentioned. He wrote down Glaeser et al. (2000), and I realized that I had read the article he cited. Even then, I was unaware who he was. I asked a participant of the seminar who he was, and to my surprise, he was Edward Glaeser, the lead author of the article on measuring trust. If I had recognized him, I would have paid much more attention to his questions and comments and tried to answer them better. How big a mistake I made!

Although I still think that the typical trust question captures both trust and trustworthiness, Glaeser et al.’s experimental results may indicate the trust question needs to be designed better. One thing to note in this regard is that caution is not the opposite of trust, as Yamagishi et al. (1999) argued. In my case study of social trust in Korea, I found that inclusion and exclusion of the “being careful” option in trust questions produced substantially different results. More respondents agreed that most people can be trusted when they were simply asked, “Do you think most people can be trusted?” than when they were given the two options “trusting most people” and “being careful.” The average percentage of trusting people was 42.9 per cent for the former type of question, and 32.2 per cent for the latter type. I looked at the GSS, and the same was true there. The trust question was given without the option of being careful once during 1983-87, and 55.7 per cent of respondents agreed that most people can be trusted. When the “being careful” option was given, only 42.1 per cent of respondents did so.

Posted by SSS Coauthors at 5:54 AM

November 3, 2005

Expansion of Economics

John Friedman

In my last post, I wrote about the methodological identity of economics and some of the corresponding advantages. But perhaps the greatest benefit to economists from this definition of the discipline is the great range of subjects on which one can work.

There are, of course, areas of inquiry traditionally dominated by economists – monetary policy, or the profit-maximizing activities of companies, to name a few – and most people connect economics, as a field, to these subjects. Increasingly, though, economists are venturing further afield. Steven Levitt’s best-selling book, Freakonomics, exemplifies this trend, using the tools of economics to investigate corruption in sumo wrestling, cheating in Chicago schools, and ethnic names, to name a few. While Levitt currently sits farther from the mainstream than most economists, his work appears to be not a randomly scattered shot but rather the vanguard of a new generation of scholars.

What are the consequences of this expansion of economics across the social sciences? The increasing incidence of economists working on problems traditionally associated with other fields will, no doubt, create some conflict in the coming years. No local baron, ruling a fiefdom of land or knowledge, savors a challenge over his turf. And the “imperial” economists, many of whom view other fields as weak and primed for colonization, will surely disrespect the vast contributions of non-economists to date. But despite the inevitable (but still unfortunate) conflicts of ego, the majority of these interactions should be not only of great benefit to the world but also a wondrous sight to see. Nothing in academia is quite so spectacular as the collision of two great points of view, obliterating long-held dogmas and, in the heat of debate, forging new paradigms for generations to come.

As a young economist, I look forward to following (and even contributing to) these great arguments to come. And I hope that those of us writing this blog, viewing the questions in social science from diverse perspectives, can give you a look at the current state of these debates.

Posted by James Greiner at 4:00 AM

November 2, 2005

Human Statistical Learning

Amy Perfors

If it's of interest, I will be blogging every so often about the numerous ways that humans seem to be remarkably adept statistical learners. This is a big question in cognitive science for two reasons. First, statistical learning looks like a promising approach to help answer the open question of how people learn as well and as quickly as they do. Second, better understanding how humans use statistical learning may be a good way to improve our statistical models in general, or at least investigate in what ways they might be applied to real data.

One of the more impressive demonstrations of human statistical learning is in the area usually called "implicit grammar learning." In this paradigm, people are presented with strings of nonsense syllables like "bo ti lo fa" in a continuous stream for a minute or two. One of the first examples of this paradigm, by Saffran et al., studied word segmentation -- for example, being able to tell that "the" and "bird" are two separate words, rather than guessing it is "thebird" or "theb" and "ird." If you ever listen to a foreign language, you realize that word boundaries aren't signaled by pauses, which is a huge problem if you're trying to learn the words. Anyway, in the study, syllables occurred in groups of three, thus making "words" like botifa or gikare. As in natural language, there was no pause between words; the only cues to word segmentation were the different transition probabilities between syllables -- that is, "ti" might always be followed by "fa" but "fa" could be followed by the first syllable of any other word. Surprisingly, people can pick up on these subtleties: adults who first heard a continuous stream of this "speech" were then able to identify which three-syllable items were "words" or "nonwords" in the "language" they had just heard. That is, the people could correctly say that "botifa" was a word, but "fagika" wasn't, at an above-chance level. Since the only cues to this information were in the transition probabilities, people must have been calculating those probabilities implicitly (none had the conscious sense they were doing much of anything). Most surprisingly of all, the same researchers demonstrated in a follow-up study that even 8-month-old infants can use these transitional probabilities as cues to word segmentation. Work like this has led many to believe that statistical learning might be one of the most powerful resources infants use during the difficult problem of language learning.
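The transition-probability cue is easy to reproduce on a toy stream. The sketch below is only an illustration: the third word "dupalo", the word order, and the function names are made up for the example, and nothing here is taken from the actual stimuli of the studies.

```python
from collections import Counter, defaultdict

def transition_probabilities(syllables):
    """Estimate P(next syllable | current syllable) from a stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    probs = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        probs[a][b] = n / first_counts[a]
    return probs

def sequence_score(probs, syllables):
    """Product of transition probabilities along a candidate item."""
    score = 1.0
    for a, b in zip(syllables, syllables[1:]):
        score *= probs.get(a, {}).get(b, 0.0)
    return score

# Toy vocabulary; "dupalo" is a hypothetical third word.
words = {
    "botifa": ["bo", "ti", "fa"],
    "gikare": ["gi", "ka", "re"],
    "dupalo": ["du", "pa", "lo"],
}
order = ["botifa", "gikare", "botifa", "dupalo",
         "gikare", "dupalo", "botifa"]
# Concatenate the words with no pauses, as in the experiments.
stream = [s for w in order for s in words[w]]

probs = transition_probabilities(stream)
word_score = sequence_score(probs, ["bo", "ti", "fa"])     # a "word"
nonword_score = sequence_score(probs, ["fa", "gi", "ka"])  # spans a boundary
```

Within-word transitions come out as certain, while transitions across a word boundary are weaker, so a learner tracking only these counts can tell "botifa" from "fagika".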

From the modeling perspective, this result can be captured by Markov models in which the learner keeps track of the string of syllables and the transition probabilities between them, updating the transition probabilities as they hear more data. More recent work has begun to investigate whether humans are capable of statistical learning that cannot be captured by a Markov model -- that is, learning nonadjacent dependencies (dependencies between syllables that do not directly follow each other) in a stream of speech. For instance, papers by Gomez et al. and Onnis et al. provide evidence that discovering even nonadjacent dependencies is possible through statistical learning, as long as the variability of the intervening items is low or high enough. This has obvious implications for how statistical learning might help in acquiring grammar (in which many dependencies are nonadjacent), but it also opens up new modeling issues, since simple Markov models are no longer applicable. What more sophisticated statistical and computational tools are necessary in order to capture our own unconscious, amazing abilities?

Posted by James Greiner at 4:20 AM

November 1, 2005

Judge Alito & Statistics

Jim Greiner

Social science statistics is everywhere. So is law. And both are tangled up with each other. I was forcefully reminded of these facts when my wife pointed out an article about an opinion Samuel Alito (as of yesterday, a nominee to the Supreme Court) wrote while a judge on the United States Court of Appeals for the Third Circuit in a case called Riley v. Taylor. The facts of the specific case, which concerned the potential use of race in peremptory challenges in a death penalty trial, are less important than Judge Alito's approach to statistics and the burden of proof.

Schematically, the facts of the case follow this pattern: Party A has the burden of proof on an issue concerning race. Party A produces some numbers that look funny, meaning instinctively unlikely in a race-neutral world, but conducts no significance test or other formal statistical analysis. The opposing side, Party B, doesn't respond at all, or if it does respond, it simply points out that a million different factors could explain the funny-looking numbers. Party B does not attempt to show that such innocent factors actually do explain the observed numbers, just that they could, and that Party A has failed to eliminate all such alternative explanations.

Such patterns occur over and over again in cases involving employment discrimination, housing discrimination, peremptory challenges, and racial profiling, just to name a few. When discussing them, judges inevitably lament the fact that one side or the other did not conduct a multiple regression analysis, as if that technique would provide all the answers (Judge Alito's Riley opinion is no exception here).

The point is, of course, that how a judge views such cases has almost nothing to do with the facts at bar and everything to do with a judge's priors on the role of race in modern society. For judges who believe that race has little relevance in the thought processes of modern decision makers (employers, landlords, prosecutors, cops), Party A in the above situation must eliminate all potential explanatory factors via (alas) multiple regression in order to meet its burden of production. For judges who believe that race still matters, Party B must respond in the above situation or lose the case. Judge Alito's Riley opinion demonstrates where he stands here.

Is there a middle way? Perhaps. In the above situation, what about requiring some sort of significance test from Party A, but not one that eliminates alternative explanations? In the specific facts of Riley, the number-crunching necessary for "some sort of significance test" is the statistical equivalent of riding a tricycle: a two-by-two hypergeometric with row totals of 71 whites and 8 blacks, column totals of 31 strikes and 48 non-strikes, and an observed value of 8 black strikes yields a p-value of 0.
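To show how little number-crunching is involved, here is a minimal sketch of that two-by-two hypergeometric calculation using only the Python standard library. It takes the margins exactly as stated above and assumes a one-sided test (the probability of seeing at least the observed number of black strikes if strikes were race-neutral); the function name and argument names are mine.

```python
from math import comb

def hypergeom_upper_tail(observed, group_size, draws, population):
    """P(X >= observed) when `draws` items are taken without replacement from
    `population` items, `group_size` of which belong to the group of interest."""
    largest = min(group_size, draws)
    total = comb(population, draws)
    tail = sum(
        comb(group_size, x) * comb(population - group_size, draws - x)
        for x in range(observed, largest + 1)
    )
    return tail / total

# Margins from the post: 71 whites + 8 blacks = 79 jurors; 31 strikes; 8 black strikes.
p_value = hypergeom_upper_tail(observed=8, group_size=8, draws=31, population=79)
print(f"one-sided p-value: {p_value:.6f}")
```

Because all eight black jurors were struck, the upper tail has a single term, and under these assumptions the probability comes out at roughly three in ten thousand.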

Posted by James Greiner at 3:58 AM

October 31, 2005

The Value of Control Groups in Causal Inference (and Breakfast Cereal)

Gary King

A few years ago, I taught the following lesson in my daughter's kindergarten class and my graduate methods class in the same week. It worked pretty well in both. Anyone who has a kid in kindergarten, some good graduate students, or both, might want to try this. It was especially fun for the instructor.

To start, I hold up some nails and ask "does everyone like to eat nails?" The kindergarten kids scream, "Nooooooo." The graduate students say "No," trying to look cool. I say I'm going to convince them otherwise.

I hand out a little magnet to everyone. I ask the class to figure out what it sticks to and what it doesn't stick to. After a few minutes running around the classroom, the kindergartners figure out that magnets stick to stuff with iron in it, and anything without iron in it doesn't stick. The graduate students sit there looking cool.

From behind the table, I pull out a box of Total Cereal (teaching is just like doing magic tricks, except that you get paid more as a magician). I show them the list of ingredients; "iron, 100 percent" is on the list. I ask by a show of hands whether this is the same iron as in the nails. 3 of 23 kindergarten kids say "yes"; 5 of 44 Harvard graduate students say "yes" (almost the same percent in both classes!).

I show the students that the box is sealed (and I have nothing up my sleeves). Then, I open the box, spill some cereal on a cutting board, and smash it up into tiny pieces with a rolling pin. I take the pile of cereal around the room and let the kids put their magnet next to it and see whether the cereal sticks to the magnet. To everyone's amazement, it sticks!

Then I ask, are we now convinced that the iron in the nails is the same iron as in the cereal? All the kids in kindergarten and all the graduate students say "yes."

I respond by saying "but how do you know the cereal stuck to the magnet because it had iron in it? Maybe it was just sticky, like gum or tape." Now that I finally have their attention (not a minor matter with kindergartners), I get to explain to them what a control group is. And from behind the table, I pull out a box of Rice Krispies (which are made of nothing). We examine the side of the box to verify the lack of (much) iron, and then I smash up the Rice Krispies, and let them see if their magnet sticks. It doesn't stick!

Everyone gets to take home a cool fact (they love to eat the stuff in nails), I get to convey the point of the lesson in a way they won't forget (the central role of control groups in causal inference), and everyone gets a free magnet.

Posted by Gary King at 2:18 AM

October 30, 2005

Applied Statistics - Guido Imbens

This week, the Applied Statistics Workshop will present a talk by Guido Imbens of the University of California at Berkeley Department of Economics. Professor Imbens is currently a visiting professor in the Harvard Economics Department and is one of the faculty sponsors of the Applied Statistics Workshop, so we are delighted that he will be speaking to the group. He received his Ph.D. from Brown University and has served on the faculties of Harvard and UCLA before moving to Berkeley. He has published widely, with a particular focus on questions relating to causal inference.

Professor Imbens will present a talk entitled "Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand." If you have been following the discussion on achieving balance taking place on the blog, then this talk should be of great interest. It considers situations in which balance is difficult to achieve in practice, and suggests that estimating treatment effects for statistically defined subsamples may produce better results. The presentation will be at noon on Wednesday, November 2 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The abstract of the paper follows on the jump:

Estimation of average treatment effects under unconfoundedness or selection on observables is often hampered by lack of overlap in the covariate distributions. This lack of overlap can lead to imprecise estimates and can make commonly used estimators sensitive to the choice of specification. In such cases researchers have often used informal methods for trimming the sample or focused on subpopulations of interest. In this paper we develop formal methods for addressing such lack of overlap in which we sacrifice some external validity in exchange for improved internal validity. We characterize optimal subsamples where the average treatment effect can be estimated most precisely, as well as optimally weighted average treatment effects. We show the problem of lack of overlap has important connections to the presence of treatment effect heterogeneity: under the assumption of constant conditional average treatment effects the treatment effect can be estimated much more precisely. The efficient estimator for the treatment effect under the assumption of a constant conditional average treatment effect is shown to be identical to the efficient estimator for the optimally weighted average treatment effect. We also develop tests for the null hypotheses of a constant and a zero conditional average treatment effect. The latter is in practice more powerful than the commonly used test for a zero average treatment effect.

Posted by Mike Kellermann at 8:05 PM

October 28, 2005

More Questions About Balance (And No Answers)

Jim Greiner

The recent posts on achieving good balance within matching have stimulated a certain amount of interest. To this debate I offer more questions and, alas, no answers, which are what I'd really like to know. (For what it's worth, I am not doing research in this area. All of my questions are genuine, not rhetorical.)

As I understand it, the genetic algorithm that Diamond and Sekhon favor searches for matches that maximize the smallest p-value from a set of hypothesis tests. The subjects of the hypothesis tests are the covariates, taken one at a time, and the two-way interactions, also taken one at a time.

My questions:
Is the objective in matching treated and control units to find sets of observations with the same JOINT distribution of the covariates, which is what one would have in a randomized experiment?

If so, do we expect achieving balance in all univariate (i.e. marginal) and two-way distributions to accomplish this goal, given that the marginal distributions of any multidimensional random vector do not determine the joint? On the other hand, if two sets of random vectors have the same joint distribution, would we expect hypothesis tests applied to individual (univariate) covariates or their interactions to achieve p-values of .15 or greater?

Does the dimension of the vector (i.e. the number of covariates) play a role here, in that if we had 20 covariates, we would expect a comparison of individual covariates marginally to produce a few p-values of below .15? Perhaps more broadly, what theory tells us that the genetic algorithm search is actually attempting to do the right thing - and what is it?

A propensity score method has answers to some of these questions, though it raises others. On the plus side, the theorems say that observations with the same propensity score have the same joint (not merely marginal) distribution of the covariates. Thus, if the goal is to replicate a randomized experiment's much-valued ability to produce observations with the same joint covariate distribution, conditioning on the true propensity score will do that. That's the theory that tells us what propensity score matching is attempting to do is the right thing. The problem is, of course, that in any case that matters, we don't know the true propensity scores, and estimation of them raises profound questions about model fit and adequacy. One can check disparities in marginal distributions, but for the reasons stated above, such checks are not really enough. A question for advocates of propensity scores is the following: if propensity score matching is designed to reduce dependence on the substantive model that relates outcomes to covariates, does it do so only by inducing dependence on proper specification of the propensity score model?

For those who would eschew hypothesis tests in assessing balance (see yesterday's post), how does one assess balance? True, one can always reduce the power of any test to reject a null by discarding observations (I have heard that K-S in particular has low power), but any comparison of distributions rests on some set of criteria. Looking at t-scores is a hypothesis test (how else would one decide when the set of scores is too big or too small?). Are hypothesis tests the worst method of assessing balance, except for all of the others?

I have only one suggestion on this subject: whatever method one uses to create matched sets of treated and control groups, after all ordinary checking of marginal distributions is complete, throw something completely wild at the results. For both groups, calculate a fifth moment of covariate one, interact it with a third moment of covariate two and a second moment of covariate three. Do a test and see what happens. If the two groups have the same joint distribution of their covariates . . . .
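To make the worry about marginals versus joints concrete, here is a small constructed example (entirely hypothetical data, not from any matching study). Two groups of three binary covariates, coded as -1 and +1, agree on every univariate mean and every two-way interaction, yet their joint distributions do not overlap at all. Only a "wild" three-way check reveals the difference.

```python
from itertools import combinations, product
from math import prod
from statistics import mean

# All eight sign patterns of three covariates, split by the sign of their product.
patterns = list(product([-1, 1], repeat=3))
group_a = [p for p in patterns if prod(p) == 1]    # hypothetical "treated" units
group_b = [p for p in patterns if prod(p) == -1]   # hypothetical "control" units

def interaction_means(group, order):
    """Mean of every product of `order` distinct covariates within a group."""
    return [
        mean(prod(unit[i] for i in idx) for unit in group)
        for idx in combinations(range(3), order)
    ]

for order in (1, 2, 3):
    print(order, interaction_means(group_a, order), interaction_means(group_b, order))
```

Checks on means and two-way interactions would report perfect balance here, while the three-way product differs as much as it possibly can. Real data will rarely be this adversarial, but the example shows why passing marginal checks cannot by itself guarantee the same joint distribution.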

Posted by James Greiner at 3:19 AM

October 27, 2005

Don't Use Hypothesis Tests for Balance

Gary King

Jens' last two blog posts constitute an excellent statement of where the literature on matching is, but I think almost all of the literature has this point wrong. Hypothesis tests for checking balance in matching are in fact (1) unhelpful at best and (2) usually harmful.

Suppose you had a control group and a treatment group that are identical (exactly matched) except for one person, or except for a bunch of people in one very minor way. Suppose hypothesis tests indicate no difference between the groups, and so you'd be in the situation of reporting balance was great and no further adjustment was needed. (We might think of this as a real experiment where the outcome variable hasn't been collected but is expensive to do so.) If you were given the chance of dropping the one or few people that caused the two groups to differ and replacing them with others that exactly matched, would you do so? Since the dimension on which the inexact match or matches occurred might be the one that has a huge effect on your outcome variable, the bias due to not switching could be huge. So you'd undoubtedly make the switch, despite the fact that the hypothesis test indicated that there was no problem. Hence (1) the tests are unhelpful: passing the test does not necessarily protect one from bias more than failing the test.

Now suppose you have data that don't match very well by all hypothesis tests and you randomly (rather than systematically to improve matching) drop observations, in a bad application of matching. What will happen? Your t-tests or ks-tests or any other hypothesis tests will lose power and so will indicate that balance is getting better and better. Yet, bias is not changing at all, and efficiency is dropping fast. The tests are telling you to discard data! Hence (2) hypothesis tests to evaluate balance are harmful, quite seriously so.
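The arithmetic behind point (2) is easy to sketch. The numbers below are hypothetical, and the sketch assumes that random dropping leaves the difference in means and the group variances unchanged (which is true in expectation), so that only the sample size moves.

```python
from math import sqrt

def two_sample_t(mean_diff, var_treated, var_control, n_treated, n_control):
    """Unequal-variance t-statistic for a difference in covariate means."""
    return mean_diff / sqrt(var_treated / n_treated + var_control / n_control)

MEAN_DIFF = 0.5   # hypothetical imbalance, held fixed as observations are dropped
VARIANCE = 1.0    # hypothetical covariate variance in each group

t_by_n = {
    n: two_sample_t(MEAN_DIFF, VARIANCE, VARIANCE, n, n)
    for n in (1000, 100, 30, 10)
}
for n, t in t_by_n.items():
    print(f"n per group = {n:4d}   t = {t:5.2f}")
```

The imbalance is identical in every row, yet the test statistic shrinks steadily and eventually falls below the usual 1.96 cutoff, so the test "passes" only because data were thrown away.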

The fact is that there is no superpopulation to which we need to infer features of the explanatory variables; all analysis models we regularly use after matching are conditional on X. Balance should be assessed on the observed data, and not be the subject of inference or hypothesis tests.

This message rehearses an argument in a to-be-revised version of our matching paper by Ho, Imai, King, and Stuart that we hope to be finished with and post in a couple of weeks.

Posted by Gary King at 4:40 AM

October 26, 2005

Did You Achieve Balance?! Part II

Jens Hainmueller

Continuing from yesterday's post, another popular way to test balance is to examine standardized differences (SDIFF) between groups (Rosenbaum and Rubin 1985). SDIFF captures the difference in means in the matched samples, scaled by the square root of the average variance in the un-matched groups. This test has been criticized for the lack of formal criteria for judging the size of the standardized bias. Moreover, it may be open to manipulation as one can add observations to the control group in order to decrease variance in the denominator (Smith and Todd 2005).
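As a rough illustration of the definition just given, here is a minimal sketch of SDIFF for a single covariate. The data are made up, the function and argument names are mine, and the sketch uses sample variances; published versions often multiply the result by 100 to express it as a percentage.

```python
from math import sqrt
from statistics import mean, variance

def standardized_difference(treated_matched, control_matched,
                            treated_unmatched, control_unmatched):
    """Difference in matched-sample means, scaled by the square root of the
    average of the two sample variances in the unmatched groups."""
    scale = sqrt((variance(treated_unmatched) + variance(control_unmatched)) / 2)
    return (mean(treated_matched) - mean(control_matched)) / scale

# Hypothetical covariate values before and after matching.
treated_all = [1, 2, 3, 4, 5]
control_all = [0, 2, 4, 6, 8]
treated_kept = [2, 3, 4]
control_kept = [2, 2, 4]

sdiff = standardized_difference(treated_kept, control_kept, treated_all, control_all)
print(f"SDIFF = {sdiff:.3f}")
```

Note that the denominator depends only on the unmatched groups, which is why adding observations to the control pool can change the score without changing the matched samples at all.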

Staying in the realm of univariate balance tests, some claim that difference in means tests are insufficient and that Kolmogorov-Smirnov (KS) tests are needed to non-parametrically test for the equality of distributions (Diamond and Sekhon 2005). These KS tests need to be bootstrapped, by the way, to yield correct coverage in the presence of point masses in the distributions of the covariates (Abadie 2002). Again, these tests would substantially increase the balance hurdle. Are they necessary for reliable causal inference?

Apart from univariate tests there are also some multivariate balance tests floating around in the literature such as the Hotelling T^2 test of the joint null of equal means of all covariates, multivariate (bootstrapped) Kolmogorov-Smirnov (KS) and Chi-Square null deviance tests based on the estimated assignment probabilities, as well as various regression-based tests for joint insignificance, etc. Which of these tests is preferable in what situation? What is the relationship between uni- and multivariate balance?

Last but not least, there is the thorny question of significance levels. Is a p-value of 0.10, let's say against the null of equality of means, high enough for satisfactory balance? Is .05 permissible? There is evidence that conventional significance standards are too lenient to obtain reliable causal inference in the canonical LaLonde data set (Diamond and Sekhon 2005).

That is a lot of questions to which I do not know the answers. The current lack of a scholarly standard for covariate balance strikes me as troubling, because balance affects the quality of the causal inferences we draw. I think it is important to bring the balance issue to the forefront of the matching debate. That is why Jas Sekhon and I are currently working on a paper on this topic. Suppose you are reviewing a matching article. What does it take to convince you that the authors "achieved balance"? Please feel cordially invited to join the debate.

Posted by James Greiner at 4:08 AM

October 25, 2005

Did You Achieve Balance?! Part I

Jens Hainmueller

There exists a growing consensus in the causal inference literature that when it comes to bias adjustment under selection on observables, matching methods dominate ordinary regression (esp. when discrepancies between groups are large). But how do we judge the quality of a matching? My professors tell me: "We want good balance." Sounds great, so I thought at first. Reading more matching articles, however, I soon became somewhat startled by the scholarly disagreement about what actually constitutes "good" balance in observational studies. Despite the fact that matching methods are now widely used all across the social sciences, we still lack shared standards for covariate balance: Which tests should be used in what type of data? What are their statistical properties and how do they compare to each other? And how much balance is good enough?

From reading this literature (sincere apologies if I have missed something relevant), it seems to me that most people agree that paired t-tests for differences in means are obligatory. T-tests are useful because matching by construction produces matched pairs. But should we test by comparing whole groups (treated vs. matched-untreated) or within propensity score ("PS") subclasses? A problem with the latter may be that the choice of intervals can be arbitrary, which is critical as interval width affects the power of the test (Smith and Todd 2005).

Moreover, which covariates should we t-test balance on? At least all that are included in the matching (right?), but how about other moments, the full set of interactions and higher-order terms, etc? The latter seems helpful to minimize bias but is done once in a blue moon (at least in the papers that I encountered). Most authors avoid these additional tests since they exacerbate common support problems and substantially raise the hurdle for obtaining balance.

Finally, should we t-test balance on the PS and/or the covariates orthogonalized to the PS? How do we deal with the estimation uncertainty in these variables? And what does it mean -- as happens sometimes in practice -- to have remaining imbalance on the PS while all covariates are balanced?

Stand by for part II of this post tomorrow.

Posted by James Greiner at 5:00 AM

October 24, 2005

Applied Statistics - Gopi Goswami

This week, the Applied Statistics Workshop will be presenting a talk by Gopi Goswami of the Harvard Statistics Department entitled "Evolutionary Monte Carlo Methods for Clustering." Gopi Goswami received his Ph.D. from the Department of Statistics at Harvard in June 2005. Before coming to Harvard, he was an undergraduate and master's student at the Indian Statistical Institute in Calcutta. His dissertation, "On Population-Based MCMC Methods," develops new techniques for more efficiently sampling from a target density. He is currently a post-doctoral scholar in the Harvard Statistics Department. The presentation will be at noon on Wednesday, October 26 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The paper he will present on Wednesday explores these methods in the context of clustering problems:

We consider the problem of clustering a group of observations according to some objective function (e.g. K-means clustering, variable selection) or according to a posterior density (e.g. posterior from a Dirichlet Process prior) of cluster indicators. We cast both kinds of problems in the framework of sampling for cluster indicators. So far, Gibbs sampling, “split-merge” Metropolis-Hastings algorithms and various modifications of these have been the basic tools used for sampling in this context. We propose a new population based MCMC approach, in the same vein as parallel tempering. We introduce three new “crossover moves” (based on swapping and reshuffling sub-cluster intersections) which make such an algorithm very efficient with respect to Integrated Autocorrelation Time (IAT) of various relevant statistics and also with respect to the ability to escape from local modes. We call this new algorithm the Population Based Clustering (PBC) algorithm. We apply the PBC algorithm to motif clustering, Beta mixture of Bernoulli clustering and a Bayesian Information Criterion (BIC) based variable selection problem. We also discuss clustering of mixtures of Normals and compare the performance of the PBC algorithm as a stochastic optimizer with K-means clustering.

Posted by Mike Kellermann at 3:16 PM

“IV Etiquette”

You, Jong-Sung

One of my most embarrassing experiences occurred surrounding the use of instrumental variables in my ASR article with Sanjeev Khagram on inequality and corruption (2005). The article developed from my qualifying paper on causes of corruption (2003), in which I examined several hypotheses on the causal effects of inequality, democracy, economic development, and trade openness. Since all four of these explanatory variables may be affected by corruption, I tried to find appropriate instruments. Initially, I tried five: latitude, number of frost days, malaria prevalence index, ethno-linguistic fractionalization, and constructed openness. They had strong predictive power for the endogenous variables in the first stage regression, and the p-values for the over-identification test in the second stage regressions were generally large enough so that I could not reject the null hypothesis of no correlation between the instruments and the error term of the regression. I worked with Professor Khagram to make a publishable article from my qualifying paper, and we submitted our manuscript to the ASR. The first review we received from the editor was encouraging. The editor advised us to “revise and resubmit” in his three-page-long letter, which showed his interest in our paper. But the editor as well as an anonymous reviewer asked us to provide an argument explaining how our instruments were correlated with the endogenous variables but not directly correlated with corruption. I initially considered responding to this critique by citing Rodrik et al.’s draft paper entitled “Institutions Rule: The Primacy of Institutions over Geography and Integration in Economic Development” (later published in the Journal of Economic Growth, 2004), which argued, “An instrument is something that simply has some desirable statistical properties. It need not be a large part of the causal story.”

However, I was criticized regarding the use of instruments when I presented at a Work-in-Progress Seminar at the Kennedy School of Government and at the Comparative Political Economy Conference at Yale University in spring 2004. In the Work-in-Progress Seminar, some professors at the Kennedy School noted that an overidentification test can pass if the instruments are all wrong in the same direction. At the Yale conference, Professor Daron Acemoglu of MIT was a discussant for my paper, and he used the term “IV etiquette” to emphasize the importance of giving a plausible story for the first stage. He pointed out that without a clear story for the first stage, it is impossible to tell whether the instrument is uncorrelated with unobserved determinants of the dependent variable. It was really an embarrassing moment when I was criticized for the lack of etiquette in front of many scholars.

So, I had to find more convincing instruments. In this regard, I have to thank my friend, Andrew Leigh, who was a doctoral student in public policy then and is currently Research Fellow at Australian National University. He found that “mature cohort size” can be used as an instrument for inequality in his dissertation paper entitled "Does Equality Lead to Fraternity?", based on Higgins and Williamson's (1999) theory of cohort size effect on income inequality. Also, I came to realize how conference presentations and discussions can be helpful in improving the quality of research.

Posted by SSS Coauthors at 4:04 AM

October 21, 2005

Best Practice Stats Reporting (Almost)

Felix Elwert

Let’s salute the New York Times for its near perfect polling documentation. In a recent edition of the Sunday Magazine, the Times includes a two-page spread on a phone survey on New York City politics. Though the survey touches on some life-and-death issues (“Would you ever date a Republican?”), it’s really more for laughs than higher learning. Regardless, the Times goes to great lengths to describe its methodology:

“Methodology: This telephone poll of a random sample of 1,011 adults in New York City was conducted for the New York Times Magazine by Blum & Weprin Associates Inc. between Aug. 29 and Sept. 1. The sample was based on a random-digital-dialing design that draws numbers from all existing telephone exchanges in the five boroughs of New York, giving all numbers, listed and unlisted, a proportionate chance of being included. Respondents were selected randomly within the household and offered the option of being interviewed in Spanish. The overall sample results were weighted demographically and geographically to population data. The estimated average sample tolerance for data from the survey is plus or minus 3 percent at the 95 percent confidence level. Sampling error for subgroups is higher. Sampling is only one source of error. Other sources of error may include question wording, question order and interviewer effects.”

That’s 146 words on survey sampling likely lost on many readers. We may quibble about the omission of the nonresponse rate (although they mention that results were weighted to represent known geographic and demographic distributions). We may find the phrase “sample tolerance” for “confidence interval” a tad confusing. We may protest that they forgot a comma before the “and” in the closing enumeration. But that’s about it.
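For the curious, the “plus or minus 3 percent” can be roughly reproduced with the textbook margin-of-error formula. The sketch below assumes simple random sampling and the worst-case proportion of 50 percent; it ignores the weighting the Times describes, which would typically widen the interval somewhat, so it is a back-of-the-envelope check rather than the pollster's actual calculation.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.95):
    """Half-width of the normal-approximation confidence interval for a
    proportion p estimated from a simple random sample of size n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)

moe = margin_of_error(1011)  # sample size reported in the methodology note
print(f"margin of error: {100 * moe:.1f} percentage points")
```

Halving the sample (as for a subgroup) inflates the figure by a factor of about 1.4, which is the sense in which “sampling error for subgroups is higher.”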

I would cry tears of joy if the major papers in my native Germany started taking survey sampling nearly as seriously as the Times. Instead, we get anecdote-laden head scratching over recent failures to predict national election results with anything approaching accuracy. Seriously, I know Europeans aren’t currently inclined to follow American examples. But how would attention to basic statistical ethics work for an exception?

Posted by Felix Elwert at 5:26 AM

October 20, 2005

Book Review: “The Probability of God”, by Stephen D. Unwin

Drew Thomas

I continue with my review of The Probability of God, by Stephen D. Unwin, which I began here.

The first clue I had that this book would have anything but rigorous mathematical analysis was that I found it in Harvard’s Divinity library. As expected, the book is mainly philosophical in nature, but that doesn’t mean it exceeds its mathematical scope. Indeed, it gives the reader a good introduction to Bayesian inference while being very clear about its limits.

The premise is simple: start with a proposition – in this case, that a monotheistic God exists; select a series of evidential questions that are relevant to the investigation; and assess the evidence under each of the two mutually exclusive possibilities.

The considerations he takes into account are as follows:

Prior distribution: Is there any reason to believe God exists other than using anti-anthropic arguments? Unwin believes there is no value in the “watchmaker” hypothesis – that the wonder and beauty we see around us is so complex that it could only have been designed by a being of higher order than our own – and so chooses the simplest of priors, that there might as well be a 50-50 chance. (Unwin later demonstrates that this prior fails any reasonable sensitivity analysis – stay tuned.)

In its rawest form, Bayesian inference takes the following form:

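$$P(\text{prop true} \mid \text{evid}) = \frac{P(\text{evid} \mid \text{prop true})\,P(\text{prop true})}{P(\text{evid} \mid \text{prop true})\,P(\text{prop true}) + P(\text{evid} \mid \text{prop false})\,P(\text{prop false})}$$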

Notice that if we divide top and bottom by P(evidence|prop false), we have the following quantity on top and bottom: P(evidence|proptrue)/P(evidence|propfalse). Statisticians call this a Bayes Factor – the likelihood of one model over another – while Unwin, seeking to appeal to a wider audience, calls this a Divine Indicator. I’ll continue with the former.

He then considers six “quantities” that relate to God’s existence, and how they fare under a world with God or no God. In particular, he examines each Bayes factor, considers each piece of evidence to be independent from the others, then performs the Bayes calculation one at a time, using each subsequent posterior probability as the new prior probability. Any skeptic who might object that his inquiries are skewed by his own personal biases should remember that this is just an exercise.

In addition, to simplify the math, Unwin uses a scale of 1 to 3 to evaluate each piece of evidence, indicating no, weak or strong support (this is my interpretation, rather than a hard ranking system the author himself uses). To put this into the equation, he uses a 5-level scale, setting the Bayes factor to be 0.1, 0.5, 1, 2 or 10 depending on the comparison of evidence.

1) The recognition that “goodness” exists. Under God, he argues, good and evil are built into the system. Without God, goodness can only be described as a pragmatic measure, so goodness wouldn’t be taken in that context. Unwin starts off with a blast and gives himself a 10. P(God exists) is now 91%.

2) The recognition that “moral evil” exists. Unwin says that moral evil is inevitable in a godless universe, but that God wouldn’t tolerate the degree we have right now. Strong meets weak; the Bayes factor at this step is 0.5, leaving an 83% chance. (I find this step a little unsettling, as it immediately turns God into a humanlike figure, attaching too much specificity in my mind.)

3) The recognition that “natural evil” exists. In the wake of Hurricane Katrina, a great number of survivors in Louisiana are asking themselves what kind of a God would allow such a tragedy to happen. Unwin carries the same spirit across and claims that such a perspective makes little sense under God’s domain. No evidence versus strong gives a Bayes factor of 0.1 and a 33% chance of God’s existence.

4) The incidence of “intra-natural” miracles (such as whether praying for the Red Sox to win makes it so). Studies are routinely carried out on whether organized prayer can aid in the healing process. Never mind that these studies are highly unscientific – there isn’t an equal group praying against another injured person with roughly the same path to recovery, and a control group is nearly impossible to manufacture. Unwin doesn’t mind the inconclusiveness of these experiments; instead he relies on personal perspective and finds that prayer has some place in the world of God but little in one without. A Bayes factor of 2 brings the probability of God back to 50%.

5) The incidence of “extra-natural” miracles (those examples that can’t be explained by science). These sorts of miracles were observed before God, so Unwin says many other systems are good enough to explain their existence (though certainly not their cause). Equal evidence means a Bayes factor of 1, and the probability of God holds at 50%.

6) Religious experiences. I find this category to be the weakest of Unwin’s areas of evidence, since it immediately suggests a stacked deck. Unwin does hold back and merely suggests that what we perceive to be religious experiences – perceived moments of oneness with a higher power – are more likely to be justified if there is such a higher power. Unwin gives a Bayes factor of 2, bringing us to the conclusion that in his perspective, the probability of God’s existence is 67%.

Now many of you (including my co-authors) are bewildered as to why I’d consider this book, and this analysis, as being relevant to the practice of statistics. To begin with – or rather, end with – Unwin admits that this test is extremely sensitive to the choice of prior beliefs. Under his assessment of the evidence, his prior belief in God’s existence (50%) yields the probability of God’s existence at 67%; using prior beliefs of 10% or 75%, using the same evidence, swings the result to 18% or 86% respectively.
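
The arithmetic behind these figures is ordinary odds updating: convert the prior probability to odds, multiply by each Bayes factor in turn, and convert back. Here is a minimal Python sketch. The function name is my own, and the factor of 10 for the first piece of evidence is an assumption: it is the value implied by the 83% reported after step 2 and by the 50% prior ending at 67%.

```python
def update(prior, bayes_factors):
    """Return the posterior probability after applying each Bayes factor in turn."""
    odds = prior / (1.0 - prior)      # probability -> odds
    for bf in bayes_factors:
        odds *= bf                    # each piece of evidence rescales the odds
    return odds / (1.0 + odds)        # odds -> probability

# Bayes factors for steps 1-6. The first value (10) is inferred, not quoted.
FACTORS = [10, 0.5, 0.1, 2, 1, 2]

for prior in (0.10, 0.50, 0.75):
    print(f"prior {prior:.0%} -> posterior {update(prior, FACTORS):.0%}")
```

Because the six factors multiply to 2, the evidence simply doubles whatever prior odds one starts with, which is why the answer is so sensitive to the prior.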

As in many strong works of philosophy, the important lesson is not in the answer, but in asking the questions that lead there. These calculations lead only to the halfway point of the text, as Unwin now segues from his method of observation into a discussion of the nature of faith, and what components of probability and faith lead to what we understand as belief.

Posted by SSS Coauthors at 5:31 AM

October 19, 2005

Social Science and Litigation, Part II

Jim Greiner

Professor Kousser’s 1984 article on objectivity in expert testimony, which I first introduced to the blog here, raises fundamental questions about the role of expert witnesses in litigation. Among those questions is the following: when presenting conclusions to a court, how much are expert witnesses entitled to rely on the adversarial process that is the foundation of lawsuits? Some experts appear to believe that their job is to present the best statistical, engineering, chemical, or whatever, case for their sides. Of course, they would not perjure themselves. Still, such witnesses do not attempt to provide a balanced look at the factual information to be evaluated; rather, they focus on demonstrating how the relevant data can be interpreted in favor of the parties retaining them. After all, the opposing side has its own lawyers and, ordinarily, its own experts who (surely) are doing the same thing.

To make matters more concrete, I provide the following simplified example. My colleagues and I retained a quantitative expert in a redistricting case to measure the partisan bias of several proposed redistricting plans. We used a measure of bias that assigned a score to each plan; a score of zero meant no bias, while a score of two meant roughly that the plan would give one party two “extra” seats. The (litigation) difficulty we ran into was that the scores did not appear to distinguish the plan we favored from the one the other side proposed. The bias in our plan was, say, .03, while that of the other side was something like .15. Thus, the difference in bias between the two plans was approximately one tenth of one seat. But our expert, at our prompting, presented the results differently: he emphasized the other side’s plan was five times more biased than our own.

Before dismissing this story, and the view of the expert as an extension of trial counsel, with a snort and a shake of the head about the lack of ethics in modern society, consider how the structure of the litigation process favors such choices. At trial, an expert (just like any other witness) is not allowed to relate his or her views directly to the court. Rather, the expert speaks to the judge or jury only in response to questions from lawyers under the duty to advocate their respective clients' cases, that is, the duty NOT to be neutral. Before trial, an expert who has consulted with a party to litigation may not be retained by the opposing party. And trial counsel, not the witness, decides whether the expert speaks to the court at all.

There are good reasons for all of these rules. The rule requiring testimony to come in response to questions from an attorney prevents witnesses from testifying about subjects deemed inadmissible (opposing counsel can object between question and answer). With respect to the prohibition on consulting with one side and then working for the other, experts who have consulted for Side A learn about Side A’s case in a way that Side B might pay handsomely to discover. But if, as many in the legal profession appear to believe, expert witnesses really are whores, could it be otherwise as litigation is presently structured?

Posted by James Greiner at 6:04 AM

October 18, 2005

A Social Science of Architecture

Gary King

After eight years of learning something about architecture (from Harry Cobb and his team) and extensive programmatic planning, the Institute for Quantitative Social Science this semester moves into the new Center for Government and International Studies buildings. Our official address is the Third Floor of 1737 Cambridge Street (the design is vaguely reminiscent of the bridge of the Starship Enterprise), although we also occupy some of the other floors and some of the building across the street. It is not really finished yet, but it is a terrific facility, with floor to ceiling windows in most offices, a wonderful seminar room for our Applied Statistics Workshop, and many other useful features. Perhaps even more remarkably, everyone seems to love it (Congratulations Harry!).

One issue I learned during this long process was how the field of architecture has the best science, engineering, and art, but very little modern social scientific analysis. Yet, social science, quantitative social science in particular, could greatly help architecture achieve its goals, I think. Ultimately the goal of this particular $100M-plus building, and of most buildings built by universities, is not only to create beautiful surroundings but also to increase the amount of knowledge created, disseminated, and preserved (my summary of the purpose of modern research universities). So do not limit yourself to asking how a building makes you feel, what architectural critics might think, how it fits in with the style of other buildings on campus, or whether your office is to your liking. Ask instead, or in addition, whether the building increases the units of knowledge created, disseminated, and preserved more than some other building or some other potential use for the money. This strikes me as the central question to be answered by those who decide what buildings to build, and yet the systematic scientific basis for this decision is almost nonexistent.

As such, some systematic data collection could have a considerable impact on this field. Do corridors or suites make the faculty and students produce and learn more? Does vertical circulation work as well as horizontal? Should we put faculty in close proximity to others working on the same projects or should we maximize interdisciplinary adjacencies? Which types of floor plans increase interaction? Which types of interaction produce the most knowledge created, disseminated, and preserved? Do we want to build buildings that encourage doors to be kept open, so as to make the faculty seem approachable, or should we try to keep doors closed so that they can get work done? In this field as in most others, a great deal can be learned by directly measuring the relevant outcome variable; in architecture, quite remarkably, this has only rarely been attempted.

Of course it is done all the time via qualitative judgments, but in almost every field of science where a sufficient fraction of information can be quantified, statistical analysis beats human judgment. There is no reason to think that the same kind of statistical science wouldn't create enormous advances here too.

I have heard of a couple of isolated academic works on this subject, but we're talking about some of the most important and expensive decisions universities make (and among the biggest decisions businesses and many other institutions make too). There should be an entire subfield devoted to the subject. All it would take is some data collection and analysis. Outcome measures could include, for example, faculty citation rates, publications, awards, grants, and departmental rankings, along with student recruitment, retention, graduation, and placement rates. The key treatment variables would include various information on the types of buildings and architectural design. Random assignment seems infeasible, but relatively exogenous features might include departmental moves or city and town building restrictions. Universities that allow faculty the choice of buildings could also provide useful revealed preference measures. I would think that a few enterprising scholars on this path could have an enormous impact both in creating a new academic subfield and in improving a vitally important set of university (and societal) decisions.

In the interim, we'll enjoy the new buildings and hope they have a positive impact.

Posted by James Greiner at 5:28 AM

October 17, 2005

Ideal Points

Michael Kellermann

One of the goals of this blog is to promote dialog between people working in different social science disciplines. As part of that, we have been posting reports from the Political Methodology conference in Tallahassee. Of course, even though we may all speak the same statistical language, we often speak it with distinct accents; similar concepts and methods often go by different names in different fields. For example, it turns out that estimating the ideal points of political actors is similar in many ways to the problem of estimating the difficulty of questions on standardized tests, a commonality that has only been exploited in the last few years.

First things first, however; what exactly is an ideal point? People have long thought about politics in spatial terms: "left" and "right" have been used to describe political preferences since at least the French Revolution, when royalists sat on the right and radicals on the left in the Legislative Assembly. Ideal point models attempt to estimate the position of each legislator on the left-right or other dimensions using the votes that they cast on legislation. Basically, the models assume that a legislator will vote in favor of a motion if it moves policy outcomes closer to their most preferred policy. The resulting estimates from these models provide a descriptive summary of the distribution of preferences within a legislature. They are also important parameters in many formal models of legislative behavior.

Much of the recent work in the area of ideal point estimation has drawn on earlier research by education scholars. Item response theory studies the relationship between the ability (and other characteristics) of test subjects and the answers they give to particular test questions. The general idea is that every test question has an associated ability cutpoint; those with ability above the cutpoint will answer correctly on average. In a typical testing situation, the authors will attempt to include questions with an array of cutpoints in order to estimate the ability of the test takers.

The analogy between ability estimation and ideal point estimation is close; votes in the legislature correspond to questions on the test. One difference is that, in the item response context, the researcher will typically know the correct answer and can therefore associate those responses with higher estimated ability. In the ideal point context, it is not always clear whether a proposal moves policy left or right. Several recent articles have addressed this and other problems in translating item response models to the political context, including work by Harvard's own Kevin Quinn with Andrew Martin (Martin and Quinn 2002), Clinton, Jackman, and Rivers (2004), and Bafumi, Gelman, Park, and Kaplan (2005). Dan Hopkins described some recent work on ideal point estimation in an earlier post.
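
To make the analogy concrete, here is a small Python sketch of a logistic two-parameter item response curve. This is one common form of such models and not necessarily the specification used in the papers cited above; the function and parameter names are illustrative. In the testing setting the discrimination parameter is positive, so ability above the cutpoint raises the chance of a correct answer. In the voting setting its sign is not known in advance, which is one way to express the problem that a proposal may move policy left or right.

```python
import math

def prob_yes(position, cutpoint, discrimination):
    """Probability of a correct answer (or a 'yea' vote) under a logistic
    two-parameter item response model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (position - cutpoint)))

# Test question: ability above the cutpoint makes a correct answer more likely.
print(prob_yes(position=1.0, cutpoint=0.0, discrimination=2.0))   # above 0.5

# Roll call vote: a negative discrimination flips the direction, so the same
# legislator (position 1.0) is now more likely to vote 'nay'.
print(prob_yes(position=1.0, cutpoint=0.0, discrimination=-2.0))  # below 0.5
```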

Posted by James Greiner at 4:19 AM

October 14, 2005

Instrumental Variables in Qualitative Research

You, Jong-Sung

In large-N quantitative research, instrumental variables are often used to address the problem of endogeneity. In small-N qualitative research such as comparative historical case studies, researchers examine historical sequence and intervening causal process between an independent variable(s) and the outcome of the dependent variable in order to establish causal direction and illuminate causal mechanisms (Rueschemeyer and Stephens 1997). However, careful examination of sequence and intervening process through process-tracing may not solve the problem of endogeneity. When Y affected X initially and X, in turn, influenced Y later, looking at the sequence and intervening causal process in the latter part without examining the former process will produce a misleading conclusion.

In my comparative historical case study of corruption in South Korea, relative to Taiwan and the Philippines, I attempted to test my hypothesis that income inequality increases corruption and to identify causal mechanisms. It was easy to show the correlations between inequality and corruption. Both inequality and corruption have been the highest in the Philippines and the lowest in Taiwan, with Korea in between. I found that the success of land reform in Korea and Taiwan produced much lower levels of inequality in assets and income than was true of the Philippines, where land reform failed. I provided plausible evidence that the different levels of inequality due to success and failure of land reform accounted for different levels of corruption, and identified some causal mechanisms. Also, between Korea and Taiwan, I found that Korea's chaebol (large conglomerate)-centered industrialization and Taiwan's avoidance of economic concentration led to a divergence of inequality over time, which contributed to divergence of corruption level.

However, the process-tracing for the period after the success or failure of land reform and for the period after the adoption of different industrial policies was not sufficient to establish causal direction because different levels of corruption might have influenced the success and failure of land reform as well as the industrial policy. Hence, I had to show that success and failure of land reform was affected very little by corruption, but largely determined by external factors such as the threat of communism and the differences in the US policy toward these countries. Also, I had to provide evidence that the initial adoption of different industrial policies by Park Chung-hee in Korea and by the KMT leadership in Taiwan were not affected by the different levels of corruption. Essentially, land reform and industrial policy played the role of instrumental variables in statistical studies. These were exogenous events that produced different levels of inequality and thereby caused different levels of corruption but had not been influenced by corruption. Thus, the idea of an instrumental variable can be useful in qualitative research as well.
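
For readers less familiar with the quantitative version of this logic, here is a minimal Python sketch on simulated data. Nothing in it comes from the case study: the variable names, the coefficients, and the true effect of 1.0 are all assumptions chosen for illustration. An unobserved factor drives both the explanatory variable and the outcome, so the ordinary regression slope is biased, while the instrumental variable estimate uses only the variation that comes from the exogenous instrument.

```python
import random

random.seed(0)
N = 20_000
TRUE_EFFECT = 1.0   # assumed effect of x on y in this simulation

z, x, y = [], [], []
for _ in range(N):
    zi = random.gauss(0, 1)           # instrument: shifts x, unrelated to u
    ui = random.gauss(0, 1)           # unobserved factor affecting both x and y
    xi = zi + ui + random.gauss(0, 1)
    yi = TRUE_EFFECT * xi + 2.0 * ui + random.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(yi)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

ols = cov(x, y) / cov(x, x)   # biased upward: picks up the effect of u
iv = cov(z, y) / cov(z, x)    # ratio of covariances with the instrument

print(f"OLS slope: {ols:.2f}   IV estimate: {iv:.2f}   truth: {TRUE_EFFECT}")
```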

Posted by SSS Coauthors at 3:29 AM

October 13, 2005

Dangerous Statistics: Estimating Civilian Losses in Afghanistan

Felix Elwert

There are tougher tasks than appeasing the human subject review board. A few weeks ago, I met Aldo Benini at the American Sociological Association annual meeting in Philadelphia. Benini has worked for various humanitarian organizations over the past decades and specializes in what strikes me as the most dangerous subfield of social science statistics: he collects, analyzes, and models data on the direct and indirect casualties of war.

I had come across Benini before when I saw a presentation on his work with the Global Landmine Survey, which involved building quantitative models to assist the ongoing mine cleanup in Vietnam. Recently, Benini has been working on estimating the number of civilian victims during the first nine months of Operation Enduring Freedom in Afghanistan following 9/11/01. There, field staff visited all 600 communities directly affected by fighting (both airstrikes and ground combat). This survey improves on previous estimates in the news – not least by being a virtual census of the affected communities, employing trained interviewers, and using standardized questionnaires. It’s hard for me to imagine more dangerous conditions of data collection (but, wait, Benini currently works on a similar project in Iraq).

The resulting study establishes a number of important findings. It’s also methodologically interesting. All told, 5,576 residents were killed violently between 9/11/01 and June 2002. Another 5,194 were injured. These numbers are considerably higher than previous estimates. I’m not going to rehash their entire analysis* here. But with respect to the methodological focus of this blog, I’d like to highlight the authors' conclusion that modern war appears to facilitate considerable underreporting of civilian losses.

*Including an interesting zero-inflated Poisson model for the concurrent and historical factors affecting the distribution of civilian victims in Afghanistan.

Posted by James Greiner at 4:50 AM

October 12, 2005

Estimating the Causal Effect of Incumbency

Jens Hainmueller

Since the early seventies, political scientists have been interested in the causal effects of incumbency, i.e. the electoral gain to being the incumbent in a district, relative to not being the incumbent. Unfortunately, these two potential outcomes are never observed simultaneously. Even worse, the inferential problem is compounded by selection on unobservables. Estimates are vulnerable to hidden bias because there probably is a lot of unobserved stuff that’s correlated with both incumbency and electoral success (such as candidate quality, etc.) that you cannot condition on. To identify the incumbency advantage, estimates had to rely on rather strong assumptions. In a recent paper entitled "Randomized Experiments from Non-random Selection in U.S. House Elections", economist David Lee took an innovative whack at this issue. He employs a regression discontinuity design (RDD) that tackles the hidden bias problem based on a fairly weak assumption.

Somewhat ironically, this technique is rather old. The earliest published example dates back to Thistlethwaite and Campbell (1960). They examine the effect of scholarships on career outcomes by comparing students just above and below a threshold for test scores that determines whether students were granted the award. The underlying idea is that in the close neighborhood of the threshold, assignment to treatment is as good as random. Accordingly, unlucky students that just missed the threshold are virtually identical to lucky ones who scored just above the cutoff value. This provides a suitable counterfactual for causal inference. Take a look at the explanatory graph for the situation of a positive causal effect and the situation of no effect.

See the parallel to the incumbency problem? Basically, the RDD works in settings in which assignment to treatment changes discontinuously as a function of one or more underlying variables. Lee argues that this is exactly what happens in the case of (party) incumbency. In a two party system, you become the incumbent if you exceed the (sharp) threshold of 50 percent of vote share. Now assume that parties usually do not exert perfect control over their observed vote share (observed vote share = true vote share + error term with a continuous density). The closer the race, the more likely that random factors determine who ends up winning (just imagine the weather had been different on election day).

Incumbents who barely won the previous election are thus virtually identical to non-incumbents who barely lost. Lee shows that as long as the covariate that determines assignment to treatment includes a random component with a continuous density, treatment status close to the threshold is (in the limit) statistically randomized. The plausibility of this identification assumption is a function of the degree to which parties are able to sort around the threshold. And the cool thing is that you can even test whether this identifying assumption holds - at least for the observed confounders – by using common covariate balance tests.
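
The logic can be illustrated with a small simulation. The Python sketch below is not Lee's estimator or data; all numbers (the 8-point incumbency effect, the error sizes, the one-point window) are assumptions made up for the example. It compares all winners with all losers, and then only winners and losers within one percentage point of the 50 percent threshold.

```python
import random

random.seed(1)
TRUE_EFFECT = 0.08   # assumed incumbency advantage, in vote share
N = 200_000

rows = []
for _ in range(N):
    strength = random.uniform(0.3, 0.7)        # underlying party support
    share = strength + random.gauss(0, 0.02)   # observed share = true share + error
    wins = share > 0.5                         # sharp threshold at 50 percent
    next_share = strength + TRUE_EFFECT * wins + random.gauss(0, 0.05)
    rows.append((share, wins, next_share))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def gap(data):
    """Winners' mean next-election share minus losers' mean."""
    winners = mean(y for _, w, y in data if w)
    losers = mean(y for _, w, y in data if not w)
    return winners - losers

naive = gap(rows)                                       # all districts
close = [r for r in rows if abs(r[0] - 0.5) < 0.01]     # close races only
rdd = gap(close)

print(f"naive gap: {naive:.3f}   close-race gap: {rdd:.3f}   truth: {TRUE_EFFECT}")
```

The naive gap mixes the incumbency effect with the fact that winning parties were stronger to begin with. The close-race gap is much nearer the assumed effect, though a simple difference in means inside a window still carries a small bias from the slope of the outcome in vote share, which is why applied work typically fits a regression on each side of the cutoff.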

There is no free lunch, of course. One potential limitation of the RDD is that it identifies the incumbency effect only for close elections. However, one could argue that when looking at the incumbency advantage, marginal districts are precisely the subpopulation of interest. It is only in close elections that the incumbency advantage is likely to make any difference. Another potential limitation is that the RDD identifies the effect of “party” incumbency, which is not directly comparable to earlier estimates of incumbency advantage that focused on “legislator” incumbency advantage. Party incumbency subsumes legislator incumbency, but also contains a separate party effect and there is no chance to disentangle the two. So surely, the RDD is no panacea. Yet, it can be used to draw causal inferences from observational data based on weaker assumptions than those previously employed in this literature.

The Lee paper has led to a surge in the use of the RDD in political science. Incumbency effects have been re-estimated not only for US House elections, but also for other countries as diverse as India, Great Britain, and Germany. It has also been used to study split-party delegations in the Senate. There may be other political settings in which the RDD framework can be fruitfully applied. Think about it - before economists do :-)

Posted by Jens Hainmueller at 6:48 AM

October 11, 2005

Spatial Lag

Sebastian Bauhoff

In my last blog entry (here), I wrote that associations like space can mess up the assumptions underlying standard estimation techniques. This entry is about the first problem I mentioned, spatial lag: when neighboring observations affect one another. Such dependencies can lead to inconsistent and biased estimates in an OLS model. And even if you don't care about "space" in a geographic sense, you might be interested in related topics like technology diffusion among farmers, network effects, countries that share the same membership in international organizations (an idea picked up in Beck, Gleditsch and Beardsley; see below) etc. The point is that spatial lag is pervasive in many contexts and though it might be called different names, the basic problem remains the same.

Spatial lag models are similar to lagged dependent variable autoregression models in time series analysis, but the spatial correlation coefficient is harder to estimate. Estimating it requires a spatial weights matrix, and it is often not clear what that matrix should look like, i.e., what the actual spatial relation is.
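
In matrix form the spatial lag model is usually written y = rho * W y + X beta + e, where W is the weights matrix. The Python sketch below simulates data from that equation. Everything specific in it is an assumption for illustration: units sit on a line, each unit's neighbors are the units next to it, and the values of rho and beta are arbitrary. Choosing W this way is exactly the kind of decision that is hard to justify in real applications.

```python
import random

random.seed(2)
n, rho, beta = 50, 0.6, 1.5   # illustrative values only

# Row-standardized weights: neighbors are the adjacent units on a line,
# and each row of W sums to one.
W = [[0.0] * n for _ in range(n)]
for i in range(n):
    neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n]
    for j in neighbors:
        W[i][j] = 1.0 / len(neighbors)

def spatial_lag(values):
    """Multiply W by a vector: the weighted average of each unit's neighbors."""
    return [sum(w * v for w, v in zip(row, values)) for row in W]

x = [random.gauss(0, 1) for _ in range(n)]
e = [random.gauss(0, 1) for _ in range(n)]

# Solve y = rho * W y + beta * x + e by fixed-point iteration, which
# converges here because |rho| < 1 and the rows of W sum to one.
y = [0.0] * n
for _ in range(200):
    lag = spatial_lag(y)
    y = [rho * lag[i] + beta * x[i] + e[i] for i in range(n)]
```

Each outcome now depends on its neighbors' outcomes, so the observations are not independent and a regression of y on x alone leaves the spatial term in the error.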

So how much can it matter? James LeSage (in an excellent guide to spatial econometrics and his MATLAB functions, also below) provides an example of OLS and spatial lag estimations of the determinants of house values. The idea is that -- apart from the influence of the independent variables like county population density or unemployment rates -- areas with high house values might be adjacent to other high value areas, and therefore there is a spatial trend in the outcome variable. The example shows that an interesting variable like population density can become statistically insignificant when spatial dependence is taken into account, and that coefficients of other variables can change in magnitude. In addition, taking spatial lag into account also improves the model fit.

So one should really take space into account if it matters. How would you know if it does? There are a number of tests to check for spatial lag, but for the most part just starting to think about it helps.

For some more information on spatial lag, take a look at the sources mentioned:

-- James LeSage's Econometrics Toolbox, which has an excellent workbook discussing spatial econometrics and examples for the MATLAB functions provided on the same site; and
-- Beck, Gleditsch and Beardsley (draft of April 14, 2005) "Space is more than Geography: Using Spatial Econometrics in the Study of Political Economy"

Posted by James Greiner at 5:51 AM

October 10, 2005

Applied Statistics - Rima Izem

This week’s Applied Statistics Workshop presentation will be given by Rima Izem of the Harvard Statistics Department. After receiving her Ph.D. in Statistics from the University of North Carolina at Chapel Hill, Professor Izem joined the Harvard Statistics Department in 2004. Her research interests include statistical methodology in functional data analysis, spatial statistics, and non-parametric statistics. She has presented her work at conferences across North America and Europe. Her 2004 paper, "Analyzing nonlinear variation of thermal performance curves," won the Best Student Paper award from the Graphics and Computing Section of the American Statistical Association.

Professor Izem will present a talk on "Boundary Analysis of Unemployment Rates in Germany." The presentation will be at noon on Wednesday, October 12 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Mike Kellermann at 11:19 AM

October 7, 2005

Social Science and Litigation, Part I

Jim Greiner

Over twenty years ago, J. Morgan Kousser wrote an article with the provocative title, “Are Expert Witnesses Whores? Reflections on Objectivity in Scholarship and Expert Witnessing” (6 The Public Historian 5 (1984)). In answering the rhetorical question largely in the negative, Professor Kousser recounted his own experience as an expert in litigation under the Voting Rights Act, an experience which, according to him, “afforded me the opportunity to tell the truth and do good at the same time.”

As a historian of southern politics specializing in the post-Reconstruction and Progressive eras, Professor Kousser had concluded that at-large voting systems had a racially discriminatory impact upon disfavored minority groups, and that such systems were adopted for exactly that purpose. Having written on the subject, he was “‘discovered’” by a civil rights attorney, retained, and stood ready to provide “window-dressing” in Section 2 cases challenging at-large systems when the Supreme Court decided Mobile v. Bolden, 446 U.S. 55 (1980). Without delving into legal technicalities, and oversimplifying somewhat, Mobile compelled Section 2 plaintiffs to produce evidence regarding the motives of those who adopted the voting schemes under challenge. In doing so, Mobile “made historians . . . necessary participants in voting rights cases” (at least until Congress removed the intent requirement by amending Section 2 in 1982), and so Professor Kousser ended up testifying in several pieces of litigation regarding the motives of those who adopted at-large voting systems and the effectiveness of such systems in achieving their framers’ desires. After examining various meanings of bias and objectivity, and the threats to the latter in both expert witnessing and researching, Professor Kousser concludes his article with the statement, “Testifying and scholaring are about equally objective pursuits.”

As a former litigator of employment discrimination and voting rights cases, I believe that Professor Kousser’s vision of an expert witness is one few lawyers would recognize. As a budding statistician interested in application of social science to the litigation setting, I assert (admittedly with slightly less certainty) that Professor Kousser’s narrative would be unfamiliar to most expert witnesses as well. Few attorneys discover expert witnesses who have spent years studying a question critical in a case they are litigating, fewer still an expert who has reached the “right” answer. It is rare that scholars, having reached conclusions after years of study and research for academic purposes, suddenly discover that the law has evolved in a way to make those conclusions relevant to pending (and, in Professor Kousser’s case, high-profile) litigation.

I’ll be using Professor Kousser’s article as a springboard for a discussion on the relationship among courts, litigators, and expert witnesses in several blog posts. As is true of all members of the Content Committee of this blog, I remain eager for responses and comments.

(It should go without saying that I do not intend in any way to question Professor Kousser’s honesty or integrity, either in the testimony he gave or in his 1984 article. In case it does not go without saying . . .).

Posted by James Greiner at 6:02 AM

October 6, 2005

A Bit on Human "Irrationality"

Amy Perfors

One of the key applications of cognitive science to the other social sciences can lie in testing some of the assumptions made about human psychology in other fields. A classic example of this is in economics: as I understand it, for a long time economists envisioned people as rational actors who act to increase their utility (usually measured by money) as much as they can. The classic results of Kahneman & Tversky, which earned the Nobel Prize, were among the first to show that, contrary to this assumption, in many spheres people act "irrationally." I am putting the word "irrational" in quotes because it's not that we act completely randomly or without motivation, simply that we do not always simply exist to maximize our utility: we use cognitive heuristics to calculate the value of things, we value money not as an absolute but with respect to many other factors (such as how much we already have, how things are phrased and sold to us, etc), and our attitudes towards money and maximizing are influenced by culture and the social situation. This means that models of human economic or group behavior are often only as good as the assumptions made about the people in them.

One researcher who studies these problems is Dan Ariely at MIT. In a recent line of research, he looks at what he calls two separate markets, the monetary and the social. The idea is that if people perceive themselves to be in a monetary market (one involving money), they are highly sensitive to the quantity of compensation, and will do less work if they receive less compensation. If, on the other hand, they perceive themselves to be in a social market (one in which no money is exchanged), they will not be concerned with the quantity of "social" compensation, such as the value of any gifts received.

I really liked this article, in part because (unusual for academic articles) it is kind of funny in places. For instance, their methodology consisted of having the participants do a really boring task and measuring how well their effort correlated to how much they were paid, in either a monetary or social market. The task is really grim: repeatedly dragging a computerized ball to a specific location on the screen. As the authors dryly state, "pretesting and post-experiment debriefing showed that our implementation continues in the grandest tradition of tasks that participants view as being utterly uninteresting and without any redeeming value." (I do not envy that debriefer!)

Funny parts aside, the point this research makes is really interesting: people approach the same task differently depending on what they think it is. When they are not compensated or compensated with a gift (a "social" exchange) they will expend a high amount of effort regardless of the value of the gift. When compensated with money or a gift whose monetary value they are told of, effort is proportional to the value of the compensation. Methodologically, this makes an important point -- if we want to model all sorts of aspects of the market or even social behavior, it's good to understand how our behavior changes as a function of how we conceptualize what is going on. From the cognitive science side, the question is why our behavior changes in this way, and in what instances this is so.

And the message for all of us? If we have a task we need help on, the authors suggest "asking friends and offering them dinner. Just do not tell them how much the dinner costs."

Posted by James Greiner at 6:24 AM

October 5, 2005

Economics as Methodology

John Friedman

Most disciplines define themselves through their field of inquiry; historians study events of the past and the evolving stories of those events, psychologists study the working of the mind, and political scientists study the interaction of governments and people. Economists take a different approach, though, identifying themselves not through subject matter but instead through methodology.

What are these tenets of methodology? While the precise delineation of one’s field is always a tricky matter, I believe most economists would agree on three basic principles: Preferences, Optimization, and Equilibrium. In essence, economics operates under the assumption that people know what they want and then do their best (given limited means) to get it. Given these foundations, mathematics helps to formalize our intuition, since choosing the best alternative can be rewritten as the maximization of a function, often named “utility.” In many cases, of course, people will fail miserably to achieve these goals. The problem might be a lack of information, or unforeseen costs, or any number of other obstacles; but, in economics, it cannot be that people simply do not want something that is better for them.

To many, this definition of economics will seem extraordinarily narrow, disallowing the study of a great many human phenomena. No doubt, in many cases, this observation is correct. But I believe it is exactly this methodological focus that has laid the foundations for the great success of economics in the past 70 years. As a foundation, the framework is straightforward and intuitive; why would someone not want something that, by definition, they prefer? Furthermore, the mathematical expression of economic ideas – a direct result of the assumption of optimization – has helped to lay bare the assumptions lurking behind arguments with great speed. And while I freely admit that economics cannot capture all relevant aspects of human behavior, it would seem a fool’s errand to find a research design that could.

(A brief aside: Mathematics, in economics, is no more than a language for expressing ideas. It is extremely helpful in many situations, as is much jargon, for discussions among experts within the field. But, far too often, economists allow this language to become a barrier between them and the world. I suggest you hold all economists you meet to this standard: If they cannot explain the intuition behind an economic idea, using only standard English words, in five minutes, it is their fault and not yours!)

Posted by James Greiner at 5:58 AM

October 4, 2005

Applied Statistics - Andrew Thomas

This week’s Applied Statistics Workshop presentation will be given by Andrew Thomas of the Harvard Statistics Department. Drew’s talk, entitled “A Comparison of Strategies in Ice Hockey,” considers the choices facing players and coaches during the course of a game. Should they prioritize possession of the puck, or its location on the rink? The paper presents a model that divides the play of the game into different states in order to estimate the probability of scoring or allowing goals conditional on a given starting state.

Drew is currently a second-year Ph.D. candidate in the Statistics Department, having graduated from MIT in 2004 with a degree in Physics. He has presented his work at the Joint Statistical Meetings and the New England Statistical Symposium. He was born and raised in Toronto, Ontario, which may have something to do with his interest in hockey. And, most importantly, he is a fellow blogger on the Social Science Statistics blog. The presentation will be at noon on Wednesday, October 5 (coincidentally enough, opening night for the NHL) in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by Mike Kellermann at 5:45 PM

"On The Fringe": The Probability of God, An Initial Look

Drew Thomas

Stephen D. Unwin made headlines - at least, in the Odds and Ends section – two years ago, with the publication of his book "The Probability of God". His idea was to determine, using some numerical method, whether conditions on earth would be enough to predict whether the Judeo-Christian construction of God does indeed exist.

Thankfully, the book is classified as humor. The actual problem being solved is somewhat irrelevant to the greater community, since matters of faith are conducted in the absence of fact. But this does represent the fringe of our discipline, and how numbers are perceived in the real world.

In this "real world," there are too many examples of numbers distorted for the sake of an agenda. For example, that 4 out of 5 dentists choose a particular toothpaste to endorse tells us nothing about the sample size (or about a possible line of dentists they tossed beforehand). Sports statistics are mangled and mishandled all the time without a mention of sample size concerns or actual relevance. (The misuse of numbers in society is a favorite theme of mine; keep looking for it in my entries.)

At least Dr. Unwin has not only a clearly stated agenda behind his work, but also a clearly stated method and an acknowledgement of subjectivity. Unwin’s calculation puts the probability of God’s existence at 67%; Richard Dawkins, the famed atheist, used the same method and obtained a result of 2% -- about 2% higher than Dawkins would otherwise be willing to admit.

Most of this information came from a radio interview with the good Dr. Unwin. Stay tuned for the book review and a look at his technique.

Posted by James Greiner at 6:07 AM

October 3, 2005

Multilevel Hazard Models in Log-Time Metric?

Felix Elwert

Has anybody figured out how to estimate multilevel hazard models with time-varying covariates in log-time metric (i.e., an accelerated failure time model)?

Together with two colleagues from the Medical School, I’m working on the effect of contextual variables on mortality. We're using a large longitudinal dataset of around ½ million married couples and nine years of follow up. Our key independent variable is time varying. In recent years, much work has been done on multilevel hazard models, for example, that done by Harvey Goldstein and colleagues. But the standard recommendation for estimating such models in the presence of time-varying covariates is to approximate the Cox proportional hazard model using a conditional (i.e., fixed effects) logistic regression, which makes hefty demands on memory. Given the size of our data, we can implement this standard strategy only for a subset of our data.

We are hoping that the log-time metric would make better use of memory and allow us to use the entire sample. The question is: has anybody already developed software to estimate multilevel hazard models with time-varying covariates in the log-time metric? Or can't it be done in principle? Either way, I'd be grateful for pointers.

Posted by SSS Coauthors at 6:00 AM

September 30, 2005

Use of Averaged Data; Mature Cohort Size as an Instrument for Inequality

Jong-Sung You

In my paper with S. Khagram entitled "A comparative study of inequality and corruption" (ASR 2005, vol.70:136-157), we demonstrated that data averaged for a long period (say, 1971-1996) instead of single-year data can be useful for both reducing measurement error and capturing a long-term effect.

In previous empirical studies of causes of corruption, income inequality was found insignificant. We suspected this lack of significance might be due to attenuation bias because income inequality was poorly measured. We found that using averaged data for inequality and other control variables increased the coefficient for inequality and made it significant.
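The attenuation argument can be checked with a small simulation. The sketch below is not the paper's analysis: the sample size, the noise levels, and the 25-year window are assumptions chosen for illustration, and `ols_slope` is a hypothetical helper. It regresses an outcome on a noisy single-year measure of a predictor, then on the multi-year average of the same noisy measure.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope from a simple bivariate least-squares regression of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)
n_units, n_years, true_slope = 2000, 25, 1.0  # illustrative values only

# True (unobserved) predictor and an outcome that depends on it.
true_x = [rng.gauss(0, 1) for _ in range(n_units)]
y = [true_slope * x + rng.gauss(0, 1) for x in true_x]

# Each yearly measurement is the truth plus independent noise (sd = 1).
yearly = [[x + rng.gauss(0, 1) for _ in range(n_years)] for x in true_x]

single_year = [obs[0] for obs in yearly]
averaged = [statistics.fmean(obs) for obs in yearly]

slope_single = ols_slope(single_year, y)
slope_avg = ols_slope(averaged, y)
print(f"single-year slope: {slope_single:.2f}, averaged slope: {slope_avg:.2f}")
```

With measurement noise as large as the true variation, the single-year slope should land near half the true value of 1.0, while the 25-year average recovers most of it.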

Another result from this paper used "mature cohort size" (ratio of population 40 to 59 years old to the population 15 to 69 years old) as an instrument for inequality in IV regressions; again, inequality was found significant. Higgins and Williamson (1999) have previously studied the effect of cohort size on inequality. Because fat cohorts tend to get low rewards, when these fat cohorts lie at the top of the age-earnings curve, earnings inequality is reduced. When the fat cohorts are old or young adults, earnings inequality is augmented. Indeed, the mature cohort size is a powerful predictor of inequality across countries.

Note that by "fat cohorts" and "slim cohorts" I mean the relative size of the cohorts. When the mature cohort is fat, that is, when the relative size of the mature cohort is large, the earnings differential (the earnings gap between the mature cohort and the others) is reduced and hence earnings inequality is reduced.

You can view my paper here.

Posted by James Greiner at 7:00 AM

September 29, 2005

Near, Far, Wherever You Are

Sebastian Bauhoff

Tobler's First Law of Geography states that "everything is related to everything else, but near things are more related than distant things." Obviously there are many examples -- an infection is more likely to spread to a nearby person than to a far away one, a new highway might depress house prices for people living right next to it, and so on. The point is that there can be important dependencies and heterogeneities that vary with space, among other associations. And in those cases the usual assumptions that observations or errors are independently distributed don't hold. Urgh. Welcome to the world of spatial statistics.

As an estimation problem this is often addressed through clustering methods. Households in a village with some infected persons are at higher risks than households in neighboring villages. Or are they really? Clustering works when the locations are relatively homogenous and separated. What if there is no good way to classify observations into clusters, for example, if an area is evenly populated? Or if the infected household lives right at the end of the village road, and some neighbors are in the other village? The administrative boundaries commonly used for clustering (village name) might not properly account for the actual proximity or whatever defines the space between the observations. If a transmitting mosquito wouldn't care much about the village name when deciding who to bite next, why should an analyst rely on it?

Using clustering may often be a good approximation but in some cases it's not good enough and there can be substantial spatial lags (observations are spatially dependent), spatial errors (error terms are related) and spatial heterogeneity (model parameters vary across space). Those can lead to biased estimates, inefficient ones, or both. The bad news is that those effects can matter a lot. The good news is that there are methods to test for spatial dependence and correlation, and estimation techniques to deal with them.
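One common diagnostic for spatial dependence is Moran's I. The post does not name a specific test, so treat this as one illustrative choice rather than the method the author has in mind. The sketch below assumes observations laid out on a regular grid with "rook" neighbours (cells sharing an edge); `morans_i` and both toy grids are hypothetical.

```python
def morans_i(grid):
    """Moran's I on a rectangular grid with rook (edge-sharing) neighbours."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    dev = [[value - mean for value in row] for row in grid]

    cross_products, weight_total = 0.0, 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    # Binary weight of 1 for each ordered neighbour pair.
                    cross_products += dev[r][c] * dev[rr][cc]
                    weight_total += 1

    squared_devs = sum(d * d for row in dev for d in row)
    return (n / weight_total) * (cross_products / squared_devs)


# A smooth left-to-right gradient: neighbours resemble each other.
gradient = [[float(c) for c in range(6)] for _ in range(6)]
# A checkerboard: every neighbour is the opposite value.
checkerboard = [[1.0 if (r + c) % 2 == 0 else -1.0 for c in range(6)]
                for r in range(6)]

i_gradient = morans_i(gradient)
i_checker = morans_i(checkerboard)
print(i_gradient, i_checker)
```

Values well above zero suggest that nearby observations are alike, values near zero suggest no spatial pattern, and negative values suggest that neighbours are systematically dissimilar. A real analysis would use weights based on actual distances and a permutation or analytic test for significance.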

Of course the underlying interactions we are trying to better capture can be anything from linear to more complicated relations. It is unlikely that they are perfectly well described by any abstract spatial model, so we will still need to make assumptions. But at least there are some methods that can handle cases where the usual assumptions fail, and they can make an important difference to the analysis. I will write more about them in later blog entries. Meanwhile you might be interested in the following texts:

-- James LeSage's Econometrics Toolbox has an excellent workbook discussing spatial econometrics and examples for the MATLAB functions provided on the same site
-- Anselin (2002) "Under the Hood: Issues in the Specification and Interpretation of Spatial Regression Models" Agricultural Economics 27: 247-267 provides a quick overview of the issues
-- Anselin (1988) Spatial Econometrics: Methods and Models is the classic and widely quoted reference for spatial statistics

Posted by James Greiner at 6:00 AM

September 28, 2005

Extreme Values

Michael Kellermann

Every year, the host university of the Political Methodology conference invites a local scholar from some other discipline to share his or her research with the political science methods community. This year's special presentation, by James Elsner of the Florida State University Department of Geography, was sadly prescient. Professor Elsner's talk, "Bayesian Inference of Extremes: An Application in Modeling Coastal Hurricane Winds," applied extreme value theory in a Bayesian context to estimate the frequency with which hurricanes above a given strength make landfall in the United States. The devastating impact of Hurricane Katrina amply illustrates the importance of estimating maximum intensities; news reports suggest that as little as a foot or two of water overtopping the levees and eroding them from below may have caused the breaches that flooded New Orleans.

Extreme value theory provides a way to estimate the distribution of the maximum or minimum of a set of independent events. While this could be done directly if the distribution of the underlying events was known, in practice it is preferable to use the extremal types theorem to estimate the distribution of the maximum or minimum directly from data. The theorem states that, with appropriate transformations, the distribution of extreme values converges in the limit to one of three classes of distribution - Gumbel, Frechet, or Weibull - regardless of the shape of the underlying distribution.
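A small simulation makes the theorem concrete. This is a sketch under stated assumptions, not Professor Elsner's model: the underlying events are taken to be independent Exponential(1) draws, for which the maximum of n draws minus ln(n) is known to converge to the standard Gumbel distribution. The block size and number of blocks are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(1)
block_size, n_blocks = 500, 2000  # illustrative choices

# Take the maximum of each block and centre it by ln(n), the shift that
# applies to Exponential(1) data.
centred_maxima = [
    max(rng.expovariate(1.0) for _ in range(block_size)) - math.log(block_size)
    for _ in range(n_blocks)
]


def gumbel_cdf(x):
    """CDF of the standard Gumbel distribution."""
    return math.exp(-math.exp(-x))


empirical_mean = statistics.fmean(centred_maxima)
euler_gamma = 0.5772156649  # mean of the standard Gumbel distribution
share_below_one = sum(m <= 1.0 for m in centred_maxima) / n_blocks

print(f"mean of centred maxima: {empirical_mean:.3f} (Gumbel mean {euler_gamma:.3f})")
print(f"share <= 1: {share_below_one:.3f} (Gumbel CDF {gumbel_cdf(1.0):.3f})")
```

The empirical mean and the empirical share below 1 should both sit close to their Gumbel counterparts, even though no Gumbel distribution was used to generate the data.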

There are several challenges in estimating the distribution of extreme values. The three classes of limit distributions for extreme values have different behavior in the extreme tail: one family has a finite limit, while the other two have no limit but decay at different rates. To the extent that we are interested in "extreme" extremes, these differences could have substantive implications. Compounding this problem, observations in the extreme tail are likely to be sparse. Finally, one might expect that the quality of data is lower when extreme maxima or minima are occurring. Consider Katrina: most of the instrumentation for recording wind speeds, storm surge, and rainfall rates was knocked out well before the height of the storm. (Nor is this just a problem with weather phenomena; imagine trying to measure precisely daily price changes during a period of hyperinflation.) The Bayesian approach pursued in this work seems promising, as it allows the uncertainty in both the data itself and in the functional form to be modeled explicitly.

In talking with other grad students after the presentation, I think the consensus was that, while interesting methodologically and sobering substantively, it was hard to see how we would apply these methods in our own work. A quick Google search suggests that this approach is (not surprisingly) well established in financial economics, but not much else from the social sciences. With a little more time to reflect, however, I think that this may be more due to a lack of theoretical creativity on our part. Coming from the formal side of political science, I could see how thinking about extreme values might provide some insight into how political systems are knocked out of equilibrium, much like the levees in New Orleans.

Posted by James Greiner at 6:00 AM

September 27, 2005

Non-compliance, Bayesian Inference and School Vouchers: A Thesis Defense

Drew Thomas

Proceedings in the Harvard Dept. of Statistics seminar series started early this year, as Hui Jin eloquently delivered her doctoral thesis defense on Wednesday, September 14, entitled "Principal Stratification for Causal Inference with Extended Partial Compliance." Jin applied her ideas both to drug trials and to school choice (voucher) programs. She spoke in particular about the second application, focusing on a study of vouchers as offered to students from low-income families in the New York City public school system. In this study, 1000 students were offered a subsidy to help pay tuition for a private school of their choice, and were matched with students with similar conditions who were not offered the grant. Both groups were tracked for three years, and a set of tests at the beginning and end were used to measure achievement. The compliance factor was whether grant recipients would always take advantage of the offer, and whether unlucky ones would never make their own way to private school. While the compliance rate after three years remained high - roughly 80% - it was the compliance factor that proved to be the most instructive on the achievement pattern of students, a result found by stratifying the outcomes according to compliance patterns.

Those students expected to comply perfectly - attend private school with the grant and public school without it, in all three years - made the least improvement as compared to their colleagues in the other strata. Comparative performance improved with non-compliance; the biggest non-conformers, those who attended private or public school regardless of whether the grant was offered, showed the most improvement over their previous scores.

Notably, the reasons for this performance haven't been completely explained, though Prof. Rubin (Jin's advisor and collaborator on the project) suggests that perhaps using the voucher as a threat to remove a student from his friends may compel a higher performance at public school. Whatever the underlying mechanism, the results give strong and compelling reason to fully consider the effect of vouchers in the school system.

Posted by SSS Coauthors at 7:00 AM

September 26, 2005

Applied Statistics - Xihong Lin

Mike Kellermann

This week's Applied Statistics Workshop presentation will be given by Professor Xihong Lin of the Department of Biostatistics at the Harvard School of Public Health. Professor Lin received her Ph.D. in Biostatistics from the University of Washington. She is one of the newest members of the Harvard statistical community, having just moved to Harvard from the University of Michigan School of Public Health. She has published widely in journals including the American Journal of Epidemiology, Biometrika, and the Journal of the American Statistical Association. She currently serves as the co-ordinating editor of Biometrics. Among her other awards, she has been recognized as an outstanding young scholar by both the American Statistical Association and the American Public Health Association.

Professor Lin's presentation, "Causal Inference in Hybrid Intervention Trials Involving Treatment Choice," considers the problem of causal inference from experiments in which some subjects are allowed to choose the treatment that they receive. Allowing treatment choice may increase compliance levels, but creates inferential challenges not present in a fully randomized experiment. Professor Lin will discuss her approach to this problem on Wednesday, September 28 at noon in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.

Posted by SSS Coauthors at 11:53 AM

A multilevel analysis of the WVS/EVS data

Jong-Sung You

In my draft paper on the "correlates of social trust" (presented at the ASA conference, August 2005), I argued that fairness of a society such as freedom from corruption (fair administration of rules) and distributive fairness (relatively equal and unskewed distributions) affects the society's level of social trust more than its homogeneity does. Based on a multilevel analysis of data from the World Values Surveys (WVS, 1995-97, 2000-01) and the European Values Study (EVS, 1999), I found that corruption and inequality are significantly negatively associated with social trust controlling for individual-level factors and other country-level factors, while ethnic diversity loses significance once corruption or inequality is accounted for. Also, I found that the inequality effect is primarily due to the skewness of income rather than its simple heterogeneity, and that the negative effect of minority status is greater in more unequal and undemocratic societies.

The WVS and the EVS have been conducted in close cooperation with (almost) identical questions. The WVS (1995-97) covers 50 countries, and the WVS/EVS (1999-2001) covers 66 countries in all continents of the world. By pooling the 1995-97 data and the 1999-2001 data, I was able to increase the number of countries to 80. My literature review has unearthed few articles employing multilevel modeling in the comparative politics or sociology literatures. I suspect the scarcity of adequate multilevel data is one reason for this. Schofer and Fourcade-Gourinchas (2001) used the 1991 WVS in a multilevel analysis of the "structural contexts of civic engagement," but the country coverage was just 32. Although they had a lot of observations at the individual level, the relatively small N at the country level prevented them from including many explanatory variables at the country level. Now, with a relatively large number of countries, the WVS/EVS data seems to be an ideal dataset for which many interesting multilevel analyses can be conducted.

Since my draft is rough, I will welcome any comments, either methodological or substantive. You can find a draft here.

Posted by James Greiner at 7:00 AM

September 23, 2005

Cog Sci Conf

Amy Perfors

The annual meeting of the Conference of the Cognitive Science Society took place in late July. Amid a slew of interesting debates and symposia, one paper stood out as having particularly interesting implications from the methodological perspective. The paper, by Navarro et al., is called "Modeling individual differences with Dirichlet processes" (pdf found here).

The basic idea is that many questions in cognitive and social science hinge on identifying which items (subjects, features, datapoints) belong to which groups. The individual difference literature is replete with famous psychological theories along these lines: the factors contributing to IQ, the different "personality types", the styles of thought on this or that problem. In cognitive science specifically, the process of classification and categorization - arguably one of the more fundamental of the mind's capabilities - is basically equivalent to figuring out which items belong to which groups. Many existing approaches can capture different ways to assign subjects to groups, but in almost all of them the number of groups must be prespecified - an obvious (and large) limitation.

A Dirichlet process is a "rich-get-richer" process: as new items are seen, they are assigned to groups proportional to the size of the group, with some nonzero probability alpha of forming a new group. This naturally results in a power-law (Zipfian) distribution of items, which parallels the natural distribution of many things in the world. It also often seems to form groups that match human intuitions about the "best" way to split things up. Dirichlet process models, often used in Bayesian statistics, have been around in machine learning and some niches of cognitive science for at least a few years. However, the Navarro article is one of the first I'm aware of that (i) examines their potential in modeling individual differences, and (ii) attempts to make them more widely known to a general cognitive science audience.
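The "rich-get-richer" assignment rule can be sketched as the Chinese restaurant process, a standard way to simulate the groupings a Dirichlet process induces. This is an illustration, not the Navarro et al. model. It assumes the usual formulation in which item number i+1 joins an existing group with probability proportional to that group's size and starts a new group with probability alpha / (i + alpha). The function name is hypothetical.

```python
import random


def chinese_restaurant_process(n_items, alpha, seed=0):
    """Return the list of group sizes after assigning n_items sequentially."""
    rng = random.Random(seed)
    sizes = []
    for i in range(n_items):
        # i items are already seated, so total weight is i + alpha:
        # each existing group weighs its size, a new group weighs alpha.
        r = rng.random() * (i + alpha)
        cumulative = 0.0
        for k, size in enumerate(sizes):
            cumulative += size
            if r < cumulative:
                sizes[k] += 1
                break
        else:
            # r fell in the final alpha-sized slice: open a new group.
            sizes.append(1)
    return sizes


few_groups = chinese_restaurant_process(1000, alpha=1.0)
many_groups = chinese_restaurant_process(1000, alpha=10.0)
print(len(few_groups), sorted(few_groups, reverse=True)[:5])
print(len(many_groups), sorted(many_groups, reverse=True)[:5])
```

The number of groups is not fixed in advance; it grows slowly with the number of items, and a larger alpha tends to produce more groups. Group sizes are typically very uneven, with a few large groups and many small ones.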

It's exciting to see more advanced Bayesian statistical models of this sort poke their way into cognitive science. As I think about how useful these can be, I have some questions. For instance, Navarro et al.'s model gives a more principled mechanism for figuring out how many groups best fit a set of data, but the exact number of groups identified is still dependent on the alpha parameter. Is this a severe limitation? Also, the "rich-get-richer" process is intuitive and natural in many cases, but not all groups follow power-law distributions. How might we use models with other processes (e.g., Gaussian process models) to assign items to an unspecified number of groups in ways that don't yield power-law distributions? I think we've only started to scratch the surface of the uses of this type of model, and I'm eager to see what happens next.

Posted by James Greiner at 7:00 AM

September 22, 2005

The Two Levels of Cognitive Science

Amy Perfors

Our job as social scientists is to learn how to take data that reflects various aspects of how people and societies work, and then use that data to form abstract theories or models about the world. Different fields in social science look at different data, but we all share common methods and (I imagine) some common general questions. This blog is set up to allow our different disciplines to discuss our commonalities of method and approach, sharing insights from our respective fields.

Cognitive science is a bit unusual because the questions of method and approach are simultaneously relevant on two levels rather than one. In cognitive science, the object of study (the brain) must solve the same questions as the scientists themselves. In other words, just as the job of the cognitive scientist is to figure out how best to take data in the world and form models about the world, the job of the brain is to figure out how to take data in the world and form a model about the world. As a result, the issues that crop up again and again for scientists—which quantitative approaches "compress" data most effectively and fastest, when statistical or symbolic models capture the world best, and how much needs to be built into our models from the beginning—are the very issues the brain needs to solve as it is learning about the world. They are thus issues that the cognitive science world continually debates about on both levels: not only what works for us as scientists (and when), but what works for the brain itself (and when).

When I post here, therefore, I'll be constantly playing with these levels: I'll be talking about quantitative methods in social science not just from the perspective of the scientist (as will everyone else here), but also from the perspective of the mind (which I'm guessing most other people won't). In short, the questions we all struggle with in terms of methodology are the same questions cognitive scientists struggle with in terms of content. It's my hope that playing with these questions on two levels at once will be edifying, entertaining, and lots of fun. I think it will be.

Posted by James Greiner at 7:00 AM

September 21, 2005

More on Affirmative Action

Felix Elwert

It's well known that African American college students on average (repeat: on average) have lower SAT scores than white students (see Bowen and Bok's book The Shape of the River). Now here's something that annoys me: Every now and then, I run into somebody who takes this observation as evidence that affirmative action dilutes academic standards. Hello? Differences in mean SATs among accepted students have little or nothing to do with affirmative action!!

Consider this: SAT scores are roughly normally distributed among both blacks and whites but the distribution for blacks is shifted a bit to the left (lower mean). Now consider a college that will admit every candidate above a certain cut-off point (same cut-off for everybody). Under these circumstances the average SAT score of accepted black students would be lower than the average SAT score among accepted white students, even though the college has applied a uniform, race-blind admission standard. Why? Because the tail area of the white SAT distribution extends farther to the right of the cut-off point than the tail area of the distribution for blacks, whatever the reason. Upshot: racial differences in test scores in a student body don't reveal whether a school practices affirmative action and by themselves certainly don't betray "diluted standards." In addition, more or less the only way to create a student body where black and white students have the same average SAT score, given these race specific SAT distributions, would be to set drastically higher admissions standards for blacks than for whites - i.e. to discriminate against blacks. Surely, that wasn't the point?
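The argument can be checked numerically with the mean of a truncated normal distribution. The group means, standard deviation, and cut-off below are hypothetical numbers chosen only to illustrate the logic; they are not actual SAT statistics. `mean_above_cutoff` is a hypothetical helper.

```python
from statistics import NormalDist


def mean_above_cutoff(mu, sigma, cutoff):
    """Mean of a Normal(mu, sigma) score, conditional on exceeding the cutoff."""
    std_normal = NormalDist()
    z = (cutoff - mu) / sigma
    # Standard truncated-normal result: mu + sigma * pdf(z) / (1 - cdf(z)).
    return mu + sigma * std_normal.pdf(z) / (1.0 - std_normal.cdf(z))


sigma, cutoff = 200.0, 1200.0  # hypothetical values, same for both groups
mean_higher = mean_above_cutoff(1050.0, sigma, cutoff)  # group with higher mean
mean_lower = mean_above_cutoff(900.0, sigma, cutoff)    # group shifted to the left

print(f"admitted mean, higher-mean group: {mean_higher:.0f}")
print(f"admitted mean, lower-mean group:  {mean_lower:.0f}")
```

Both admitted groups clear the same cut-off, yet the admitted students from the left-shifted distribution still have a lower average score. The gap among admits is smaller than the gap in the full populations, but it does not vanish.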

(This observation comes to me via friends of UCLA's Thomas Kane. Kane is now moving to Harvard - thus moving this blog closer to the source.)

Posted by James Greiner at 7:00 AM

September 20, 2005

Misreading Racial Disparities - Beware Of Ratios of Percentages

Felix Elwert

It's fascinating how far you can get by taking a second look at the simplest statistics - in this case percentages and ratios. Case in point, James Scanlan's clever and unjustly ignored observation that African Americans will necessarily appear to be losing ground relative to whites even as their standing improves in absolute terms. (Actually, the argument holds for any inter-group comparison, not just race.) Scanlan shows that this is an artifact of measuring progress by focusing exclusively on ratios of percentages from dissimilar distributions. This insight raises the question of how best to measure progress. Here are some of Scanlan's examples.

Black-white differences in infant mortality: In 1983, 19.2 black infants but only 9.7 white infants died per 1000 births in each group. The resulting black-white ratio was 1.98. In 1997, infant mortality had decreased quite a bit, to 14.2 for blacks and 6.0 for whites. Note that in raw percentage terms, infant mortality had improved more for blacks than for whites. That should be good news, no? But, lo, now look at the black-white ratio in 1997 - it increased from 1.98 to 2.4. How can infant mortality have improved more for blacks than for whites in absolute terms at the same time as the relative position of blacks to whites has worsened?
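The arithmetic behind this example is easy to reproduce. The sketch below uses only the four rates quoted above. The survival-ratio comparison is added here as an illustration of the same phenomenon seen from the opposite outcome; it is not a figure taken from Scanlan's article.

```python
# Infant deaths per 1000 births, as quoted in the post.
rates = {
    1983: {"black": 19.2, "white": 9.7},
    1997: {"black": 14.2, "white": 6.0},
}


def mortality_ratio(year):
    """Black-white ratio of infant mortality rates."""
    return rates[year]["black"] / rates[year]["white"]


def survival_ratio(year):
    """Black-white ratio of infant survival rates (per 1000 births)."""
    return (1000 - rates[year]["black"]) / (1000 - rates[year]["white"])


drop_black = rates[1983]["black"] - rates[1997]["black"]
drop_white = rates[1983]["white"] - rates[1997]["white"]

print(f"absolute drop: black {drop_black:.1f}, white {drop_white:.1f}")
print(f"mortality ratio: {mortality_ratio(1983):.2f} -> {mortality_ratio(1997):.2f}")
print(f"survival ratio:  {survival_ratio(1983):.4f} -> {survival_ratio(1997):.4f}")
```

The absolute drop is larger for black infants and the mortality ratio still rises, while the ratio of survival rates moves closer to 1 over the same period. Which measure one picks determines whether the gap looks wider or narrower.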

Here's another example for the same underlying statistical phenomenon: Moving the income distributions of blacks and whites up by the same dollar amount relative to the poverty threshold would increase the racial disparity in poverty (because relatively more blacks suffer extreme poverty than whites)! Except for extreme circumstances, this will be true even if we boost black real incomes more than white real incomes. How can it be that helping blacks more than whites in absolute terms would worsen blacks' relative economic position?

Here's my favorite example - racial disparities in college acceptance rates. Suppose that college admissions are solely a function of SAT scores (as I'm told they essentially are for some large, selective state schools) and that the SAT distribution of black test takers equals that of whites except it's shifted to the left (as it is). Let the cut-off point for college acceptance be the same for blacks and whites (i.e. no affirmative action). Lowering the admission standard (for everybody) would then reduce the racial disparity in admission rates. That's good, no? But at the same time - and necessarily so - the lowering of admission standards would increase the racial disparity in rejection rates. That's bad, no? Huh?
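This example can be worked through with two normal distributions that differ only in their means. The size of the shift and the two cut-offs below are hypothetical and expressed in standard-deviation units; `disparities` is an illustrative helper, not anything from Scanlan's article.

```python
from statistics import NormalDist

group_a = NormalDist(mu=0.0, sigma=1.0)
group_b = NormalDist(mu=-0.5, sigma=1.0)  # same shape, shifted left (hypothetical)


def disparities(cutoff):
    """Return (admission-rate ratio a/b, rejection-rate ratio b/a) at a cutoff."""
    admit_a = 1 - group_a.cdf(cutoff)
    admit_b = 1 - group_b.cdf(cutoff)
    admission_ratio = admit_a / admit_b
    rejection_ratio = (1 - admit_b) / (1 - admit_a)
    return admission_ratio, rejection_ratio


strict = disparities(1.0)   # high admission standard
lenient = disparities(0.0)  # lowered standard, same for everybody

print(f"strict:  admission ratio {strict[0]:.2f}, rejection ratio {strict[1]:.2f}")
print(f"lenient: admission ratio {lenient[0]:.2f}, rejection ratio {lenient[1]:.2f}")
```

Lowering the common cut-off shrinks the ratio of admission rates and enlarges the ratio of rejection rates at the same time, which is exactly the apparent paradox described above.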

It turns out that seemingly straightforward comparisons of ratios of percentages may hide more than they tell (in these examples, with important policy implications). Interestingly, all three examples draw on the same statistical phenomenon. The secret lies in the funny shape of cdf-ratios from density functions that are shifted against each other. I plan to provide an intuitive explanation for this point once we've figured out how to post graphics on this blog. Until then, read James P. Scanlan's "Race and Mortality" in the Jan/Feb 2000 issue of Society.

Posted by James Greiner at 7:00 AM

September 19, 2005

Conference on Methods in Health Services and Outcomes in Boston, October 28-30

Sebastian Bauhoff

Boston will host the 2005 International Conference on Health Policy Research from October 28-30. This year's theme is "Methodological Issues in Health Services and Outcomes Research" and presentations are meant to convey both content and methodology.

The conference includes a slightly eclectic selection of workshops on methods and the use of well-known health datasets -- two workshops on the latter are free, others cost $60 or $30 (students). Registration is not free either but students pay only $80. Looks interesting and useful overall, though you might want to attend selectively.

For more info check the conference website.

Posted by SSS Coauthors at 10:50 PM

Applied Stats Seminar

Michael Kellermann

The Research Workshop in Applied Statistics brings together the statistical community at Harvard for a lively exchange of ideas. It is a forum for graduate students, faculty, and visiting scholars to present and discuss their work. We advertise the workshop as "a tour of Harvard's statistical innovations and applications," with weekly stops in different disciplines such as economics, epidemiology, medicine, political science, psychology, public policy, public health, sociology and statistics. The topics of papers presented in recent years include matching estimators, missing data, Bayesian simulation, sample selection, detecting biological attacks, imaging the Earth's interior, incumbency in primary elections, the effects of marriage on crime, and revealed preference rankings of universities.

One of the strengths of the workshop is its diverse group of faculty sponsors. This year's sponsors include Alberto Abadie (Kennedy School), Garrett Fitzmaurice (School of Public Health), Lee Fleming (Business School), Guido Imbens (Economics), Gary King (Government), Kevin Quinn (Government), James Robins (School of Public Health), Donald Rubin (Statistics), and Christopher Winship (Sociology). The workshop provides an excellent opportunity for informal interaction between graduate students and faculty.

The workshop meets Wednesdays during the academic year; lunch is provided. If you are interested, come to our organizational meeting on Wednesday, September 21 at noon in Room N354 at the Institute for Quantitative Social Science (IQSS is located on the 3rd Floor of CGIS North, 1737 Cambridge St., located behind the Design School). Course credit is available for students as an upper-level class in either Government or Sociology.

For more information, check out our website here. There you will find contact information, the schedule of presentations, and links to papers from previous presentations. We'll also be using this blog to announce speakers and to post reports from the workshop, so check back here often. We hope to see many of you there. If you have any questions, feel free to e-mail me at

Posted by James Greiner at 12:53 PM

Censoring Due to Death, cont'd, & A Visit To Harvard

Censoring, cont'd
John F. Friedman

Continuing from the most recent post, for the economist, perhaps a more interesting instance of this statistical problem is not researchers making this error within the literature but consumers making misjudgments in the marketplace. (Since most people approach problems in their lives with less rigor than a statistician, perhaps this is not surprising.) In particular, once consumers make these inference mistakes, economic theory suggests that firms will take advantage. Edward Glaeser wrote at length on this phenomenon in 2003 in "Psychology and the Market."

One classic example of this phenomenon - as specifically related to censoring by death - is the mutual fund industry. Most brochures for management companies aggressively tout the high past returns that have accumulated in their funds. Consumers then extrapolate these historical earnings into the future, usually choosing managers based on past performance. Of course, their reasoning is tainted by the same statistical problem; companies will shut down those mutual funds which have poor past performance, leaving only their winners for customers to admire. (Another problem with this line of reasoning is that there is virtually no evidence that strong past performance predicts strong future performance. In this sense, perhaps the greater error is to pay attention to past returns at all!) This problem is compounded in the market by the fact that any firm which attempts to educate consumers about their mistakes is unlikely to capture the value-added from that effort. The now-savvy consumers have no reason to invest at the firm that provided the information, and, even if they did, these firms make the most money from naive consumers rather than the smart ones, who would now make up the clientele. See David Laibson and Xavier Gabaix (2004) for more on this phenomenon. Since no firm has an incentive to educate the public, the entire industry becomes geared towards taking advantage of naive consumers, obfuscating costs, and selectively presenting information.
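The selection at work here is easy to see in a toy calculation. The sketch below uses made-up fund returns and an assumed closure rule (funds with negative past returns are shut down); neither comes from any real data, and the names are purely illustrative.

```python
# Hypothetical past returns for six funds at one company (illustrative numbers only).
past_returns = [0.12, 0.08, -0.05, -0.10, 0.03, 0.15]

# Assumed closure rule: the company shuts down any fund with a negative past return.
surviving_returns = [r for r in past_returns if r >= 0]


def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


# What the brochure shows (survivors only) versus what investors actually experienced.
brochure_average = average(surviving_returns)
true_average = average(past_returns)
```

With these numbers the brochure average is well above the average over all funds that ever existed, even though nothing about any single fund's record has been misstated.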

A Visit To Harvard

Anton Westveld (Visiting from University of Washington Statistics Department)

This past week I had the opportunity to visit with Kevin Quinn, one of my main Ph.D. advisors, at the Center for Government and International Studies at Harvard. Kevin and Gary King asked if I would provide a brief description of my recent visit.

I was fortunate enough to arrive in time to work in the new buildings for the Center. The new space has a modern design that is quite beautiful and utilitarian.

Currently we are working on developing statistical methodology for longitudinal social network data. Social network data consist of measured relations occurring from interactions within a set of actors. This type of data allows for the empirical investigation of the interconnectivity of the actors, which is a cornerstone of social science theory. The methodology focuses on data generated from the repeated interaction of pairs of actors, including temporal dyadic data resulting in an outcome for each actor at each time point (e.g. the level of exports from Canada to Japan in a given year). The methodology incorporates structure to account for correlation resulting from interactions as well as the repeated nature of the data. In particular, a random effects model is employed which accounts for five different types of network dependencies. These five dependencies are then correlated over time through the assumption that the random effects follow a weakly stationary process.

Kevin and I spent the last few days discussing appropriate methodology and writing C++ code. We also spent some time discussing the relationship between social network models and statistical game theory models, both of which seek to gain an understanding of social phenomena by examining social interaction data. Due to the Center’s collegial environment, I also had opportunities to discuss my work with Gary King and Jake Bowers.

Posted by James Greiner at 7:00 AM

September 18, 2005

Censoring Due to Death: A Statistics Symposium

Drew Thomas

The Harvard Dept. of Statistics kicks off its 2005-2006 seminar series on Monday, September 19 with a talk by the father of the Rubin Causal Model himself, Prof. Donald Rubin. An entertaining speaker if there ever was one, Prof. Rubin will give a firsthand account of his research to all who are interested.

The talk will be held in Science Center 705 at 4:00; a reception will follow. Looking forward to seeing all interested parties in attendance.

Posted by SSS Coauthors at 5:05 PM

September 16, 2005

Censoring Due to Death, cont'd

John F. Friedman

The problem of "censoring by death" also surfaces in a number of economic contexts. For instance, firms that go bankrupt as a result of poor corporate policies will not appear in many datasets, making any analysis of the impact of other financial events biased upwards. This problem has particularly plagued the literature on the impacts of corporate restructuring and leveraged buyouts (LBOs) of distressed firms. Since these firms are at high risk of failure by nature of their inclusion in the study in the first place, such firms exit the sample at high frequency, and the benefits of restructuring and LBOs may be overstated.

One can theoretically correct for this problem by modeling the ways in which the sample selection occurs, but these approaches have performed poorly in many economic settings due to the sensitivity of the results to the parametric assumptions of the econometric model. For instance, the "Heckman selection correction" - brought into Economics by Nobel laureate James Heckman in 1979 - models the death process as a first stage Probit based on observable characteristics. By estimating this first stage, one can correct for the lost observations. Bob LaLonde (1986) later tested this model by comparing the results from a job training study with random assignment to the results one would have gotten had one used Heckman's method on the treated group. Though the selection correction performed better than many alternative methods, such as matching or differences-in-differences, the estimates were rather imprecise and confidence intervals mismeasured. In this case, the problem is the joint assumption of normality and selection entirely on observables. Though more flexible models have come into Economics in recent years - the Propensity Score, for instance – these too have proven sensitive to the particular model properties in many applications.

Though perhaps an old-fashioned solution, the studies in economics that best avoid this problem have simply endeavored to correct for the sample selection problem by collecting otherwise unavailable data on firm deaths in the sample. These samples are often smaller, permitting less broad analysis, but effectively mitigate the selection by death problem.

Posted by James Greiner at 7:00 AM

September 15, 2005

Censoring Due to Death

D. James Greiner

I'm interested in the problem of "censoring due to death" within the framework of the Rubin Causal Model ("RCM").

As readers will know, the RCM is a framework for studying the effects of causes in which the science is represented via a set of potential outcomes for each unit. (A potential outcome is the value the dependent variable would take on if the treatment variable had a certain value, whether or not the treatment variable actually had that value). An assignment mechanism decides what treatment (e.g., active treatment or control) a unit receives and thus which potential outcome will be observed. Unit-level causal effects are defined as the difference in the potential outcomes of some quantity of interest. The fundamental problem of causal inference is that we can observe at most one potential outcome for each unit. Unobserved potential outcomes are treated as missing data. Observational studies are analyzed as "broken" randomized experiments, broken in the sense that the assignment mechanism was not recorded and therefore must be reconstructed in some approximate way. For a more complete discussion, see Holland, P.W. (1986). Statistics and Causal Inference. Journal of the American Statistical Association 81: 945--960.

Censoring or truncation due to death occurs when some units' failure to comply with a post-treatment condition renders their values of the quantity of interest undefined. Consider for example a medical study designed to assess the effect of a new cancer treatment on the percentage of patients who survive cancer-free for ten years. Suppose some individuals die from car accidents or drug overdoses or other causes clearly unrelated to cancer before the ten-year time period has elapsed. Such individuals do not have a value for ten-year cancer-free survival, so their values of the quantity of interest are undefined. (The problem here is not that these individuals' values for cancer-free survival are missing data; rather, the problem is that they have no such values.) Under such circumstances, some quantitative analysts simply remove such individuals from the study and analyze the remainder. This course of action can bias results in several different ways. To illustrate one such way, it could be that individuals who die from non-cancer related causes might smoke, have less healthy diets, refuse to wear seat belts, or otherwise engage in more risky behavior than many of the other individuals in the study. If the treatment is effective in warding off cancer, there could be more deaths unrelated to cancer in the treated group than the control group, because some treated group members survive cancer that would otherwise have killed them long enough to be felled by, for example, car accidents, before ten years are up. This difference could render comparison of the units remaining in the treated and control groups an inappropriate method of assessing the effect of the treatment.

The key is to realize that a comparison of ten-year cancer-free survival rates only makes sense for units who would not die from causes unrelated to cancer if assigned treatment AND who would not die from causes unrelated to cancer if assigned control. Thus, removing individuals who died from causes unrelated to cancer is not enough.

The remaining group actually assigned control may include some units who would have died from non-cancer causes if they had been assigned treatment, and the remaining group actually assigned treatment may have some units who would have died from non-cancer causes had they been assigned control. The researcher must take appropriate steps to remove both sets of people from the study, so as to isolate the set of individuals who would not die from causes unrelated to cancer regardless of treatment assignment. Junni Zhang (Peking University) and Don Rubin (Harvard University) discuss these issues in "Estimation of Causal Effects Via Principal Stratification when Some Outcomes Are Truncated by 'Death,'" (2003). Journal of Educational and Behavioral Statistics 28:353-368. They extend them in a forthcoming paper with Fabrizia Mealli (University of Florence) currently entitled "Evaluating Causal Effects in the Presence of 'Truncation by Death' -Likelihood-based Analysis via Principal Stratification."
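A toy calculation may make the point concrete. The sketch below is not the Zhang and Rubin estimator; it assumes, unrealistically, that we can see both potential outcomes for every unit, which is exactly what real data never allow. All names and numbers are hypothetical. It compares the "drop the unrelated deaths and compare what is left" analysis with the effect among units who would avoid unrelated death under both assignments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    # Would the unit avoid death from causes unrelated to cancer under each assignment?
    no_unrelated_death_if_control: bool
    no_unrelated_death_if_treated: bool
    # Ten-year cancer-free survival (1/0); None means the value is undefined.
    y_if_control: Optional[int]
    y_if_treated: Optional[int]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def survivors_only_difference(units):
    """What a large randomized study would show after dropping unrelated deaths."""
    treated = mean(u.y_if_treated for u in units if u.no_unrelated_death_if_treated)
    control = mean(u.y_if_control for u in units if u.no_unrelated_death_if_control)
    return treated - control


def always_survivor_effect(units):
    """Effect among units whose outcome is defined under both assignments."""
    stratum = [
        u for u in units
        if u.no_unrelated_death_if_control and u.no_unrelated_death_if_treated
    ]
    return mean(u.y_if_treated - u.y_if_control for u in stratum)


# Four units with defined outcomes either way, plus two units who would die of
# cancer under control (y = 0) but, if treated, survive cancer long enough to
# die in, say, a car accident (outcome undefined under treatment).
units = (
    [Unit(True, True, y0, y1) for y0, y1 in [(1, 1), (0, 1), (1, 1), (0, 0)]]
    + [Unit(True, False, 0, None)] * 2
)

naive = survivors_only_difference(units)
targeted = always_survivor_effect(units)
```

In this made-up population the survivors-only comparison overstates the effect, because the control group still contains two units who have no counterpart in the treated group.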

Posted by James Greiner at 7:00 AM

September 14, 2005

Math Camp

Michael Kellermann

When I was an undergrad, the first political science class that I took was taught by the late A.F.K. Organski. At one point, someone asked him what advice he would give to freshmen interested in political science as a major. "Take as many math courses as you can," he said with his inimitable accent. I'm pretty sure that this was not the advice that most people wanted to hear, and that it was honored more in the breach than the observance, but it was sound advice nonetheless.

In keeping with this idea, several Harvard programs offer short math refresher courses for incoming graduate students, including Government, Economics, and the Kennedy School. The Gov Department's "math (p)re-fresher" is held during the first two weeks of September. We cover calculus, probability, linear algebra, and a bit of optimization theory, along with an introduction to some of the software (R, Xemacs, and Latex) that we use in the department's methods courses. All told, it is a quick review of about five semesters worth of undergraduate math courses in the span of ten days. As you might imagine, there is considerable variation in the amount of "pre-freshing" versus "re-freshing" that goes on in the course.

I'm curious about the prevalence of these kinds of "math camp" courses in the social sciences. I only know of a few others in political science, but I get the sense that they are more common in economics. Are there any sociology math camps out there? Psychology? Public health? If you have a math camp, I'd be interested in taking a look at your syllabus. Comments should be enabled.

Posted by James Greiner at 7:00 AM

September 13, 2005

Pol Meth Conf IV

Dan Hopkins, G4, Government (guest author)

Continuing with the discussion of papers presented at the recent Political Methodology Conference, Kevin Quinn and Arthur Spirling's paper begins with the problem of identifying legislators' preferences in conditions of strict party discipline. To tackle this challenge, they applied a Dirichlet process mixture model and presented some interesting results about the intra-party groups observed in the British House of Commons. They backed up the groupings recovered from the model with significant qualitative work, and showed how qualitative and quantitative work of this kind can go hand in hand. At the same time, the discussant, Andrew Martin, raised a valuable question: how does this method relate to other analyses of grouping/clustering? I am curious about this question as well.

James Honaker's paper tackled a question of substantive importance: what is the role of economic conditions in triggering sectarian violence? Honaker analyzed all available data, far more than anyone previously, and used a creative combination of ecological inference and multiple imputation to estimate the impact of the Protestant-Catholic unemployment ratio on a monthly basis. His substantive result was that this ratio matters: as the gap between Protestant and Catholic employment grows, so too does the risk of violence. One questioner suggested that we might want to instrument for unemployment, since unemployment could be endogenous to violence. Honaker responded that unemployment in Northern Ireland tracks unemployment in comparable cities elsewhere. This paper struck me as, among other things, a powerful (if implicit) rebuttal to those who argue that one should never attempt ecological inferences. The question Honaker addressed is one scholars have already tried to answer - sometimes with counter-intuitive results - suggesting that we may not be able to simply wait for perfect, individual-level data.

Kosuke Imai presented co-authored work on an Internet experiment in Japan. As with the Jackman et al. paper, this work presented a single Bayesian model that dealt with 1) the problem of non-compliance; 2) the problem of non-response; and 3) estimated causal effects. The methods were compelling, although the data were less cooperative: almost no statistically significant treatment effects emerged. That result seems to fit with our priors: the experiment directed Japanese Internet users, presumably a relatively well-informed group, to click on a webpage containing party manifestos during the Upper House election. The fact that we are selecting our sample based on a set of covariates might help explain why the covariates are (at least individually) relatively helpless in predicting compliance. As with the Bowers and Hansen paper, I hope that the authors make their statistical code public and easily adapted to other applications, as these tools are well-suited to analyzing a wide range of randomized experiments.

David Epstein presented a joint paper with Sharyn O'Halloran that argued for using higher-dimension Markov models (that is, Markov models with more than two states) to model transitions to and from autocracy/democracy. The substantive argument: adding a third category of "partial democracy" helps us see that economic growth matters both for transitioning to democracy and for staying there. Discussant Jeff Gill and others questioned the appropriateness of the basic Markovian assumption (that the probability of transition conditional on the current state equals the probability of transition conditional on the current state and all previous states) and suggested exploring a higher-order Markov model (that is, a model that allows previous states to influence present transition probabilities). I agree with their suggestion, but my question is more basic: if we have polity scores that are continuous on an interval, how much information is thrown away by transforming these scores into three discrete states? I have not seen the data, so I also wonder if these three states emerge naturally from it. In other words, how much would this analysis change if we redefined autocracy or democracy by a few polity points?
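To make the discretization question concrete, here is a minimal sketch of how one might cut a score into three states and estimate a first-order transition matrix by counting. It is not Epstein and O'Halloran's model (which includes covariates such as growth); the cutoffs, scores, and function names are all made up for illustration.

```python
from collections import Counter

STATES = ("autocracy", "partial", "democracy")


def discretize(score, low_cut=-5.0, high_cut=5.0):
    """Map a continuous score to one of three states (cutoffs are hypothetical)."""
    if score <= low_cut:
        return "autocracy"
    if score >= high_cut:
        return "democracy"
    return "partial"


def transition_matrix(paths):
    """First-order estimate: share of moves from each state to each other state."""
    counts = {s: Counter() for s in STATES}
    for path in paths:
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    matrix = {}
    for s in STATES:
        total = sum(counts[s].values())
        matrix[s] = {t: (counts[s][t] / total if total else 0.0) for t in STATES}
    return matrix


# Made-up yearly scores for two countries.
scores = [[-8, -7, 2, 8, 9], [-6, 0, 1, 6, 6]]
paths = [[discretize(x) for x in country] for country in scores]
P = transition_matrix(paths)
```

Re-running this with, say, `high_cut=7.0` reclassifies the second country's last two years as partial democracy and changes the estimated matrix, which is the kind of sensitivity check the question above has in mind.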

Posted by James Greiner at 7:00 AM

September 12, 2005

Pol Meth Conf III, & GOV 2000

Pol Meth Conf III
Dan Hopkins, G4, Government (guest author)

Continuing the discussion of the recent Political Methodology Conference, throughout its first two days the notion of the conference as the "Second Annual Conference on Matching" was a running joke, and definitely a fair joke, although the two matching papers were, well, matched by two ideal point papers. So on to ideal points. Michael Bailey's paper tackled an important problem: because major figures across the different institutions of the federal government are faced with different policy decisions, it is hard to make statements about how their preferences relate. Is the Supreme Court to the left of Congress? How would today's court rule on famous decisions from the past? Bailey's paper sought to extend ideal points across institutions, using such things as public statements and the court briefs of the Solicitor General to compare the ideal points of not just justices but of members of all three branches of the federal government. Bailey argued, for example, that if the first Bush administration filed a brief in support of a certain side in a court case, we could use that filing to put Bush in the same space as Chief Justice Rehnquist. Bailey used the same sort of logic to extend ideal points back in time, focusing on statements about preferences (for instance, Clarence Thomas's statement that Roe was wrongly decided) to allow figures from different time periods to be placed on the same scale. Especially impressive was the data collection effort this project entails, as the author tracked down public statements from a wide range of figures.

One of the challenges of making these kinds of cross-institutional inferences, though, is that we need to implicitly assume non-strategic behavior. Needing to build a majority of five, justices in the Supreme Court face a task distinct from that of the President—or from that of the average member of the House. These strategic contexts will undoubtedly affect politicians' decisions: Presidents have little incentive to make public statements that put them at odds with the majority of Americans, even if those statements reflect their preferences. Also, if Presidents (or others in the system) are selective about the subjects of their commentary, we might wind up with a biased idea of where they actually stand. Still, Bailey provided quite a neat paper, one that provides useful tools for tracking inter-institutional dynamics. The substantive results were also very interesting, with the median ideal point of the Court almost always between that of the House and the Senate.

The next ideal point paper came from Simon Jackman, Matthew Levendusky, and Jeremy Pope. Here, the goal was to estimate the baseline propensity of a Congressional district to support Democratic or Republican candidates—although much of the Q&A was taken up by questions about whether this was best thought of as the "natural vote" or something else. The authors emphasized that measurement and structural modeling go hand-in-hand because inaccurate measurement may well bias the structural estimate of quantities like the incumbency advantage. They also pointed out that in this field we are content with rough proxies of district tendencies despite the fact that in other areas we demand much more precision in our measurements. Jackman, Levendusky, and Pope's model was a Bayesian hierarchical ideal point model that draws on information about both Congressional and Presidential results to make inferences about districts' underlying partisan preferences.

For me, one provocative result from this paper was that the discrimination parameter (that is, the impact of the covariates on the estimated vote share) increased over the decades. In other words, demographic characteristics are becoming increasingly effective predictors of districts' preferences. I would love to see the authors try to get at exactly why that is. One possibility, which Levendusky mentioned in making his presentation, is redistricting: politicians get better at picking their constituents, districts become more homogeneous, and so district-level demographics become better predictors of aggregate vote choices. To test this theory, one might re-estimate the model without the least populous states (because such states have less potential for gerrymandering; consider Wyoming: no gerrymandering there). Another possibility is that the electorate is sorting itself into more politically homogeneous groups, something one might test in a preliminary way by running the model separately for high-mobility and low-mobility districts. The Census gives data on how many people have lived in the same house for their entire lives, data that could help with these questions.

GOV 2000
Kevin Quinn

This fall I am teaching GOV 2000 Quantitative Methods for Political Science I. This course is also offered for credit through Harvard's distance learning program as GOVT E-2000. GOV 2000 is the first course in the Department of Government's methodology sequence and it is designed to introduce students to statistical modeling with emphasis on least squares linear regression. Although we will not ignore the theory underlying the linear model, much of the course will focus on practical issues that arise when working with regression models. Topics covered in the course include: data visualization, statistical inference for the linear model, assessing model adequacy, when is a regression model a causal model, dealing with leverage points and outliers, robust regression, and methods for capturing nonlinearities. We will also be working with real social science datasets throughout the course. For more information, please visit the course website here.

Posted by James Greiner at 7:00 AM

September 9, 2005

Pol Meth Conf II

Dan Hopkins, G4, Government (guest author)

Continuing with the matching theme on which I ended the post of two days ago, Alexis Diamond and Jas Sekhon presented a paper on genetic matching that claimed to be a significant improvement on past approaches. One of the challenges of matching is to weight each of the covariates so as to produce the optimal set of matches. Genetic matching uses a genetic algorithm to search across the set of possible weight matrices to find the weight matrix that minimizes some loss function. Of course, what exactly that loss function should be is debatable. In Rawlsian fashion, Diamond and Sekhon argued that it should be to maximize the p-value of the most unbalanced covariate, and Sekhon's software (link here) does exactly that. In some applications, one could certainly imagine other loss functions; seeking the best possible balance on the most unbalanced covariate could jeopardize the overall balance, a libertarian sort of rebuttal. The discussion of the paper also raised the question of whether using a p-value is the right criterion. If the algorithm is comparing p-values from samples with different sizes, for instance, it could disproportionately favor a smaller sample.
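The balance criterion itself is simple to write down. The sketch below is not Sekhon's software and does no genetic search; it only scores candidate matched control groups by the p-value of their worst-balanced covariate and picks the candidate with the largest such value. The p-value uses a normal approximation to a two-sample difference in means, which is an assumption made here for self-containment, and all data and names are hypothetical.

```python
import math
from statistics import mean, variance


def two_sample_p_value(treated, control):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    if se == 0:
        return 1.0 if mean(treated) == mean(control) else 0.0
    z = (mean(treated) - mean(control)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def worst_balance(treated_rows, control_rows):
    """Smallest p-value across covariates: the most unbalanced covariate."""
    n_covariates = len(treated_rows[0])
    return min(
        two_sample_p_value(
            [row[j] for row in treated_rows],
            [row[j] for row in control_rows],
        )
        for j in range(n_covariates)
    )


def pick_best(treated_rows, candidate_controls):
    """Maximin rule: choose the candidate whose worst covariate is best balanced."""
    return max(candidate_controls, key=lambda c: worst_balance(treated_rows, c))


# Two covariates per unit; two hypothetical candidate matched control groups.
treated = [(30, 1.0), (40, 2.0), (50, 3.0)]
candidate_a = [(31, 1.0), (41, 2.0), (49, 3.0)]  # slightly off on both, badly off on none
candidate_b = [(30, 1.0), (40, 2.0), (50, 6.0)]  # perfect on the first, poor on the second

best = pick_best(treated, [candidate_a, candidate_b])
```

Here the rule prefers the first candidate even though the second is perfectly balanced on one covariate, which illustrates both the Rawlsian logic and the objection that overall balance might be traded away.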

Despite the questions, I buy Diamond and Sekhon's argument. Genetic matching makes effective use of computing power to search across a high-dimensional space for the most balanced sample that the data can provide. In cases where there is insufficient overlap on covariates, data analysts will know this quickly rather than devoting weeks to Holy Grail-style quests for optimal matches. And in cases where there is sufficient overlap on the covariates to make causal inferences, data analysts will be far more certain that they have attained the best possible balance—again, subject to the constraints about the loss function.

Posted by James Greiner at 5:37 PM

September 8, 2005

State Failure

This article in World Politics on forecasting state failure that Langche Zeng (who by the way is moving this week from GW to UCSD) and I wrote a few years ago seems relevant to what is presently happening in New Orleans. Here are the opening sentences of the article: "'State failure' refers to the complete or partial collapse of state authority, such as occurred in Somalia and Bosnia. Failed states have little political authority or ability to impose the rule of law [on its citizens]." We normally associate state failure with foreign countries you would not want to visit, but with a third of the New Orleans police force not showing up for work, with the two-thirds that remained barricaded in their homes or police stations, with corpses strewn around the streets from the hurricane and some murders, and where a policeman today "joked that if you wanted to kill someone here, this was a good time" (see today's NY Times Article), it is hard to see how New Orleans this past week was anything but the definition of state failure.

Our article was about some methodological errors we found in the U.S. State Failure Task Force's forecasts and methods of forecasting. They had selected data via a case-control design (i.e., selecting on their dependent variable all examples of state failure and a random sample of nonfailures), which can save an enormous amount of work in data collection, but it is only valid if you properly correct. The Task Force didn't correct and so, for example, their forecast for Brazil failing was reported at 0.72 but their model, correctly interpreted, indicated that it was only 0.11; their reported forecast for Somalia failing was 0.45, but the model actually indicated that it was only 0.04. We also improved their methods and thus forecasting success over their corrected models via neural network methods and some other approaches. They also collected one of the best data sets on the subject, which you might want to use.
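For readers who have not seen this kind of correction, here is a minimal sketch of one standard version for logit models, often called prior correction: the probability fitted on the case-control sample is converted to odds, and the odds are rescaled by the ratio of population to sample failure odds (equivalently, the logit intercept is shifted). The function and argument names are mine, the inputs are hypothetical, and nothing here attempts to reproduce the Brazil or Somalia figures above.

```python
def prior_corrected_probability(p_sample, sample_fraction, population_fraction):
    """Rescale a logit prediction fitted on a case-control sample.

    p_sample: predicted probability from the model fit to the selected sample
    sample_fraction: share of failures in the case-control sample
    population_fraction: share of failures in the population (assumed known)
    """
    odds = p_sample / (1.0 - p_sample)
    # Ratio of population failure odds to sample failure odds.
    correction = (population_fraction / (1.0 - population_fraction)) * (
        (1.0 - sample_fraction) / sample_fraction
    )
    corrected_odds = odds * correction
    return corrected_odds / (1.0 + corrected_odds)


# Hypothetical inputs: a sample that is half failures, a population that is 10% failures.
corrected = prior_corrected_probability(0.5, sample_fraction=0.5, population_fraction=0.1)
```

The direction of the adjustment is the point: when failures are heavily over-represented in the sample, uncorrected probabilities are far too high.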

The charter of the U.S. State Failure Task Force prohibits it from discussing state failure in the U.S. or making forecasts of U.S. state failure, but by their definitions, there is little doubt that for a time anyway all relevant governmental authorities in the U.S. suffered a "complete or partial collapse of state authority" and so the U.S. would seem to fit that definition. I haven't checked, but I doubt their model or ours had any ability to forecast these events.

Posted by Gary King at 8:59 AM

September 7, 2005

Pol Meth Conf I

The 14 papers presented at the 2005 Conference of the Society for Political Methodology, held July 21st-July 23rd in heavily air-conditioned rooms at Florida State, provided plenty of good fodder for discussion. I will focus on several I found especially provocative--and on which I could reasonably comment--in blog posts over the next few days.

Starting off the conference, Gary King's presentation of his paper "Death by Survey: Estimating Adult Mortality without Selection Bias" with Emmanuela Gakidou argued that we need to take a new approach to estimating death rates in the many countries that do not have vital registration systems. The dominant approach at present assumes that larger families do not have differing mortality rates, but given the uneven pace of development in so many countries, that seems a heroic assumption, and their paper shows it is completely wrong empirically. King and Gakidou's approach involves two fixes: first, weighting to deal with the over-sampling of families with more surviving children during the observation period (since samples are drawn in proportion to survivors rather than those alive at the start of the period), and second, extrapolation to deal with the fact that families with no surviving children are entirely excluded from the sample. The first problem is fixed exactly by weighting; the second requires assumptions beyond the range of the data. Some discussion focused on one of the main challenges to this second fix: that it involves extrapolation, extrapolation based on a small number of data points, and extrapolation based on a quadratic term. The paper deals with the danger of extrapolation through repetition in different data: By showing that the relationship between mortality and the number of siblings is constant in its shape across a wide range of countries, King and Gakidou argue that we can be reasonably confident about the fit of the curve from which we are extrapolating. The authors are now gathering survey data to replicate this approach in cases where we know the answer--that is, where we also have accurate, non-survey data on mortality rates. That is especially critical since families without any surviving children might be disproportionately the victims of wars or other violence, for instance, making it challenging to use data about families with surviving children to make inferences about families without any surviving children.
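The logic of the first fix can be shown with a toy example. This is only a sketch of the general idea of weighting each respondent by the inverse of the number of surviving siblings who could have been sampled; it is not King and Gakidou's estimator, and the families and function names are invented.

```python
def naive_death_share(reports):
    """Pool every respondent's report as if each family appeared once."""
    deaths = sum(born - alive for born, alive in reports)
    siblings = sum(born for born, _ in reports)
    return deaths / siblings


def weighted_death_share(reports):
    """Weight each report by 1 / survivors, so each family counts once."""
    deaths = sum((born - alive) / alive for born, alive in reports)
    siblings = sum(born / alive for born, alive in reports)
    return deaths / siblings


# Each report is (siblings born, siblings alive at the survey), given by a survivor.
# Family one: 4 born, 4 alive, so it can be sampled 4 times.
# Family two: 4 born, 1 alive, so it can be sampled once.
reports = [(4, 4)] * 4 + [(4, 1)]

naive = naive_death_share(reports)
weighted = weighted_death_share(reports)
```

Across the two families, 3 of 8 siblings died, which the weighted version recovers (0.375) while the naive version reports 0.15. A family with no survivors would never produce a report at all, so no weight can bring it back; that is where the extrapolation comes in.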

Another early paper came from Kevin Clarke, who argued that political science as a discipline has become too worried about omitted variable bias. Clarke took another look at the familiar theoretical omitted variable bias result and pointed out that contrary to conventional wisdom, including additional variables can, under certain circumstances, exacerbate problems of omitted variable bias. The circumstances are that something else is wrong: that is, omitting a variable that is causally prior to and correlated with the treatment variable and affects the outcome variable will bias inferences in a predictable direction, and including it will reduce bias -- but only when other modeling assumptions are correct. If you have five things wrong with your model and you fix four, it is at least possible that you can make things worse.

In my view, the sociology of the discipline embedded in Clarke's presentation was right on. In substantive presentations, it is incredibly common for presenters to be barraged with questions of this form: "Did you account for [insert favorite variable]? What about [insert second favorite variable]? Or maybe [insert random variable that no one before or since has ever heard of]?" Reviewers, too, seem to find this an easy way to respond to articles. One way to deal with this problem--sensitivity tests--was highlighted during the discussion. Our models are almost never perfectly specified, so there will always be omitted variables, and knowing how those variables would need to look to overturn a result is a good (if incomplete) start toward dealing with the problem. One example of how these kinds of sensitivity analyses might work, by the way, is David Harding's 2003 "Counterfactual Models of Neighborhood Effects: The Effect of Neighborhood Poverty on Dropping Out and Teenage Pregnancy" (American Journal of Sociology 109(3): 676-719). Another is Paul Rosenbaum and Don Rubin's 1983 "Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome" (Journal of the Royal Statistical Society, Series B 45: 212-218).
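The simplest version of such a sensitivity test can be sketched with the textbook omitted variable bias formula for a linear model with one treatment variable. This is my own illustration and not the method of either paper cited above. The function names and the 0.30 estimate are hypothetical. Here gamma is the assumed effect of the omitted variable on the outcome, and delta is the assumed slope from regressing the omitted variable on the treatment, so the implied bias is gamma * delta.

```python
def adjusted_estimate(estimate, gamma, delta):
    """Estimate after subtracting the bias gamma * delta implied by an omitted variable.

    gamma: assumed effect of the omitted variable on the outcome.
    delta: assumed slope from regressing the omitted variable on the treatment.
    """
    return estimate - gamma * delta

def gamma_to_overturn(estimate, delta):
    """Effect on the outcome the omitted variable would need to drive the estimate to zero."""
    return estimate / delta

estimate = 0.30  # hypothetical treatment coefficient, not taken from any paper

# For each assumed imbalance delta, report how strong the confounder would have to be.
for delta in (0.1, 0.25, 0.5, 1.0):
    needed = gamma_to_overturn(estimate, delta)
    print(f"delta = {delta:.2f}: gamma would need to reach {needed:.2f}")
```

A table like this lets a questioner's favorite variable be evaluated on the merits: is it plausible that it is both that imbalanced across treatment and that strongly related to the outcome?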

One other case against overly saturated models, one that did not come up in the discussion but that is probably familiar to many, is the challenge of thinking in terms of conditional effects as the number of variables increases. For instance, if we think about vote choice as our dependent variable, I understand what it means to talk about the impact of income conditional on race, but it is much harder to know what it means to talk about the impact of income conditional on ten other, inter-correlated variables. This problem becomes all the more difficult when we remember that we are conditioning not just on the inclusion of certain variables but also on the functional form specified for them.

Because I am a Harvard graduate student, I should also play to type and say something briefly about how matching (which these days is well-represented in hallway conversations at IQSS) relates to omitted variables. Obviously, it is no panacea, as unobserved confounders can be just as troublesome as in the case of more conventional models. But there is one way in which matching adds value here. In cases where we are matching observations of units for which we have information not quantified in our dataset, looking at the list of matched pairs can help identify the omitted variable. If, say, we are studying countries, and see that our observed variables wind up pairing Ethiopia and Greenland, we can use that pairing to think through what kinds of unobserved variables might be potential confounders.
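For readers who have not seen what "looking at the list of matched pairs" involves, here is a minimal sketch. It assumes nearest-neighbor matching with replacement on standardized covariates, which is only one of many matching methods, and the unit names and covariate values are invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical units with two observed covariates each; names and values are made up.
treated = {"T1": (1.0, 10.0), "T2": (5.0, 50.0)}
controls = {"C1": (1.2, 12.0), "C2": (4.5, 48.0), "C3": (9.0, 90.0)}

def standardize(groups):
    """Rescale each covariate to mean 0 and sd 1 across all units in all groups."""
    rows = [v for g in groups for v in g.values()]
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [
        {k: tuple((x - m) / s for x, m, s in zip(v, mus, sds)) for k, v in g.items()}
        for g in groups
    ]

def nearest_control(t_cov, control_covs):
    """Name of the control unit closest to t_cov in squared Euclidean distance."""
    return min(
        control_covs,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(t_cov, control_covs[name])),
    )

treated_std, controls_std = standardize([treated, controls])

# Match each treated unit to its nearest control (controls may be reused).
pairs = [(name, nearest_control(c, controls_std)) for name, c in treated_std.items()]

# The printed list is what a researcher would read through for implausible pairings.
for t, c in pairs:
    print(f"{t} matched to {c}")
```

With real, named units, any pairing that looks absurd on substantive grounds is a hint about what the observed covariates are failing to capture.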

Dan Hopkins, G4, Government (guest author)

Posted by James Greiner at 5:11 PM