
Authors' Committee


Matt Blackwell (Gov)


Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship








June 4, 2012

Post-processing graphics in Adobe Illustrator

A while ago Rich posted about chartsnthings, the blog written by Kevin Quealy of the New York Times graphics department. One thing that stood out to both Rich and me was that a couple of the NYT folks do sketches in R and then post-process the sketches with Adobe Illustrator. A good example is Kevin's post on Mariano Rivera's career.

Today, on the R-help mailing list, Michael Friendly pointed to a tutorial on the practicals of how to, you know, actually do post-processing with Illustrator.

Posted by Matt Blackwell at 10:29 AM | Comments (7)

November 2, 2011

Privacy, Statistics, and the Debate over the Regulation of Social Science Research

(This is a guest post by Dr. Micah Altman, who is a Senior Research Scientist and Director of Data Archiving and Acquisitions at IQSS.)

The U.S. Office for Human Research Protections (OHRP) proposed a set of sweeping changes to the federal regulations that govern research involving human subjects (the “Common Rule”), in the form of an Advance Notice of Proposed Rule Making (ANPRM), and solicited comments from investigators, Institutional Review Boards (IRBs), and any other interested parties by October 26, 2011. The ANPRM posed 75 questions; among these were many that implicated the collection, storage, de-identification and distribution of information about individual research subjects, as well as major questions about the types and nature of exempt and minimal risk research. Together these proposed changes could have a huge effect on the conduct of social science research and on the sharing of research results.

There have been over 1100 comments submitted on the proposed changes. The Data Privacy Lab, which is run by Latanya Sweeney and which has now joined IQSS, organized a series of seminars at Harvard on the proposed changes, resulting in two responses being submitted. One response was drafted by the lab, joined by about 50 data privacy researchers, and supported by two national privacy groups. A second, complementary response was led by Salil Vadhan, Joseph Professor of Computer Science and Applied Mathematics, and former Faculty Director of the Center for Research on Computation and Society of Computer Science at Harvard, and joined by academics and researchers.

Among other issues, these responses draw attention to the key role of data sharing in social science research, and to the statistical and computational advances that have fundamentally changed both the analysis of informational risks and the opportunities available to use statistical methods to disclose data safely.

For example, the proposed HIPAA privacy rule is implicitly tailored to traditional microdata, as Vadhan et al.’s response points out: “[T]here is increased interest in collecting and analyzing data sets that are not in the traditional microdata form. For example, social network data involves relationships between individuals. A “friendship” relationship or contact between two individuals on a social network does not entirely “belong” to either individual’s record; the relationship can have privacy implications for both parties. While this change from data about individuals to data about pairs may seem innocuous, it makes the task of anonymization much more difficult and one cannot expect standards developed for traditional microdata, like HIPAA, to apply.”

The response then goes on to highlight how advances in statistical and computational methods can provide access to confidential data safely through dynamic interactive mechanisms for tabulation, visualizations, and general statistical analysis; multiparty computation; and synthetic data generation. In many circumstances these techniques can yield both better privacy protections and better research utility than traditional “de-identification” techniques such as removing and generalizing fields.

Sweeney et al.’s response goes on to comment on the systemic issues, incentive problems, and policy issues associated with the proposed changes, most importantly:

First, that “The proposed ban on re-identification would drive re-identification methods further into hidden, commercial activities and deprive the public, the research community and policy makers of knowledge about re-identification risks and potential harms to the public.”

And second, that the proposed policy provides no incentive to develop or use statistical and computational methods that would improve both privacy and research utility of data sharing: “[T]here needs to be a channel for NCHS, NIST or a professional data privacy body to operationalize research results so that real-world data sharing decisions rely on the latest guidelines and best practices.”

The DPL has collected these responses, along with the related responses from Harvard University, major privacy groups, and social science research associations.

Posted by Matt Blackwell at 5:26 PM

September 27, 2011

Cross Validated

Cross Validated is a question-and-answer site dedicated to statistics and statistical computing. It is part of the completely awesome Stack Exchange network of Q/A sites which I rely on heavily when coding. The questions range from the fairly straightforward (“How can a regression be significant yet all predictors be non-significant?”) to the fairly complicated (“What’s the difference between principal components analysis and multidimensional scaling?”) to the fairly abstract (“How are we defining ‘reproducible research’?”).

There is a ton of interesting instruction being done there. I have heard others scoff at the idea of giving away your expertise for free on sites like these, but I think that Cross Validated and other sites are crucial and vibrant places for students to learn about statistical methods. And this is a chance to actually help people in a concrete way.

Also, check out their community blog, which promises to have great little tidbits, mostly focused on R.

Posted by Matt Blackwell at 9:17 AM

July 19, 2011

Detecting (edit) wars

A fun modeling project from a group of physicists on Edit wars in Wikipedia:

We present a new, efficient method for automatically detecting severe conflicts `edit wars’ in Wikipedia and evaluate this method on six different language WPs. We discuss how the number of edits, reverts, the length of discussions, the burstiness of edits and reverts deviate in such pages from those following the general workflow, and argue that earlier work has significantly over-estimated the contentiousness of the Wikipedia editing process.

Burstiness is new to me and appears to be popular in studying communication networks. Bursty (?) processes have many occurrences in short intervals, with long gaps between these bursts. Where is the burstiness in the social sciences? Would it measure anything interesting?
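One commonly used way to put a number on this idea is the coefficient B = (sd - mean) / (sd + mean), computed on the gaps between consecutive events. It is -1 for perfectly regular events, near 0 for a Poisson process, and approaches 1 as events become more clustered. The following is a minimal sketch, not the method from the paper; the function name `burstiness` and both sets of event times are made up for illustration.

```r
# Burstiness coefficient of a sequence of event times (hypothetical helper).
# B = (sd - mean) / (sd + mean), computed on the gaps between events.
burstiness <- function(event_times) {
  gaps <- diff(sort(event_times))
  s <- sd(gaps)
  m <- mean(gaps)
  (s - m) / (s + m)
}

# Perfectly regular events: every gap is identical, so sd = 0 and B = -1.
regular <- seq(0, 100, by = 5)
burstiness(regular)

# Made-up bursty events: tight clusters separated by long silences.
bursty <- c(0, 0.1, 0.2, 0.3, 50, 50.1, 50.2, 50.3, 120, 120.1, 120.2)
burstiness(bursty)
```

In a social science setting the event times could be, say, timestamps of protests or bill introductions, which is one way to start asking whether the measure picks up anything interesting.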

Also, maybe I don’t look much at the acknowledgements from other disciplines, but I have never seen such a clear delineation of work contributed:

Sumi developed the main classifier, Yasseri the temporal profiles, and Rung selected the seed examples and generated the data for supervised training. Kornai advised on multilingual natural language processing and statistical analysis. Kertesz designed the research, advised on temporal studies and on modeling.

Posted by Matt Blackwell at 9:40 PM

May 26, 2011

Google Correlate

Google Correlate is a new service from Google that allows you to analyze temporal or spatial correlations between search terms. Could be incredibly interesting data in here for various fields. The methodology from the whitepaper gives an insight as to how Google does these correlations at scale:

In our Approximate Nearest Neighbor (ANN) system, we achieve a good balance of precision and speed by using a two-pass hash-based system. In the first pass, we compute an approximate distance from the target series to a hash of each series in our database. In the second pass, we compute the exact distance function on the top results returned from the first pass.
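The two-pass idea in that passage can be sketched on a toy scale. The code below is only an illustration under loud assumptions: the database is simulated noise, and the "hash" is a stand-in (block averages of a standardized series, in a hypothetical helper called `sketch`), not whatever hash function Google actually uses. The first pass ranks every series by a cheap approximate distance; the second computes exact correlations for the shortlist only.

```r
set.seed(1)

# Made-up database: 200 weekly series of length 52, one per row.
n_series <- 200
n_weeks  <- 52
db <- matrix(rnorm(n_series * n_weeks), nrow = n_series)

# Target series: a noisy copy of row 17, so row 17 should be the best match.
target <- db[17, ] + rnorm(n_weeks, sd = 0.1)

# Stand-in "hash": standardize a series, then average it within blocks of
# 13 weeks. This is an illustrative compression, not Google's hash.
sketch <- function(x, block = 13) {
  z <- (x - mean(x)) / sd(x)
  colMeans(matrix(z, nrow = block))
}

# Pass 1: approximate distance from the target's sketch to every stored sketch.
db_sketch   <- t(apply(db, 1, sketch))
approx_dist <- sqrt(rowSums(sweep(db_sketch, 2, sketch(target))^2))
candidates  <- order(approx_dist)[1:20]

# Pass 2: exact correlation, computed only for the shortlisted candidates.
exact_cor <- apply(db[candidates, , drop = FALSE], 1, cor, y = target)
best <- candidates[which.max(exact_cor)]
best
```

The trade-off is that the first pass touches every series but only in compressed form, while the expensive exact comparison runs on 20 rows instead of 200.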

I tried my hand with “Barack Obama”, where most of the action comes from the South, the Rust Belt, and the Eastern Seaboard:

Spatial distribution of the search "Barack Obama"

Compare that to the map for the search phrase “Barack Hussein Obama”:

Spatial distribution of the search "Barack Hussein Obama"

Here you see a much more distinct Appalachian pattern emerging. This map looks similar to others that highlight where John McCain in 2008 did better than George Bush in 2004. The kind of search terms with high spatial correlation with the two are also fascinating. For “Barack Obama” you see mostly references to African Americans or African American culture:

Items with high spatial correlation with the search "Barack Obama"

For “Barack Hussein Obama”, there is quite the hodgepodge, with references to “obama koran” and “obama the antichrist”:

Items with high spatial correlation with the search "Barack Hussein Obama"

Obviously, we wouldn’t want to read too much into these comparisons due to the ecological inference problems, but there is a lot to explore here.

Posted by Matt Blackwell at 11:19 AM

April 25, 2011

App Stats: Lauderdale on "There Are Many Median Justices on the Supreme Court"

We hope that you can join us for the final Applied Statistics Workshop of the year this Wednesday, April 27th when we will be happy to have Benjamin Lauderdale, currently a College Fellow in the Department of Government, Harvard University and soon to be at the London School of Economics. You will find an abstract below. As always, we will serve a light lunch and the talk will begin around 12:15p.

“There Are Many Median Justices on the Supreme Court”
Benjamin Lauderdale
Department of Government, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, April 27th, 2011 12 noon


While unidimensional preference estimates for the U.S. Supreme Court exist in both constant and time-varying forms, estimating variation in preferences across areas of the law has been difficult because multidimensional scaling models perform poorly with only nine voters. We introduce a new approach to recovering estimates of judicial preferences that are localized to particular legal issues as well as periods of time. Using expert issue area codes and majority opinion citations to identify the strength of substantive relationships between cases, we apply a kernel-weighted optimal classification estimator to analyze how justices’ preferences vary across both areas of the law and time. Allowing for issue-variation in preferences improves the predictive power of estimated preference orderings more than allowing for time-variation. We find substantial variation in the identity of the median justice across areas of the law during most periods of the modern court, suggesting a need to reconsider empirical and theoretical research that hinges on the existence of a unitary and well-identified median justice.

Posted by Matt Blackwell at 9:26 AM

April 18, 2011

Lewis on "The Compactness of Congressional Districts"

We hope that you can join us for the penultimate Applied Statistics Workshop of the year this Wednesday, April 20th. This week we are extremely excited to have Jeffrey Lewis, Associate Professor of Political Science at UCLA, presenting on the compactness of congressional districts, a topic that involves some interesting econometric issues as well as a large GIS component. Note that this is a change from what is on the schedule. As usual, we will start the workshop at 12 noon with a light lunch and begin the talk at 12:15. We wrap up the workshop at 1:30pm.

“A study of Congressional district compactness, 1789-2011”
Jeffrey B. Lewis
Department of Political Science, UCLA
CGIS K354 (1737 Cambridge St)
Wednesday, April 20th, 12 noon.

Posted by Matt Blackwell at 9:36 AM

April 13, 2011

A Cure for the Regex Headache

Whenever I do a little data cleaning with a scripting language, I always find myself struggling with regular expressions. Now a new site, txt2re, allows you to figure out the regular expression you want from some sample text:

This system acts as a regular expression generator. Instead of trying to build the regular expression, you start off with the string that you want to search. You paste this into the site, click submit and the site finds recognisable patterns in your string. You then select the patterns that you are interested in and it writes a fully fledged program that extracts those patterns from that string. You then copy the program into your editor or IDE and play with it to integrate it into your program.
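For a sense of what the generated extraction code amounts to, here is a minimal hand-written sketch in R. It is not output from txt2re; the sample string, the pattern, and the variable names are all made up for illustration. The pattern uses three capture groups, and `regexec()` with `regmatches()` pulls each group back out as a string.

```r
# Made-up log line of the kind one might paste into a regex generator.
x <- "03/Apr/2011 10:03:22 user=jdoe status=200"

# Three capture groups: the date, the user name, and the status code.
pattern <- "^([0-9]{2}/[A-Za-z]{3}/[0-9]{4}) .* user=([[:alnum:]_]+) status=([0-9]{3})$"

# regexec() finds the full match and each group;
# regmatches() turns those positions back into strings.
parts <- regmatches(x, regexec(pattern, x))[[1]]

log_date <- parts[2]
user     <- parts[3]
status   <- as.integer(parts[4])
```

The first element of `parts` is the full match, which is why the groups start at index 2.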

(via kottke)

Posted by Matt Blackwell at 10:03 PM

EM Galore

Statistical Science has a new issue out dedicated to the EM Algorithm, entitled “Celebrating the EM Algorithm’s Quandunciacentennial”. David Van Dyk and Xiao-Li Meng are the guest editors. Here is the abstract from their (awesome looking) contribution, “Cross-Fertilizing Strategies for Better EM Mountain Climbing and DA Field Exploration: A Graphical Guide Book”:

In recent years, a variety of extensions and refinements have been developed for data augmentation based model fitting routines. These developments aim to extend the application, improve the speed and/or simplify the implementation of data augmentation methods, such as the deterministic EM algorithm for mode finding and stochastic Gibbs sampler and other auxiliary-variable based methods for posterior sampling. In this overview article we graphically illustrate and compare a number of these extensions, all of which aim to maintain the simplicity and computation stability of their predecessors. We particularly emphasize the usefulness of identifying similarities between the deterministic and stochastic counterparts as we seek more efficient computational strategies. We also demonstrate the applicability of data augmentation methods for handling complex models with highly hierarchical structure, using a high-energy high-resolution spectral imaging model for data from satellite telescopes, such as the Chandra X-ray Observatory.

You can find most of the papers on arXiv, using a simple search. A quick Google search for “Quandunciacentennial” yields no hits. Anyone know the etymology there? Some reference to 34 that I’m missing?

Posted by Matt Blackwell at 9:15 AM

April 11, 2011

App Stats: Perry on "Point process modeling for directed interaction networks"

We hope that you can join us for the Applied Statistics Workshop this Wednesday, April 13th, 2011 when we will be happy to finally have Patrick Perry from the Statistics and Information Sciences Laboratory. This is a talk rescheduled from earlier in the term when the weather was much worse. You will find an abstract for the paper below. As always, we will serve a light lunch and the talk will begin around 12:15p.

“Point process modeling for directed interaction networks”
Patrick Perry
Statistics and Information Sciences Laboratory
CGIS K354 (1737 Cambridge St.)
Wednesday, April 13th, 2011 12 noon


Network data often take the form of repeated interactions between senders and receivers tabulated over time. Rather than reducing these data to binary ties, a model is introduced for treating directed interactions as a multivariate point process: a Cox multiplicative intensity model using covariates that depend on the history of the process. Consistency and asymptotic normality are proved for the resulting partial-likelihood-based estimators under suitable regularity conditions, and an efficient fitting procedure is described. Multicast interactions—those involving a single sender but multiple receivers—are treated explicitly. A motivating data example shows the effects of reciprocation and group-level preferences on message sending behavior in a corporate e-mail network.

Posted by Matt Blackwell at 9:48 AM

April 7, 2011


From @ajreeves comes an article and corresponding paper on how political information spreads through the media. From the paper’s abstract:

We study the dynamics of public media attention by monitoring the content of online blogs. Social and media events can be traced by the propagation of word frequencies of related keywords. Media events are classified as exogenous - where blogging activity is triggered by an external news item - or endogenous where word frequencies build up within a blogging community without external influences. We show that word occurrences show statistical similarities to earthquakes. The size distribution of media events follows a Gutenberg-Richter law, the dynamics of media attention before and after the media event follows Omori’s law. We present further empirical evidence that for media events of endogenous origin the overall public reception of the event is correlated with the behavior of word frequencies at the beginning of the event, and is to a certain degree predictable. These results may imply that the process of opinion formation in a human society might be related to effects known from excitable media.

Social is the new physical? For more on the distribution of earthquakes and aftershocks, see Gutenberg-Richter law and aftershocks.

Posted by Matt Blackwell at 11:48 AM

April 4, 2011

App Stats: Mandel on "Hierarchical Bayesian Models for Supernova Light Curves"

We are really excited about this week’s Applied Statistics Workshop on Wednesday, April 6th, 2011, when we will be happy to have Kaisey Mandel from the Harvard-Smithsonian Center for Astrophysics. Kaisey will be presenting on hierarchical Bayesian models in Astrophysics. This will be a great chance to see how the statistical methods that we use transport to other disciplines around the sciences. No prior knowledge of astrophysics required! As always, we will serve a light lunch and the talk will begin around 12:15p.

“Hierarchical Bayesian Models for Type Ia Supernova Light Curves, Dust, and Cosmic Distances”
Kaisey Mandel
Harvard-Smithsonian Center for Astrophysics
CGIS K354 (1737 Cambridge St.)
Wednesday, April 6th, 2011 12 noon


Type Ia supernovae (SN Ia) are the most precise cosmological distance indicators and are important for measuring the acceleration of the Universe and the properties of dark energy. To obtain the best distance estimates, the photometric time series (apparent light curves) of SN Ia at multiple wavelengths must be properly modeled. The observed data result from multiple random and uncertain effects, such as measurement error, host galaxy dust extinction and reddening, peculiar velocities, and distances. Furthermore, the intrinsic, absolute light curves of SN Ia differ between individual events: different SN Ia have different intrinsic luminosities, colors and light curve shapes, and these properties are correlated in the population. A hierarchical Bayesian model provides a natural statistical framework for coherently accounting for these multiple random effects while fitting individual SN Ia and the population distribution. I will discuss the application of this statistical model to optical and near-infrared data for computing inferences about the dust, distances and intrinsic covariance structure of SN Ia. Using this model, I demonstrate that the combination of optical and NIR data improves the precision of SN Ia distance predictions by about a factor of 2 compared to using optical data alone. Finally, I will discuss some open research problems concerning statistical analysis of supernova data and their application to cosmology.

Posted by Matt Blackwell at 4:07 PM

March 28, 2011

App Stats: McKay on "A New Foundation for the Multinomial Logit Model"

We hope that you can join us for the Applied Statistics Workshop this Wednesday, March 30th, 2011 when we will be happy to have Alisdair McKay from the Department of Economics at Boston University. You will find an abstract for the paper below. As always, we will serve a light lunch and the talk will begin around 12:15p.

"Rational Inattention to Discrete Choices: A New Foundation for the Multinomial Logit Model"
Alisdair McKay
Department of Economics, Boston University
CGIS K354 (1737 Cambridge St.)
Wednesday, March 30th, 2011, 12 noon


Often, individuals must choose among discrete alternatives with imperfect information about their values, such as selecting a job candidate, a vehicle or a university. Before choosing, they may have an opportunity to study the options, but doing so is costly. This costly information acquisition creates new choices such as the number of and types of questions to ask the job candidates. We model these situations using the tools of the rational inattention approach to information frictions (Sims, 2003). We find that the decision maker's optimal strategy results in choosing probabilistically exactly in line with the multinomial logit model. This provides a new interpretation for a workhorse model of discrete choice theory. We also study cases for which the multinomial logit is not applicable, in particular when two options are duplicates. In such cases, our model generates a generalization of the logit formula, which is free of the limitations of the standard logit.

Posted by Matt Blackwell at 9:39 AM

March 21, 2011

Chaney on "Economic Shocks, Religion and Political Influence"

We hope you can join us at the Applied Statistics Workshop this Wednesday, March 23rd, when we are excited to have Eric Chaney from the Department of Economics here at Harvard. Eric will be presenting his paper entitled “Revolt on the Nile: Economic Shocks, Religion and Political Influence.” You’ll find an abstract below. As usual, we will begin at 12 noon with a light lunch and wrap up by 1:30pm.

“Revolt on the Nile: Economic Shocks, Religion and Political Influence”
Eric Chaney
Department of Economics, Harvard University
Wednesday, March 23rd, 12 noon
CGIS Knafel 354 (1737 Cambridge St)


Can religious leaders use their popular influence to political ends? This paper explores this question using over 700 years of Nile flood data. Results show that deviant Nile floods were related to significant decreases in the probability of change of the highest-ranking religious authority. Qualitative evidence suggests this decrease reflects an increase in political power stemming from famine-induced surges in the religious authority’s control over popular support. Additional empirical results support this interpretation by linking the observed probability decrease to the number of individuals a religious authority could influence. The paper concludes that the results provide empirical support for theories suggesting religion as a determinant of institutional outcomes.

Posted by Matt Blackwell at 3:37 PM

March 7, 2011

Rubin on "Are Job-Training Programs Effective?"

We hope you can join us at the Applied Statistics Workshop this Wednesday, March 9th, when we are excited to have Don Rubin, the John L. Loeb Professor of Statistics here at Harvard University, who will be presenting recent work on job-training programs. You will find an abstract below. As usual, we will begin with a light lunch at 12 noon, with the presentation starting at 12:15p and wrapping up by 1:30p.

“Are Job-Training Programs Effective?”
Don Rubin
John L. Loeb Professor of Statistics, Harvard University
Wednesday, March 9th 12:00pm - 1:30pm
CGIS Knafel K354 (1737 Cambridge St)


In recent years, job-training programs have become more important in many developed countries with rising unemployment. It is widely accepted that the best way to evaluate such programs is to conduct randomized experiments. With these, among a group of people who indicate that they want job-training, some are randomly assigned to be offered the training and the others are denied such offers, at least initially. Then, according to a well-defined protocol, outcomes, such as employment statuses or wages for those who are employed, are measured for those who were offered the training and compared to the same outcomes for those who were not offered the training. Despite the high cost of these experiments, their results can be difficult to interpret because of inevitable complications when doing experiments with humans. In particular, some people do not comply with their assigned treatment, others drop out of the experiment before outcomes can be measured, and others who stay in the experiment are not employed, and thus their wages are not cleanly defined. Statistical analyses of such data can lead to important policy decisions, and yet the analyses typically deal with only one or two of these complications, which may obfuscate subtle effects. An analysis that simultaneously deals with all three complications generally provides more accurate conclusions, which may affect policy decisions. A specific example will be used to illustrate essential ideas that need to be considered when examining such data. Mathematical details will not be pursued.

Posted by Matt Blackwell at 10:20 AM

Machine Learning Tutorials

Andrew Moore has a fairly long list of tutorials on various topics in Machine Learning and Statistics. Here is the description:

The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.

These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and cased-based (aka non-parametric) learning. They include regression algorithms such as multivariate polynomial regression, MARS, Locally Weighted Regression, GMDH and neural nets. And they include other data mining operations such as clustering (mixture models, k-means and hierarchical), Bayesian networks and Reinforcement Learning.

There is a little modesty in the description here. The slides that I have looked at do a great job motivating the methods using intuition, which is often hugely lacking.

Posted by Matt Blackwell at 9:50 AM

March 1, 2011

Michel and Lieberman Aiden on "Quantitative Analysis of Culture Using Millions of Digitized Books"

We hope that you can join us for the Applied Statistics Workshop tomorrow, March 2nd when we will be happy to have Jean-Baptiste Michel (Postdoctoral Fellow, Department of Psychology) and Erez Lieberman Aiden (Harvard Society of Fellows). You will find an abstract below. As always, we will serve a light lunch and the talk will begin around 12:15p.

“Quantitative Analysis of Culture Using Millions of Digitized Books”
Jean-Baptiste Michel and Erez Lieberman Aiden
CGIS K354 (1737 Cambridge St.)
Wednesday, March 2nd 12 noon


We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

Posted by Matt Blackwell at 8:20 PM


I’ve been a fairly long-time emacs/ESS user, but there’s a new IDE for R called RStudio that has a lot of potential. At the very least, it is a huge improvement over the standard R GUI (both on the Mac and Windows). Strangely, though, there are some emacs-like commands (cut everything from the cursor to the end of the line) that are available in the Console, but not the source editor. Organizing the figures, help, workspace, and history is just great, though.

Posted by Matt Blackwell at 1:44 PM

February 9, 2011

Tingley on "A Statistical Method for Empirical Testing of Competing Theories"

Just a note about the Applied Statistics Workshop today, February 9th, where we are excited to have Dustin Tingley from the Department of Government here at Harvard presenting joint work with Kosuke Imai entitled “A Statistical Method for Empirical Testing of Competing Theories”. As usual, the workshop will begin with a light lunch at 12 noon, followed by the presentation at 12:15.


Empirical testing of competing theories lies at the heart of social science research. We demonstrate that a very general and well-known class of statistical models, called finite mixture models, provides an effective way of rival theory testing. In the proposed framework, each observation is assumed to be generated from a statistical model implied by one of the theories under consideration. Researchers can then estimate the probability that a specific observation is consistent with either of the competing theories. By directly modeling this probability with the characteristics of observations, one can also determine the conditions under which a particular theory applies. We discuss a principled way to identify a list of observations that are statistically significantly consistent with each theory. Finally, we propose several measures of the overall performance of a particular theory. We illustrate the advantages of our method by applying it to an influential study on trade policy preferences.

Posted by Matt Blackwell at 7:53 AM

February 4, 2011

Crayola Colors in R

Kottke observes that the whole list of Crayola colors and their hex codes is on Wikipedia, which got me thinking that it might be useful to have some colors to spruce up R graphics. So, I went ahead and created a convenient crayola vector to access all 133 standard Crayola colors. Here are all the colors in R:


And here is a simple example of how to use it:

hist(rnorm(1000), col = crayola["Granny Smith Apple"], border = "white", yaxt = "n")
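The call above assumes a named character vector called `crayola` already exists in the workspace. A minimal sketch of what such a vector might look like follows; it is a three-color stand-in rather than the full 133-color vector, and the hex codes are illustrative and should be checked against the Wikipedia list before use.

```r
# A tiny stand-in for the full vector: names are Crayola color names,
# values are hex codes (illustrative; verify against the Wikipedia list).
crayola <- c(
  "Granny Smith Apple" = "#A8E4A0",
  "Cerulean"           = "#1DACD6",
  "Razzmatazz"         = "#E3256B"
)

# Look a color up by name, exactly as in the hist() call above.
crayola["Granny Smith Apple"]

# Draw labelled swatches to preview the palette.
barplot(rep(1, length(crayola)), col = crayola, names.arg = names(crayola),
        border = NA, axes = FALSE)
```

Because the vector is indexed by name, a misspelled color returns `NA`, which R graphics functions generally treat as "no fill", so it is worth checking names against `names(crayola)`.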


Always remember, folks, playing with colors is a dangerous game. Use discretion.

Posted by Matt Blackwell at 10:38 AM

January 24, 2011

Sen on "How Having Daughters Affect Judges' Voting"

We hope that you can join us for the first Applied Statistics Workshop of the term this Wednesday, January 26th when we will be happy to have Maya Sen from the Department of Government. She will be presenting joint work with Adam Glynn, also in the Department of Government. You will find an abstract below. As always, we will serve a light lunch and the talk will begin around 12:15p. (Note that this talk is not on the website schedule yet due to technical issues.)

“Female Socialization: How Having Daughters Affect Judges’ Voting on Women’s Issues” (with Adam Glynn)
Maya Sen
Department of Government
CGIS K354 (1737 Cambridge St.)
Wednesday, January 26th 12 noon


Social scientists have long maintained that women judges might behave differently than their male colleagues (e.g., Boyd et al. (2010)). This is particularly true when it comes to highly charged social issues such as gender discrimination, sexual harassment, and the status of gender as a suspect classification under federal law. Less studied has been the role that a judge’s family might have on judicial decision making. For example, we may think that a male judge with daughters might have different views of gender discrimination and sexual harassment than a male judge without any daughters. This paper takes a look at the question causally by leveraging the hypothesis that, conditional on the total number of children, the probability of a judge having a boy or a girl is independent of any covariates (Washington 2008). Looking at data from the U.S. Courts of Appeals, we find that conditional on the number of children, judges with daughters consistently vote in a more liberal fashion on gender issues than judges without daughters. This effect is particularly strong among Republican appointed judges and is robust and persists even once we control for a wide variety of factors. Our results more broadly suggest that personal experiences, as distinct from partisanship, may influence how elite actors make decisions, but only in the context of substantively salient issues.

UPDATE (1/25/11): Corrected a typo in the abstract. Judges with daughters become more liberal on gender issues, not more conservative.

Posted by Matt Blackwell at 1:39 PM

January 5, 2011

ESP and Bayes at the Times

You say you wanted an update on that ESP paper where a professor of psychology “time-reversed” some classic experiments? The New York Times has you covered. Want to see more discussion of null hypotheses and Bayesian analysis in the NYT? Also covered:

Many statisticians say that conventional social-science techniques for analyzing data make an assumption that is disingenuous and ultimately self-deceiving: that researchers know nothing about the probability of the so-called null hypothesis.

The so-called null hypothesis? Take that, Fisher! Oh, there’s more:

Instead, these statisticians prefer a technique called Bayesian analysis, which seeks to determine whether the outcome of a particular experiment “changes the odds that a hypothesis is true”…

Also, the last paragraph of the story seems very relevant:

So far, at least three efforts to replicate the experiments have failed.

Posted by Matt Blackwell at 10:57 PM

The PhD Glut, Tenure, and Meritocracy

The Economist has an article that has made the rounds on the Monkey Cage and with Drew Conway. The article bemoans the glut of PhD degrees granted in the U.S. and cries oversupply. It points to the rise of post-docs and adjuncts as cheap labor in the education and research markets. While the article does paint a perhaps overly bleak portrait of the current academic environment and the potential for those who leave it, I think the reaction has been a little strange. Since it’s slightly off-topic, I’ll banish most of my argument below the fold.

From Josh Tucker at the Monkey Cage:

Like it or not, academia is a meritocracy. It may be a highly flawed meritocracy susceptible to overvaluing labels or fads of the day, but ultimately tenure is bestowed on those who earn the respect of their peers, and the more of your peers that respect you, the more job offers you are going to get and the more money you are going to make.

And yet, tenure and meritocracy are goals at odds. The average age of professors when they received tenure is 39 in the United States, meaning that many will spend roughly half of their academic life with tenure. I don’t think it is a stretch to say that academic merit ceases to be a relevant criterion for employment after tenure. Of course, it matters for further professional success, but that is not what’s on the table. The most recent generation of scholars is likely to live quite long and productive lives, meaning tenure looms large for budgets.

To bring it back to sports, baseball teams do not hire players for life. The certainty of tenured faculty positions shifts most of the uncertainty in the academic labor market to graduate students and junior faculty. Does this mean we should reduce the number of PhD candidates we admit? Not necessarily, but we should not ignore the fundamental cracks in the academic system that we have seen in the last economic downturn. The academy (and especially the way we teach) will change drastically in the next 10-20 years. We would be wise to be ahead of those changes instead of trying to catch up.

We could, though, try to reduce the information asymmetries that exist for potential PhD students. But it is hard to deny this sentence from the Economist article:

The interests of academics and universities on the one hand and PhD students on the other are not well aligned.

Though, even when they are aligned you see problems due to selection effects: professors who are dispensing advice are the ones who made it. Very few potential graduate students understand the market before they decide to earn a PhD. This might actually be getting better due to (of all things!) the often insipid and almost always anonymous job-rumour mongering forums that have cropped up for various disciplines. Even when people are attempting to give honest advice, they fail to anticipate impending problems. When I applied to graduate school 5 years ago, people made the job market sound like the land of milk and honey. You just go pick your job off the tree! No one anticipated the crash in the number of available jobs, or at least they did not reveal this information to me. This problem exists even more strongly for prospective law students, as they are unaware of what life is like at top law firms and of the likelihood of receiving such jobs. Yet law students have more paths available to them after they graduate compared to PhDs. The paths to success in academia are fairly limited to the traditional model.

And lastly, this story from Drew:

On my first day of graduate school one of my professors said, “Congratulations on being accepted to the program. While most people will not understand it, you have one of the greatest jobs on the planet. People are going to pay you to think, and I think that is pretty cool.”

I heard this numerous times as well and may have even said it myself. And yet, I think there is an inherent tension between this idea and Josh’s assertion that we succeed by convincing others of our ideas. The most surprising thing I have learned in graduate school, embarrassingly enough, is that we are not in the business of creating ideas; we are in the business of selling ideas and, in some sense, selling ourselves. I think potential PhD students should know this.

Posted by Matt Blackwell at 11:33 AM

December 2, 2010

How Random are Marginal Election Outcomes?


Dan Carpenter came by the workshop yesterday to talk about his paper (with SSS-pal Justin Grimmer, Eitan Hersh, and Brian Feinstein) on close elections and their usefulness for estimating causal effects. A few recent papers exploit these marginal elections to answer a general class of questions: how does being elected to office affect a person’s outcomes? At the least, getting into office is a fairly big boost to a resume and, further, while in office there are various (ahem) business opportunities that may or may not be entirely legal. Fans of politics (including political scientists) have a keen interest in the effect of office-holding on re-election, commonly known as the incumbency advantage.

Obviously, simply looking at winners and losers is a problematic strategy, to say the least. So, instead, we look for winners and losers whose elections were randomly determined. And extremely close two-candidate elections seem to fit the bill. Poor weather, ballot miscounts, and voting errors can all push a narrow election margin to either candidate. Thus, the argument goes, vote shares right around 50-50 essentially assign the office by coin flip, and comparing winners and losers around that cutoff can actually estimate causal effects.

What Dan and his coauthors point out, though, is that some candidates might have more control than others over that coin flip. And candidates and parties are likely to devote more of their resources to those close elections than to safe ones. In fact, they show that candidates with structural advantages in these resources are far more likely to win these close elections. In the above graph, you can see that winners of House elections in the U.S. are much more likely to share the party of the current governor of the state, even when we restrict the sample to +/- 2% around the 50-50 mark. This indicates that there may be deeper imbalances between winners and losers, even very close to the 50-50 mark.

They suggest the fundamental differences between winners and losers in these close elections could come from two sources: the ability to get out the vote pre-election or successful legal challenges post-election. If those ballot miscounts get recounted or thrown out in favor of a candidate due to better legal maneuvering, then those outcomes aren’t terribly random, are they? The main critique here is that causal effects are hard to find when we compare winners and losers in close elections, and we have to make sure that our proposed “randomizations” make sense theoretically and hold empirically.
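The balance check behind that critique can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name, the data layout, and the toy elections are all made up. If close elections really were coin flips, roughly half of the winners inside the bandwidth should hold the structural advantage (for example, sharing the governor's party).

```python
import math


def advantage_share(elections, bandwidth):
    """Share of close-election winners who hold a structural advantage.

    elections: iterable of (margin, winner_advantaged) pairs, where margin is
    the winner's two-party vote share minus 0.5 (hypothetical layout).
    Returns (share, n, z), with z comparing the share to the coin-flip 0.5.
    """
    close = [adv for margin, adv in elections if margin <= bandwidth]
    n = len(close)
    if n == 0:
        return None
    share = sum(close) / n
    # Normal approximation to a binomial test of share == 0.5.
    z = (share - 0.5) / math.sqrt(0.25 / n)
    return share, n, z


# Made-up toy data: four races inside +/- 2 points, two outside.
toy = [(0.01, True), (0.015, True), (0.005, False),
       (0.019, True), (0.10, True), (0.30, False)]
share, n, z = advantage_share(toy, bandwidth=0.02)
```

Repeating the call over a range of bandwidths shows whether any imbalance fades as the window narrows, which is what the coin-flip story would predict.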

Posted by Matt Blackwell at 8:34 AM

November 29, 2010

Carpenter on "How Random are Marginal Election Outcomes?"

We hope that you can join us for the Applied Statistics Workshop this Wednesday, December 1st when we will be happy to have Dan Carpenter from the Department of Government. You will find an abstract below. As always, we will serve a light lunch and the talk will begin around 12:15p.


Elections with small margins of victory represent an important form of electoral competition and, increasingly, an opportunity for causal inference. Scholars using regression discontinuity designs (RDD) have interpreted the winners of close elections as randomly separated from the losers, using marginal election results as an experimental assignment of office-holding to one candidate versus the other. In this paper we suggest that marginal elections may not be as random as RDD analysts suggest. We draw upon the simple intuition that elections that are expected to be close will attract greater campaign expenditures before the election and invite legal challenges and even fraud after the election. We present theoretical models that predict systematic differences between winners and losers, even in elections with the thinnest victory margins. We test predictions of our models on a dataset of all House elections from 1946 to 1990. We demonstrate that candidates whose parties hold structural advantages in their district are systematically more likely to win close elections at a wide range of bandwidths. Our findings call into question the use of close elections for causal inference and demonstrate that marginal elections mask structural advantages that may be troubling normatively. (Co-authored with Justin Grimmer, Eitan Hersh, and Brian Feinstein)

Posted by Matt Blackwell at 9:00 AM

November 22, 2010

Weathermap History of the US Presidential Vote

David Sparks has drawn up some isarithmic maps of the two-party presidential vote over the last 90 years. An isarithmic map is sometimes called a heat map, and you would most often see a rough version of one on your local weather report. David shows us the political weather over time:

Isarithmic map of the 2008 Presidential Election

As you can see, the votes have been smoothed over geographic space. David also has a video where he smooths across time, leading to very beautiful plots. You should also see the summary of how he made the plots. A good reminder of the death-by-1000-papercuts nature of data analysis:

Using a custom function and the interp function from akima, I created a spatially smoothed image of interpolated partisanship at points other than the county centroids. This resulted in inferred votes over the Gulf of Mexico, the Atlantic and Pacific Oceans, the Great Lakes, Canada and Mexico -- so I had to clip any interpolated points outside of the U.S. border using the very handy pinpoly function from the spatialkernel package.
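For readers who want a feel for the two steps David describes (interpolate between county centroids, then clip to the border), here is a rough Python sketch. It is not his code: it swaps in plain inverse-distance weighting for akima's interp and a ray-casting test for pinpoly, and every name and number in it is hypothetical.

```python
def idw(points, x, y, power=2.0):
    """Inverse-distance-weighted average of (px, py, value) triples at (x, y)."""
    num = 0.0
    den = 0.0
    for px, py, value in points:
        d2 = (px - x) ** 2 + (py - y) ** 2
        if d2 == 0.0:
            return value  # exactly on a data point
        w = 1.0 / d2 ** (power / 2.0)
        num += w * value
        den += w
    return num / den


def inside(polygon, x, y):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    hit = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                hit = not hit
    return hit


def smooth_and_clip(points, polygon, grid):
    """Interpolate at each grid location, keeping only those inside polygon."""
    return {(x, y): idw(points, x, y) for x, y in grid if inside(polygon, x, y)}


# Made-up example: a square "border" and two centroids with vote shares.
border = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
centroids = [(1.0, 2.0, 0.2), (3.0, 2.0, 0.8)]
surface = smooth_and_clip(centroids, border, [(2.0, 2.0), (5.0, 2.0)])
```

The grid point at (5, 2) falls outside the square and is dropped, which is the same job the clipping step does for points over the oceans.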

My only worry is that geographic space might be the wrong dimension on which to smooth. With weather data, it makes obvious sense to smooth in space. With votes, it is less clear: a suburb of Chicago might have more in common with a suburb of Cleveland than it does with Chicago, even though it is much closer to Chicago. Thus, this type of smoothing might understate real, stark differences between local communities (Clayton Nall has some work on how the interstate highway system has accelerated some of these divides). Basically, I think there is a political space that fails to quite match up to geographic space. (Exploring what that political space looks like and why it occurs would be an interesting research project, at least to me.)

You should really explore the rest of David's site. He has numerous awesome visualizations.

Posted by Matt Blackwell at 3:10 PM

November 19, 2010

Seven Deadly Sins, Revisited

Gelman responds to Phil Schrodt’s take-down of statistical methodology, about which we commented a while back. To my mind, he has a strange take on the piece. To wit:

Not to burst anyone’s bubble here, but if you really think that multiple regression involves assumptions that are too much for the average political scientist, what do you think is going to happen with topological clustering algorithms, neural networks, and the rest??

Gelman is responding to two of Schrodt’s seven sins: (1) kitchen-sink regressions with baseless control sets and (2) dependence on linear models to the exclusion of other statistical models. I think that Gelman misinterprets Schrodt’s criticism here. It is not that political scientists somehow lack the ability to comprehend multiple regression and its assumptions. It is that political scientists are being lazy intellectually (possibly incentivized by the discipline!) and fail to critically examine their analysis or their methods. It’s a failure of standards and work, not a failure of intellect. Thus, I fail to see the contradiction in Schrodt’s advice or his condemnation: it’s a call to think more about our data and how they fit with our models and their assumptions. Now, one may think that this is beyond the abilities of folks, but I fail to see that argument being made by Schrodt (and I am certainly not making it).

Gelman himself often calls for simplicity:

I find myself telling people to go simple, simple, simple. When someone gives me their regression coefficient I ask for the average, when someone gives me the average I ask for a scatterplot, when someone gives me a scatterplot I ask them to carefully describe one data point, please.

This seems more about presentation of results or a failure to know the data. There is a huge challenge in more complicated models since they often require more care and attention to how we present the results. All of the techniques Gelman describes should be essential parts of the data analysis endeavor. That people fail to do these simple tasks speaks more to Schrodt’s accusation of “intellectual sloth” than anything.

Finally, we can probably all get behind a couple of commandments:

  1. Know thy data.
  2. Use this knowledge to find the best method for thy question.

I see both Gelman and Schrodt making these points, albeit differently. While Schrodt sees a violation of 2 as primarily due to intellectual laziness, Gelman sees it as primarily due to intellectual handicaps. Both are slightly unfair to academics, but sloth is at least curable.

Posted by Matt Blackwell at 6:55 PM

November 18, 2010

The Future of Bayes

John Salvatier has a blog post on the future of MCMC algorithms, focusing on differential methods, which use derivatives of the posterior to inform where the algorithm should move next. This allows for greater step length, faster convergence, and better handling of multimodal posteriors. Gelman agrees with the direction. There has been some recent work on implementing automatic differentiation in R, which is the cornerstone of the algorithms Salvatier discusses. Perhaps we will see this moving into some of the more popular MCMC packages soon.
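To make "using derivatives of the posterior" concrete, here is a minimal Python sketch of one simple gradient-based sampler, the Metropolis-adjusted Langevin algorithm. It is offered as an illustration of the general idea and is not necessarily one of the algorithms Salvatier covers. The target (a standard normal), the step size, and all function names are assumptions made for the example.

```python
import math
import random


def log_post(theta):
    # Hypothetical target: a standard normal log-density, up to a constant.
    return -0.5 * theta * theta


def grad_log_post(theta):
    # Derivative of log_post with respect to theta.
    return -theta


def mala_step(theta, step, rng):
    # Propose by drifting along the gradient, then adding Gaussian noise.
    mean_fwd = theta + 0.5 * step * grad_log_post(theta)
    prop = mean_fwd + math.sqrt(step) * rng.gauss(0.0, 1.0)
    # The proposal is asymmetric, so the reverse move enters the ratio.
    mean_back = prop + 0.5 * step * grad_log_post(prop)
    log_q_fwd = -((prop - mean_fwd) ** 2) / (2.0 * step)
    log_q_back = -((theta - mean_back) ** 2) / (2.0 * step)
    log_alpha = log_post(prop) - log_post(theta) + log_q_back - log_q_fwd
    accept_prob = math.exp(min(0.0, log_alpha))
    return prop if rng.random() < accept_prob else theta


def run_chain(n_draws, step=1.0, seed=0):
    # Run the sampler from theta = 0 and return every draw.
    rng = random.Random(seed)
    theta = 0.0
    draws = []
    for _ in range(n_draws):
        theta = mala_step(theta, step, rng)
        draws.append(theta)
    return draws
```

The gradient term is what lets the proposal lean toward higher-density regions instead of wandering blindly, and automatic differentiation would supply grad_log_post for models where writing it by hand is impractical.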

On a slightly different Bayes front, SSS-pal and former blogger Justin Grimmer has a paper on variational approximation, which is a method for deterministically approximating posteriors. This approach is often useful when MCMC is extremely slow or impossible, since convergence under VA is both fast and guaranteed.

Posted by Matt Blackwell at 9:52 AM

November 15, 2010

Raškovič on "Managing supplier-buyer relationships in transnational companies"

We hope that you can join us for the Applied Statistics Workshop this Wednesday, November 17th when we will be happy to have Matevž Raškovič from the University of Ljubljana, who is currently a Visiting Fellow in the Sociology Department. You will find an abstract below. As always, we will serve a light lunch and the talk will begin around 12:15p in CGIS K354 (1737 Cambridge St).


This talk is part of an ongoing PhD research taking place at the Faculty of Economics and Faculty of Social Sciences at the University of Ljubljana in Slovenia, and the Technical University Eindhoven in the Netherlands. The research is motivated by literature on the different 'mentalities' of international companies and the fact that transnational companies, as a unique type of an international company mentality, are increasingly being understood as communities and spaces of social relationships. The research explores the management of supplier-buyer relationships within Danfoss, Denmark's biggest industrial organization. In particular, it looks at how Danfoss manages supplier-buyer relationships to be both globally efficient and flexible, while at the same time facilitating learning. Balancing all these three strategic goals presents a considerable challenge for all internationally-active companies. The talk will offer a short theoretical background for the research and focus on presenting the multi-method mixed research design built around two separate two-mode egocentric cognitive social networks.

Posted by Matt Blackwell at 9:00 AM

November 11, 2010

Problematic Imputation at the Census

According to a working paper by Greg Kaplan and Sam Schulhofer-Wohl at the Federal Reserve Bank of Minneapolis, recent estimated declines in interstate migration are simply artifacts of the imputation procedure used by the Census Bureau.

The bureau uses a “hot-deck” imputation procedure to match respondents who fail to answer a question (called recipients) to those who do answer it (called donors) and to impute the recipient’s missing values with the donor’s observed values. For migration, the crucial question is where the respondent lived one year ago. Before 2006, they effectively did not match on current location, even though current location is a strong predictor of past location. In 2006, they switched:

Using the most recently processed respondent as the donor to impute missing answers means that the order of processing can affect the results. Since 2006, respondents have been processed in geographic order. This ordering means that the donor usually lives near the recipient. Since long-distance migration is rare, the donor’s location one year ago is also usually close to the recipient’s current location. Thus, if the procedure imputes that the recipient moved, it usually imputes a local move. Before 2006, the order of processing was geographic but within particular samples. Therefore, on average, donors lived farther from recipients; donors’ locations one year ago were also on average farther from recipients’ current locations; and recipients were more likely to have imputed interstate moves.
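A toy version of the mechanism, in Python, shows how much the processing order matters. This is a deliberately stripped-down sketch, not the Census Bureau's actual procedure: the field names and the four records are invented, and the only rule is "copy the last observed answer."

```python
def sequential_hot_deck(records, key):
    """Fill missing values of key with the most recently processed observed value."""
    last_seen = None
    filled_records = []
    for rec in records:
        filled = dict(rec)
        if filled[key] is None:
            filled[key] = last_seen  # stays None if no donor has been seen yet
        else:
            last_seen = filled[key]  # this respondent becomes the current donor
        filled_records.append(filled)
    return filled_records


def interstate_moves(records):
    """Count respondents whose state one year ago differs from their current state."""
    return sum(r["prev_state"] != r["state"] for r in records)


# Made-up respondents; None marks a missing answer.
people = [
    {"id": 1, "state": "MA", "prev_state": "MA"},
    {"id": 2, "state": "MA", "prev_state": None},
    {"id": 3, "state": "CA", "prev_state": "CA"},
    {"id": 4, "state": "CA", "prev_state": None},
]

# Geographic order: donors live in the same state as recipients.
geo_order = sorted(people, key=lambda r: r["state"])
# A non-geographic order: donors often live far from recipients.
mixed_order = [people[0], people[3], people[2], people[1]]

moves_geo = interstate_moves(sequential_hot_deck(geo_order, "prev_state"))
moves_mixed = interstate_moves(sequential_hot_deck(mixed_order, "prev_state"))
```

With these four records the geographic ordering imputes no interstate moves and the mixed ordering imputes two, even though the observed answers are identical in both runs.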

(via Gelman)

Posted by Matt Blackwell at 7:09 PM

October 22, 2010

Workflow Agonistes

The Setup is a site dedicated to interviewing nerdy folk about what software/hardware they use to do their jobs. It has mostly been web designers and software developers, which is interesting, yet removed from academics. Thus, I was glad to see them interview Kieran Healy, a sociologist at Duke. The whole thing is worth a read if you are interested (like me) in these sorts of things, but here is a bit of his advice:

Workflow Agonistes: I've written about this elsewhere, at greater length. Doing good social-scientific research involves bringing together a variety of different skills. There's a lot of writing and rewriting, with all that goes along with that. There is data to manage, clean, and analyze. There's code to be written and maintained. You're learning from and contributing to some field, so there's a whole apparatus of citation and referencing for that. And, ideally, what you're doing should be clear and reproducible both for your own sake, when you come back to it later, and the sake of collaborators, reviewers, and colleagues. How do you do all of that well? Available models prioritize different things. Many useful tricks and tools aren't taught formally at all. For me, the core tension is this. On the one hand, there are strong payoffs to having things organized simply, reliably, and effectively. Good software can help tremendously with this. On the other hand, though, it's obvious that there isn't just one best way (or one platform, toolchain, or whatever) to do it. Moreover, the people who do great work are often the ones who just shut up and play their guitar, so to speak. So it can be tricky to figure out when stopping to think about "the setup" is helpful, and when it's just an invitation to waste your increasingly precious time installing something that's likely to break something else in an effort to distract yourself. In practice I am only weakly able to manage this problem.

Also good advice:

I try to keep as much as possible in plain text.

On his site, Kieran has more guidance on choosing workflows for social science research. Sidenote: he has one of the best looking academic websites I have seen.

Posted by Matt Blackwell at 9:53 PM

September 21, 2010

What is the likelihood function?

An interesting 1992 paper by Bayarri and DeGroot entitled “Difficulties and Ambiguities in the Definition of a Likelihood Function” (gated version) grapples with the problem of defining the likelihood when auxiliary variables are at hand. Here is the abstract:

The likelihood function plays a very important role in the development of both the theory and practice of statistics. It is somewhat surprising to realize that no general rigorous definition of a likelihood function seem to ever have been given. Through a series of examples it is argued that no such definition is possible, illustrating the difficulties and ambiguities encountered specially in situations involving “random variables” and “parameters” which are not of primary interest. The fundamental role of such auxiliary quantities (unfairly called “nuisance”) is highlighted and a very simple function is argued to convey all the information provided by the observations.

The example that resonates with me is on pages 4-6, where they describe the ambiguity of defining the likelihood function when there is an observation y which is a measurement of x subject to (classical) error. There are several different ways of writing a likelihood in that case, depending on how you handle the latent, unobserved data x. One can condition on it, marginalize across it, or include it in the joint distribution of the data. Each of these can lead to a different MLE.
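As a sketch of that ambiguity, under assumptions that are mine rather than the paper's (a normal latent x with mean \mu and variance \tau^2, and normal measurement error with known variance \sigma^2), the three candidate likelihoods could be written as:

```latex
% Assumed model (illustrative only): x ~ N(\mu, \tau^2), y | x ~ N(x, \sigma^2)
\begin{align*}
L_{\text{cond}}(x; y) &= f(y \mid x)
  = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y - x)^2}{2\sigma^2}\right\}, \\
L_{\text{marg}}(\mu, \tau^2; y) &= \int f(y \mid x)\, f(x \mid \mu, \tau^2)\, dx
  = \frac{1}{\sqrt{2\pi(\sigma^2 + \tau^2)}} \exp\left\{-\frac{(y - \mu)^2}{2(\sigma^2 + \tau^2)}\right\}, \\
L_{\text{joint}}(\mu, \tau^2; y, x) &= f(y \mid x)\, f(x \mid \mu, \tau^2).
\end{align*}
```

In this toy setup the first is maximized at x = y and carries no information about \mu, the second is maximized in \mu at \mu = y, and the third, maximized jointly over x, \mu, and \tau^2, is unbounded as \tau^2 shrinks toward zero with x = \mu.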

Their point is that situations like this involve subjective choices (though all modeling requires subjective choices) and the hermetic seal between the “model” and the “prior” is less airtight than we think.

Posted by Matt Blackwell at 4:41 PM

September 14, 2010

You are not so smart

You are not so smart is a blog dedicated to explaining self-delusions. The most recent post is on the Texas sharpshooter fallacy:

The Misconception: You take randomness into account when determining cause and effect.
The Truth: You tend to ignore random chance when the results seem meaningful or when you want a random event to have a meaningful cause.

Posted by Matt Blackwell at 6:33 PM

August 30, 2010

Rigor and modeling in economics

In a postscript, Andrew Gelman laments a general trend he notices in economics:

My only real problem with it is that when discussing data analysis, [the authors] pretty much ignore the statistical literature and just look at econometrics. In the long run, that's fine--any relevant developments in statistics should eventually make their way over to the econometrics literature. But for now I think it's a drawback in that it encourages a focus on theory and testing rather than modeling and scientific understanding.

Gelman has an idea about why this might be the case:

The problem, I think, is that they (like many economists) think of statistical methods not as a tool for learning but as a tool for rigor. So they gravitate toward math-heavy methods based on testing, asymptotics, and abstract theories, rather than toward complex modeling. The result is a disconnect between statistical methods and applied goals.

There is likely a balance here that Gelman misses between theoretical modeling and statistical modeling. Economists are in the business of testing complex theoretical models. A complex statistical model may draw attention away from that narrow goal.

Not that I necessarily endorse that viewpoint. It simply feels slightly unfair to economists to say that their spartan statistical modeling is a product of their obsession with technical rigor.

Posted by Matt Blackwell at 9:43 AM

August 27, 2010

Jackman on the Australian Elections

If you enjoy Australian politics, betting markets, and sharp statistical analysis, take a look at Simon Jackman's blog. He has been killing it lately.

Posted by Matt Blackwell at 10:39 AM

The Seven Deadly Sins of Contemporary Quantitative Analysis

You may think you have good reasons to not stop what you are doing and read Phil Schrodt's essay on the "Seven Deadly Sins of Contemporary Quantitative Political Analysis". But you do not. Not only does the piece make several astute points about the current practice of quantitative social science (in a highly enjoyable way, I might add), but it also reviews developments in the philosophy of science that have led us here. The entirety is excellent, so picking out an excerpt is difficult, but here is his summary of our current philosophical messiness:

I will start by stepping back and taking a [decidedly] bird's eye (Thor's eye?) view of where we are in terms of the philosophy of science that lies beneath the quantitative analysis agenda, in the hope that knowing how we got here will help to point the way forward. In a nutshell, I think we are currently stuck with an incomplete philosophical framework inherited (along with a lot of useful ideas) from the logical positivists, combined with a philosophically incoherent approach adopted from frequentism. The way out is a combination of renewing interest in the logical positivist agenda, with suitable updating for 21st century understandings of stochastic approaches, and with a focus on the social sciences more generally. Much of this work has been done last decade or so in the qualitative and multi-methods community but not, curiously, in the quantitative community. The quantitative community does, however, provide quite unambiguously the Bayesian alternative to frequentism, which in turn solves most of the current contradictions in frequentism which we somehow--believing six impossible things before breakfast--persuade our students are not contradictions. But we need to systematically incorporate the Bayesian approach into our pedagogy. In short, we may be in a swamp at the moment, but the way out is relatively clear.

His section on "prediction versus explanation" is also quite insightful and deserves more attention. The upshot:

...the point is that distinguishing scientific explanation from mythical (or other non-scientific, such as Freudian) explanation is one of the central themes for the logical positivists. In the absence of prediction, it cannot be done.

Warning: if you truly love significance tests, you might feel a little heartbroken when reading this essay.

Posted by Matt Blackwell at 9:34 AM

April 16, 2010

Hainmueller in the New York Times

Jens Hainmueller, Assistant Professor at MIT and former writer for this very blog, has had some of his research written up in the New York Times today:

"Americans, whether they are rich or poor, are much more in favor of high-skilled immigrants," said Jens Hainmueller, a political scientist at M.I.T. and co-author of a survey of attitudes toward immigration with Michael J. Hiscox, professor of government at Harvard. The survey of 1,600 adults, which examined the reasons for anti-immigration sentiment in the United States, was published in February in American Political Science Review, a peer-reviewed journal.

There is an ungated version of the original paper.

Posted by Matt Blackwell at 8:36 AM

April 15, 2010

The inevitable R backlash

There is a blog post floating around by Dr. AnnaMaria De Mars, where she speculates on what the "next big thing" is going to be. Apparently, it is data visualization and analyzing unstructured data, but not R:

Contrary to what some people seem to think, R is definitely not the next big thing, either. I am always surprised when people ask me why I think that, because to my mind it is obvious...I know that R is free and I am actually a Unix fan and think Open Source software is a great idea. However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail. It does NOT fit with the way the vast majority of people in the world use computers. The vast majority of people are NOT programmers. They are used to looking at things and clicking on things.

(I am not sure how a "non-programmer" is going to be able to analyze unstructured data or create wonderful visualizations, but that is beside the point.)

The ease-of-use argument or the "not everyone is a programmer" argument is one to which I am sympathetic. It has become quite heated in the debate over the Apple iPad in the last few months. Where the iPad succeeds is in simplifying the act of content consumption, which is fantastic.

The act of content creation is more fickle and has always required special tools, and running statistical analyses falls firmly into content creation. While it is true that most people are not programmers, it is also true that most people are not creating statistical content. Being able to program grants you agility in the face of data analysis that large statistical software packages cannot provide. They move too slowly.

R's core functionality moves fairly slowly as well, but it gives you the tools you need to implement basically any algorithm or any statistical model. This is leading to a lot of innovation by small groups of users, creating packages to fill voids. It feels more like a programming language than a unified piece of software (libraries! command-line!), but this is what makes it flexible.

And if we are being honest with ourselves, there is a fundamental fact: point-and-click interfaces do not promote replicability. This might be fine in the private sector; I am not sure. But in the academic world, being able to replicate a finding is crucial.

Posted by Matt Blackwell at 8:27 AM

March 10, 2010

Google public data explorer

The Google Public Data Explorer just went up and it is worth a look. They have collected a number of large datasets and created a set of visualization tools to explore the data. Probably most interesting is the ability to show how the data changes over time using animation. This will be familiar to you if you have seen any of Hans Rosling's TED talks.

While it is fun to play around with the data, it can be a bit overwhelming. Content requires curation. One that I found interesting was the World Bank data on net migration:

It's hard to get the colors/sizes quite right since size measures just the magnitude (positive or negative) and the colors range from red (people coming) to blue (people going). This sort of feels like the natural extension of programs like SPSS to the web.

Posted by Matt Blackwell at 4:53 PM

March 8, 2010

Zajonc on "Bayesian Inference for Dynamic Treatment Regimes"

We hope you will join us this Wednesday, March 10th at the Applied Statistics workshop when we will be happy to have Tristan Zajonc (Harvard Kennedy School). Details and an abstract are below. A light lunch will be served. Thanks!

"Bayesian Inference for Dynamic Treatment Regimes"
Tristan Zajonc
Harvard Kennedy School
March 10th, 2010, 12 noon
K354 CGIS Knafel (1737 Cambridge St)


Policies in health, education, and economics often unfold sequentially and adapt to developing conditions. Doctors treat patients over time depending on their prognosis, educators assign students to courses given their past performance, and governments design social insurance programs to address dynamic needs and incentives. I present the Bayesian perspective on causal inference and optimal treatment choice for these types of adaptive policies or dynamic treatment regimes. The key empirical difficulty is dynamic selection into treatment: intermediate outcomes are simultaneously pre-treatment confounders and post-treatment outcomes, causing standard program evaluation methods to fail. Once properly formulated, however, sequential selection into treatment on past observables poses no unique difficulty for model-based inference, and analysis proceeds equivalently to a full-information analysis under complete randomization. I consider optimal treatment choice as a Bayesian decision problem. Given data on past treated and untreated units, analysts propose treatment rules for future units to maximize a policymaker's objective function. When policymakers have multidimensional preferences, the approach can estimate the set of feasible outcomes or the tradeoff between equity and efficiency. I demonstrate these methods through an application to optimal student tracking in ninth and tenth grade mathematics. An easy-to-implement optimal dynamic tracking regime increases tenth grade mathematics achievement 0.1 standard deviations above the status quo, with no corresponding increase in inequality. The proposed methods provide a flexible and principled approach to causal inference for sequential treatments and optimal treatment choice under uncertainty.

Posted by Matt Blackwell at 12:42 PM

Tufte goes to Washington

In case you have not heard, Edward Tufte has been appointed to the Recovery Independent Advisory Panel by President Obama. The mission statement of the Panel is:

To promote accountability by coordinating and conducting oversight of Recovery funds to prevent fraud, waste, and abuse and to foster transparency on Recovery spending by providing the public with accurate, user-friendly information.

It is hard to imagine a better person for this panel than Tufte. As Feltron said, this is wonderful news for data nerds, designers, and the general public.

Posted by Matt Blackwell at 1:00 AM

March 6, 2010

Teaching teachers

Andrew Gelman has some good comments on the great Elizabeth Green article about teaching in the New York Times Magazine. The article is about how to improve both classroom management and subject instruction for K-12 teachers, but Gelman correctly points out that many of these struggles resonate with those of us teaching statistics at the undergraduate and graduate levels.

I used to be of the opinion that the teaching of children and the teaching of adults were two fundamentally different beasts and comparisons between the two were missing the point. The more I teach, though, the more I see teaching as a kind of a skill which is separated from the material being taught. Knowing a topic well does not imply being able to teach a topic well. This should have been obvious to me given the chasm between good research and good presentations.1 The article nails this as it talks about math instruction:

Mathematicians need to understand a problem only for themselves; math teachers need both to know the math and to know how 30 different minds might understand (or misunderstand) it. Then they need to take each mind from not getting it to mastery. And they need to do this in 45 minutes or less. This was neither pure content knowledge nor what educators call pedagogical knowledge, a set of facts independent of subject matter, like Lemov's techniques. It was a different animal altogether.

If this is true, how can we improve teaching? I think that Gelman is right in identifying student participation as important to teaching statistics. Most instructors would agree that statistics is all about learning by doing, but many of us struggle to identify how to actually implement this, especially in lectures. Cold-calling is extremely popular with law and business schools, but rare in the social sciences. Breaking off to do group work is another useful technique. In addition to giving up control of the class (which Gelman mentions), instructors have to really build the class around these breaks.

Reflecting on my own experience, both as a student and an instructor, I am starting to believe in three (related) fundamentals of statistics teaching:

  1. Repetition. If we really do learn by doing, then we should pony up and have students do many simple problems that involve the same fundamental skill or concept.

  2. Mantras. We are often trying to give students intuitions about the way statistics "works," but many students just need a simple, compact definition of the concept. Before I understood the Central Limit Theorem, I could tell you what it was ("The sums and means of random variables tend to be Normal as we get more data") because of the mantra that my first methods instructor taught me. As a friend told me, statistics is a foreign language and in order to write sentences you first need to know some vocabulary.

  3. Maps. It is so easy to feel lost in a statistics course and not understand how one week relates to the next. A huge help is to give students a diagram that represents where they are (specific topic) and where they are going (goals). The whole class should be focused around the path to the goals and they should always be able to locate themselves on the path.
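The Central Limit Theorem mantra in item 2 lends itself to the kind of simple, repeatable exercise that item 1 asks for. Here is a minimal sketch, assuming exponential draws (a deliberately skewed choice); the distribution, the sample sizes, and the function name are illustrative assumptions, not part of any particular course:

```python
import random
import statistics

random.seed(1)

def share_within_one_sd(n, reps=5000):
    """Share of sample means (n exponential(1) draws each) that fall
    within one theoretical standard deviation of the true mean."""
    true_mean = 1.0              # mean of an exponential(1) draw
    sd_of_mean = 1.0 / n ** 0.5  # sd of the mean of n such draws
    hits = 0
    for _ in range(reps):
        mean = statistics.fmean(random.expovariate(1.0) for _ in range(n))
        if abs(mean - true_mean) < sd_of_mean:
            hits += 1
    return hits / reps

# A Normal distribution puts about 0.683 of its mass within one sd.
for n in (1, 5, 50):
    print(n, round(share_within_one_sd(n), 3))
```

With a single draw the share should sit near 0.86, because the exponential is skewed. As n grows, it should drift toward the Normal benchmark of roughly 0.68, which is the mantra in action.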

There are probably more fundamentals that I am missing, but I think each of these is important and overlooked. Often this is simply because they are hard to implement, instructors have other commitments, and the value-added of improving instruction can be very low. In spite of these concerns and the usual red herrings2, I think that there are simple changes we can make to improve our teaching.
1 Perhaps a more subtle point is that being a good presenter does not imply being a good instructor. They are related, though. Good public speakers have an advantage as teachers, since they are presumably more comfortable in front of crowds. The goal of presenting (persuasion) and the goal of instruction (training people in a skill) are very different. People confuse the two because the medium is often so similar (lecture halls, podiums, etc).
2 Teaching evaluations are important, but they are often very coarse. Students know if they didn't understand something, but rarely know why. Furthermore, improving evaluations need not come from improving instruction.

Posted by Matt Blackwell at 4:10 PM

March 5, 2010

Collecting datasets

Infochimps hosts what looks to be a growing number of datasets, mostly free. There seems to be some ability to sell your dataset (at a 50% commission rate!), but the real story is the quick ability to browse data. It looks a little thin now, but as someone who is constantly looking for good examples for teaching, I think this could be a valuable resource. (via gelman)

Posted by Matt Blackwell at 9:47 AM

March 2, 2010

Newsdot maps the news

Newsdot is a new tool from Slate that displays a "social network" for topics in the news, be they people, organizations, or locations. Here's a look:


It uses a product called Calais, which does automatic tagging of documents by finding keywords. You can try it out with any set of text with their viewer. Here is a sample output from an article in the New York Times about the primary elections in Texas:


You can see that Calais has been able to identify Gov. Perry and Sen. Hutchison, in addition to any pronouns or verbs that refer to them.

Some thoughts are below the fold.

  1. I love the idea of mapping the space of "news" and using tags is a creative way of doing this. One way of improving this whole enterprise would be to cluster the topics and use those clusters to color the dots instead of the type of "node" it is (currently, it's blue for countries, red for people, etc.).
  2. Calais is the kind of tool that really grabs my attention, much like Mechanical Turk did when I first heard about it. These types of products are going to completely change the way we do research. There used to be large barriers to entry to conducting research because of the resources needed to collect, manage and store data. Even just a few years ago, if you wanted to get a large dataset, you would have to either spend a lot of time or hire someone. Tools like Calais and mturk allow non-programmers to collect and manage data at much faster rates, for much cheaper. This opening up of data could shake up academia by increasing the speed of research production and allowing "startup" researchers to produce high-quality analyses. (Relatedly, the opening up of information (not limited to data) over the last decade lowered the cost of becoming an "expert" and altered the depth vs. breadth tradeoff.)

Posted by Matt Blackwell at 11:17 AM

March 1, 2010

Steenburgh on "Substitution Patterns of the Random Coefficients Logit"

We hope you will join us this Wednesday, March 3rd at the Applied Statistics workshop when we will be happy to have Thomas Steenburgh (Harvard Business School). Details, an abstract, and a link to the paper are below. A light lunch will be served. Thanks!

"Substitution Patterns of the Random Coefficients Logit"
Thomas Steenburgh
Harvard Business School
March 3rd, 2010, 12 noon
K354 CGIS Knafel (1737 Cambridge St)

You can find the paper at the SSRN.


Previous research suggests that the random coefficients logit is a highly flexible model that overcomes the problems of the homogeneous logit by allowing for differences in tastes across individuals. The purpose of this paper is to show that this is not true. We prove that the random coefficients logit imposes restrictions on individual choice behavior that limit the types of substitution patterns that can be found through empirical analysis, and we raise fundamental questions about when the model can be used to recover individuals' preferences from their observed choices.

Part of the misunderstanding about the random coefficients logit can be attributed to the lack of cross-level inference in previous research. To overcome this deficiency, we design several Monte Carlo experiments to show what the model predicts at both the individual and the population levels. These experiments show that the random coefficients logit leads a researcher to very different conclusions about individuals' tastes depending on how alternatives are presented in the choice set. In turn, these biased parameter estimates affect counterfactual predictions. In one experiment, the market share predictions for a given alternative in a given choice set range between 17% and 83% depending on how the alternatives are displayed both in the data used for estimation and in the counterfactual scenario under consideration. This occurs even though the market shares observed in the data are always about 50% regardless of the display.

Posted by Matt Blackwell at 10:43 AM

February 23, 2010

Applied Statistics Workshop on Video

Over the course of the year we have tried to record many of the Applied Statistics workshops, but only now have we finally posted one. It is from Cassandra Wolos Pattanayak's talk on propensity score matching at the CDC from last week. You can find it here and on the seminar website.

Posted by Matt Blackwell at 3:43 PM

Killingsworth on "Happiness"

We hope you will join us this Wednesday, February 24th at the Applied Statistics workshop when we will be happy to have Matt Killingsworth (Department of Psychology). An abstract is below. A light lunch will be served. Thanks!

"Mind Wandering and Happiness"
Matt Killingsworth
Department of Psychology
February 24th, 2010, 12 noon
K354 CGIS Knafel (1737 Cambridge St)

You can preview the iPhone app to see how the data is collected.

Although humans spend much of their time mind-wandering, i.e., thinking about something other than what one is actually doing, little is known about mind wandering's relation to human happiness. Using novel technology to achieve the world's largest experience sampling study of people's everyday lives, we found that participants spent nearly half of their waking hours mind-wandering and that it had large effects on happiness. Mind wandering was never observed to increase happiness and often reduced happiness considerably. Although some activities and situations modestly decreased the probability of mind wandering, they generally did not buffer against negative thoughts when a person's mind did stray from the present.

Posted by Matt Blackwell at 3:36 PM

January 21, 2010

Voter Outrage over Health Care

Political scientists David Brady and Doug Rivers, along with business and law professor Daniel Kessler, wrote an op-ed for the WSJ arguing that the health care bill is hurting the Democrats. Their evidence is that states with lower support for the bill also have lower support for incumbent Democratic senatorial candidates:

Health reform is more popular in some of these states than in others. Where it's popular, Democratic candidates don't have too much of a problem, but where it's unpopular--and that includes most states--the Democratic Senate candidates are fighting an uphill battle. Support for health reform varies in these 11 states from a low of 33% in North Dakota to a high of 48% in Nevada. Democrats trail Republicans in six of the states; three are toss-ups; and in two, Democrats have a solid lead.
I hate to fill any kind of institutional stereotype, but the causal reasoning here leaves much to be desired. The argument of the essay is that BECAUSE of health care, Democrats are doing worse in the polls. On this question, obviously, we have no data: this is why speculation is running rampant. The counterfactual would be: what would have happened to Democratic senatorial candidates if there had been no (or a substantially smaller) health care bill? Pundits can hardly type fast enough to get answers to this question out right now. Certainly, though, a correlation of support for health-care and support for Democrats will not provide the answer (since, you know, there is no variation on the treatment--all states are in the health care reform world).

Despite the general tone of the piece ("The culprit is the unpopularity of health reform..."), I believe the authors are making a different argument. Namely, that voters are responding to their senator's vote on health care. Based on their evidence, however, I think this is a flawed argument as well.

Confounding is an obvious problem here. There are many factors that could influence opinions on health care and the Democrats (ideology, economic performance, etc). The authors clearly consider possible problems of confounding:

How do we know that it's the health-reform bill that's to blame for the low poll numbers for Democratic Senate candidates and not just that these are more conservative states?

First, we asked voters how their incumbent senator voted on the health-care bill that passed on Christmas Eve. About two-thirds answered correctly. Even now, long before Senate campaigns have intensified, voters know where the candidates stand on health care. And second, we asked voters about their preference for Democrat versus Republican candidates in a generic House race. As in the Senate, the higher the level of opposition to health reform, the greater the likelihood that the state's voters supported Republicans.

It might be the case that voters are punishing known health care supporters! But, again, I am not sure that these polls show this. The Senate vote was party-line. If someone knew their senator's party, then they could infer their vote without actually knowing it. They could simply know that Democrats are trying to reform health care and their senator is a Democrat. Under this scenario, the actual vote of the senator would make no difference since our hypothetical voter equates Democrats with health care reform.

Put it this way: do you think that House Democrats that voted against the bill are going to have easy reelection campaigns? That seems like the real test of this hypothesis.

A simple gut check would be to run the same analysis with the stimulus instead of health care. I imagine you would get similar results. The point is that the advice from this article for Democrats--withdraw support for health care reform--is not supported by the data.
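The confounding worry can be made concrete with a toy simulation. This is only a sketch: every number below is invented for illustration, none of it comes from the polls discussed here, and the "states" are hypothetical. Ideology drives both support for reform and support for Democrats, and reform has no effect at all on Democratic support, yet the two are strongly correlated across states:

```python
import random

random.seed(4)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

ideology, reform, dem = [], [], []
for _ in range(5000):                      # hypothetical states
    c = random.gauss(0.0, 1.0)             # higher = more conservative
    ideology.append(c)
    # Both outcomes depend on ideology only; reform never enters dem.
    reform.append(40.0 - 5.0 * c + random.gauss(0.0, 2.0))
    dem.append(45.0 - 8.0 * c + random.gauss(0.0, 3.0))

raw_corr = pearson(reform, dem)

# Compare only states with nearly identical ideology.
band = [(r, d) for c, r, d in zip(ideology, reform, dem) if abs(c) < 0.1]
within_corr = pearson([r for r, _ in band], [d for _, d in band])

print(round(raw_corr, 2), round(within_corr, 2))
```

The raw correlation should be large, while the correlation among states with nearly the same ideology should be close to zero. A cross-state correlation of this kind cannot by itself separate the effect of health care from the effect of living in a conservative state.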

UPDATE: Brendan Nyhan over at Pollster makes essentially the same argument.

Posted by Matt Blackwell at 4:02 PM

January 9, 2010

Netflix queues by zip code

The New York Times has put together an awesome data visualization on the geography of Netflix. For each zip code they have the top 50 rentals of 2009 and they use these ranks to draw heat maps for each movie. There are all kinds of interesting patterns that point to both how preferences cluster and information spreads. My favorite two maps are the following, which I reference after the jump (darker colors indicate more rentals in that area):

Mad Men, Season 1 Disc 1:


Paul Blart, Mall Cop:

First, an Oscar nomination seems to put you at the top of everyone's list, regardless of geography. Thus, Slumdog Millionaire, Benjamin Button, Gran Torino and Doubt all have high ranks. Second, box-office blockbusters do fairly poorly across the board, seemingly because most people saw those movies in the theaters (Wall-E, Dark Knight, etc).

Finally, the remaining movies show a great deal of geographic variation. There is a fairly pronounced difference between urban centers and the suburbs. Unsurprisingly, movies that have high Metacritic scores do very well in the urban centers, whereas they seem absent from the outlying areas. Conversely, movies that critics consider terrible (and are usually marketed toward teenagers) mostly ship to the suburbs. You can see this in the stark difference between the critically acclaimed TV show Mad Men and the slapstick comedy Paul Blart: Mall Cop (full disclosure: Mad Men was on my queue last year, Paul Blart was not, and I live in Cambridge/Somerville).

The other obvious divide that arises is race. Tyler Perry's two movies were only on the top 50 lists for a handful of neighborhoods that are predominantly African-American. In Boston, for example, the movies cluster heavily in Dorchester and Mattapan.

Tyler Perry's The Family That Prays:


How people form preferences is one of my favorite subjects and I love visualizations like these. My instinct is that there is a lot of preference clustering happening, based largely on age, class and, to a lesser extent, race. But above and beyond this, I imagine the information networks vary by geography--urbanites may hear about movies from certain blogs, while folks in the suburbs (who probably have more children and teens) might rely more on national TV advertisements. The Oscars tend to cross geographic and social lines because they are a widely-visible, low-cost indicator of movie quality. All of this points to a key fact: how information gets into and flows through our social network(s) is an important aspect of how our preferences come to be.

Also, this is begging for someone to put together a list of "Democrat" movies and "Republican" movies based on party affiliation in each zip code.

Posted by Matt Blackwell at 5:06 PM

Sequential Ideal Points

Simon Jackman puts together a plot of how the estimation of ideal points of the 111th U.S. Senate changes as he adds each roll call. Every Senator starts the term at 0 and then branches out. It illustrates an interesting feature of these IRT models:

The other thing is that there doesn't seem to be any obvious "vote 1" update for ideal points. That is, there is no simple mapping from the ideal point estimate based on m roll call to ideal point estimates based on m+1 roll calls. You have to start the fitting algorithm from scratch each time (and hence the appeal of exploiting multiple cores etc), although the results from the previous run giving pretty good start values.

Posted by Matt Blackwell at 3:57 PM

November 30, 2009

Glynn on "What Can We Learn with Statistical Truth Serum?"

We hope you can join us this Wednesday, December 2nd for the final Applied Statistics Workshop of the term, when we will have Adam Glynn (Department of Government) presenting his talk entitled "What Can We Learn with Statistical Truth Serum?" Adam has provided the following abstract:

Due to the inherent sensitivity of many survey questions, a number of researchers have adopted indirect questioning techniques in order to minimize bias due to dishonest or evasive responses. Recently, one such technique, known as the list experiment (and also known as the item count technique or the unmatched count technique), has become increasingly popular due to its feasibility in online surveys. In this talk, I will present results from two studies that utilize list experiments and discuss the implications of these results for the design and analysis of future studies. In particular, these studies demonstrate that, when the key assumptions hold, standard practice ignores relevant information available in the data, and when the key assumptions do not hold, standard practice will not detect some detectable violations of these assumptions.
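For readers who have not seen a list experiment, the standard estimator is a difference in mean item counts. The control group counts how many of a set of innocuous items apply to them, the treatment group sees the same list plus the sensitive item, and the difference in means estimates the prevalence of the sensitive item. Below is a minimal sketch on simulated data. The prevalence, the baseline item probabilities, and the function name are all assumptions for illustration, and this is the textbook difference-in-means estimator, not the refinements discussed in the talk:

```python
import random
import statistics

random.seed(2)

TRUE_PREVALENCE = 0.3              # assumed share holding the sensitive trait
BASELINE_PROBS = (0.5, 0.2, 0.7)   # assumed rates for three innocuous items
N_PER_ARM = 20000

def count_response(sees_sensitive_item):
    """Number of list items a simulated respondent reports as applying."""
    baseline = sum(random.random() < p for p in BASELINE_PROBS)
    has_trait = random.random() < TRUE_PREVALENCE
    # Only the treatment list includes the sensitive item.
    return baseline + (1 if sees_sensitive_item and has_trait else 0)

control = [count_response(False) for _ in range(N_PER_ARM)]
treatment = [count_response(True) for _ in range(N_PER_ARM)]

# Difference in mean counts estimates the prevalence of the sensitive item.
estimate = statistics.fmean(treatment) - statistics.fmean(control)
print(round(estimate, 3))
```

The sketch assumes respondents answer the counts honestly and that adding the sensitive item does not change answers to the baseline items. Those are the kinds of key assumptions the abstract refers to.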

A copy of the companion paper will appear on our website shortly.

The workshop will begin at 12 noon with a light lunch and wrap up by 1:30. We meet in room K354 of CGIS Knafel (1737 Cambridge St). We hope you can make it.

Posted by Matt Blackwell at 4:28 PM

November 28, 2009

Cookbooks and constitutions

Slightly off-topic insights from Adam Gopnik:

All this is true, and yet the real surprise of the cookbook, as of the constitution, is that it sometimes makes something better in the space between what's promised and what's made...Between the rule and the meal falls the ritual, and the real ritual of the recipe is like the ritual of the law; the reason the judge sits high up, in a robe, is not that it makes a difference to the case but that it makes a difference to the clients. The recipe is, in this way, our richest instance of the force and the power of abstract rules.

There's a research agenda somewhere in those sentences, I believe. Rules lead to rituals and yet rules are simply codified rituals. A small point, perhaps a bit obvious, yet it speaks more broadly to social science research. And highlights where qualitative scholars get it right: looking for correlations between rules (or structure?) and outcomes often averages out the most intriguing part of the story.

(hat tip, MR)

Posted by Matt Blackwell at 2:47 PM

November 17, 2009

Dynamic Panel Models

I have been toying around with dynamic panel models from the econometrics literature and I have hit my head up against a key set of assertions. First, a quick setup. The idea with these models is that we have a set of units which we measure at different points in time. For instance, perhaps we survey a group of people multiple times in the course of an election and ask them how they are going to vote, do they plan to vote, how do they rate the candidates, etc. We might then want to know how these answers vary over time or with certain covariates.

Here is a typical model:


There are two typical features of these models that seem relevant. First, most include a lagged dependent variable (LDV) to account for persistence in the responses. If I was going to vote for McCain the last time you called, I'll probably still want to do that this time. Makes sense. Second, we include a unit-specific effect, alpha, to account for all other relevant factors. Dynamic panel models tend to identify their effects with a simple differencing by running the following model:


Which eliminates the unit-specific effect by the differencing, but our parameters remain, ready to be estimated. I should note that there are some identification issues left to solve and the differences between estimators in this field mostly have to do with how to instrument for the differenced LDV.
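The two equations referred to above were images and are not reproduced in this text. Assuming the standard linear specification that the description suggests (a lagged dependent variable, covariates, and a unit-specific effect alpha), they would plausibly take the following form. The symbols rho, beta, x, and epsilon are notation assumed here and may differ from the original:

```latex
\begin{align}
% Levels model: lagged dependent variable, covariates, unit-specific effect
y_{it} &= \rho\, y_{i,t-1} + x_{it}'\beta + \alpha_i + \varepsilon_{it} \\
% First difference: \alpha_i is constant over t, so it drops out
y_{it} - y_{i,t-1} &= \rho\,(y_{i,t-1} - y_{i,t-2})
    + (x_{it} - x_{i,t-1})'\beta
    + (\varepsilon_{it} - \varepsilon_{i,t-1})
\end{align}
```

In this form the differenced lag contains the shock from period t-1, which also appears in the differenced error. That shared shock is why the differenced LDV needs an instrument.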

Reading these models, I have two questions. One, is there a reason to expect that we need both an LDV and a unit-specific effect? This means that we expect that there is a shock to a unit's dependent variable that is constant across periods. I find this a strange assumption. I understand a unit-specific shock to the initial level and then using LDV thereafter, but in every period?

Two, the entire identification strategy here is based on the additivity of the model, correct? If we were to draw a directed acyclic graph of these models, it would be trivially obvious that we could never identify this model nonparametrically. I understand that we sometimes need to use models to identify effects, but should these identifications depend so heavily on the functional form? It seems that this problem is tied up in the first. We are allowing for the unit-specific effect as a way to free the model of unnecessary assumptions, yet this forces our hand into making different, perhaps stronger assumptions to get identification.
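A small simulation shows how much the differencing and instrumenting steps lean on the additive form. This is a sketch under assumed values: rho = 0.5, standard normal unit effects and shocks, and no covariates. It uses the level two periods back as the instrument, in the style of the Anderson-Hsiao estimator, which is one common choice among several in this literature:

```python
import random

random.seed(0)

RHO = 0.5        # true persistence (assumed for the demo)
N_UNITS = 5000
BURN_IN = 20     # periods discarded so the start value does not matter
T_KEEP = 8       # periods kept per unit

num_ols = den_ols = 0.0   # OLS of differenced y on its differenced lag
num_iv = den_iv = 0.0     # IV using the level two periods back

for _ in range(N_UNITS):
    alpha = random.gauss(0.0, 1.0)   # unit-specific effect, additive
    y = 0.0
    path = []
    for t in range(BURN_IN + T_KEEP):
        y = RHO * y + alpha + random.gauss(0.0, 1.0)
        if t >= BURN_IN:
            path.append(y)
    for t in range(2, T_KEEP):
        dy = path[t] - path[t - 1]           # alpha cancels here
        dy_lag = path[t - 1] - path[t - 2]   # shares a shock with the error
        z = path[t - 2]                      # instrument: predates both shocks
        num_ols += dy_lag * dy
        den_ols += dy_lag * dy_lag
        num_iv += z * dy
        den_iv += z * dy_lag

rho_ols_diff = num_ols / den_ols
rho_iv = num_iv / den_iv
print(round(rho_ols_diff, 3), round(rho_iv, 3))
```

Under these assumptions, OLS on the differenced data should land far from 0.5 (in theory near -0.25), while the instrumented estimate should land close to 0.5. Both results depend on alpha entering additively. If alpha interacted with the lag, differencing would no longer remove it.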

Please clear up my confusion in the comments if you are more in the know.

Posted by Matt Blackwell at 1:49 PM

November 16, 2009

Greiner on "Exit Polling and Racial Bloc Voting"

Please join us at the Applied Statistics workshop this Wednesday, November 18th at 12 noon when we will be happy to have Jim Greiner of the Harvard Law School presenting on "Exit Polling and Racial Bloc Voting: Combining Individual-Level and R x C Ecological Data." Jim has provided a companion paper with the following abstract:

Despite its shortcomings, cross-level or ecological inference remains a necessary part of many areas of quantitative inference, including in United States voting rights litigation. Ecological inference suffers from a lack of identification that, most agree, is best addressed by incorporating individual-level data into the model. In this paper, we test the limits of such an incorporation by attempting it in the context of drawing inferences about racial voting patterns using a combination of an exit poll and precinct-level ecological data; accurate information about racial voting patterns is needed to trigger voting rights laws that can determine the composition of United States legislative bodies. Specifically, we extend and study a hybrid model that addresses two-way tables of arbitrary dimension. We apply the hybrid model to an exit poll we administered in the City of Boston in 2008. Using the resulting data as well as simulation, we compare the performance of a pure ecological estimator, pure survey estimators using various sampling schemes, and our hybrid. We conclude that the hybrid estimator offers substantial benefits by enabling substantive inferences about voting patterns not practicably available without its use.

Both the paper and the technical appendix are on the course website.

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.

Posted by Matt Blackwell at 9:00 AM

November 3, 2009

Airoldi on "A statistical perspective on complex networks"

I hope you can join us at the Applied Statistics Workshop this Wednesday, November 4th, when we will be happy to have Edo Airoldi, Assistant Professor in the Department of Statistics here at Harvard. Edo will be presenting a talk entitled "A statistical perspective on complex networks" for which he has provided the following abstract:

Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of science, as many scientific inquiries involve collections of measurements on pairs of objects. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. In this talk, I will review a few ideas that are central to this burgeoning literature. I will emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. I will conclude by describing open problems and challenges for machine learning and statistics.

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.

Posted by Matt Blackwell at 10:47 AM

October 26, 2009

Tchetgen on "Doubly robust estimation in a semi-parametric odds ratio model"

This Wednesday, October 28th, the Applied Statistics workshop will welcome Eric Tchetgen Tchetgen, Assistant Professor of Epidemiology at Harvard School of Public Health, presenting his work titled "Doubly robust estimation in a semi-parametric odds ratio model." Eric has provided the following abstract for the paper:

We consider the doubly robust estimation of the parameters in a semi-parametric conditional odds ratio model characterizing the effect of an exposure in the presence of many confounders. We develop estimators that are consistent and asymptotically normal in a union model where either a prospective baseline density function or a retrospective baseline density function is correctly specified but not necessarily both. The case of a binary outcome is of particular interest; then our approach yields a doubly robust locally efficient estimator in a semi-parametric logistic regression model. For general types of outcomes, we provide a strategy to obtain doubly robust estimators that are nearly locally efficient. We illustrate the method in a simulation study and an application in statistical genetics. Finally, we briefly discuss extensions of the proposed method to the semi-parametric estimation of a parameter indexing an interaction between two exposures on the logistic scale, as well as extensions to the setting of a time-varying exposure in the presence of time-varying confounding.

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.

Posted by Matt Blackwell at 11:10 AM

October 20, 2009

Elements of Statistical Learning (Online)

In case you had not already heard, Trevor Hastie, Robert Tibshirani, and Jerome Friedman have put a PDF copy of the second edition of their excellent text Elements of Statistical Learning on the book's website. I am sure many of you already own it, but a searchable version for the laptop is incredibly useful. The second edition has a lot of new content, including completely new chapters on Random Forests, Ensemble Learning, Undirected Graphical Models, and High-Dimensional Problems.

While a copy on your computer is very handy, a desk copy of this book is essential if you are interested in machine learning or data mining. The book is also a sight to behold. You can buy a copy at Amazon or Springer.

Posted by Matt Blackwell at 10:15 AM

October 19, 2009

Eggers on "Electoral Rules, Opposition Scrutiny, and Policy Moderation in French Municipalities"

Please join us this Wednesday, October 21st, when we will have a change in the schedule. We are happy to have Andy Eggers (Department of Government) presenting a talk titled "Electoral Rules, Opposition Scrutiny, and Policy Moderation in French Municipalities: An Application of the Regression Discontinuity Design." Andy has provided the following abstract for his talk:

Regression discontinuity design (RDD) is a powerful and increasingly popular approach to causal inference that can be applied when treatment is assigned deterministically based on a continuous covariate. In this talk, I will present an application of RDD from French municipalities, where the system of electing the municipal council depends on whether the city's population is above or below 3500. First I show that cities above the population cutoff have fewer uncontested elections and more opposition representation on municipal councils, consistent with expectations. I then trace the effect of these political changes -- which amount to a heightening of the scrutiny imposed on the mayor -- on policy outcomes, providing evidence that more opposition scrutiny leads to more moderate policy.
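To give a sense of the mechanics, here is a minimal sketch of a sharp regression discontinuity estimate at a population cutoff of 3500, run on simulated data. The outcome model, the size of the jump, the bandwidth, and the function names are all assumptions made for illustration. This is a generic local linear comparison, not the analysis from the talk:

```python
import random

random.seed(3)

CUTOFF = 3500      # population threshold from the abstract
BANDWIDTH = 500    # assumed window on each side of the cutoff
TRUE_JUMP = 2.0    # assumed treatment effect in the simulation

def simulate_city():
    """One hypothetical city: population and an outcome that jumps at the cutoff."""
    pop = random.uniform(2000, 5000)
    treated = pop >= CUTOFF   # assignment is a deterministic function of population
    outcome = 0.001 * (pop - CUTOFF) + (TRUE_JUMP if treated else 0.0)
    return pop, outcome + random.gauss(0.0, 1.0)

def fitted_value_at_cutoff(points):
    """OLS of outcome on (population - CUTOFF); returns the intercept."""
    xs = [p - CUTOFF for p, _ in points]
    ys = [y for _, y in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx

cities = [simulate_city() for _ in range(20000)]
left = [(p, y) for p, y in cities if CUTOFF - BANDWIDTH <= p < CUTOFF]
right = [(p, y) for p, y in cities if CUTOFF <= p <= CUTOFF + BANDWIDTH]

# The estimate is the gap between the two fitted lines at the cutoff.
rdd_estimate = fitted_value_at_cutoff(right) - fitted_value_at_cutoff(left)
print(round(rdd_estimate, 3))
```

Fitting a separate line on each side and comparing the fitted values at the threshold keeps a smooth trend in population from being mistaken for a jump. In practice the bandwidth choice matters a great deal, and the fixed value used here is arbitrary.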

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.

Posted by Matt Blackwell at 7:21 PM

October 14, 2009

The Fundamental Regret of Causal Inference

Tim Kreider at the New York Times has a short piece on what he dubs "The Referendum" and how it plagues us:

The Referendum is a phenomenon typical of (but not limited to) midlife, whereby people, increasingly aware of the finiteness of their time in the world, the limitations placed on them by their choices so far, and the narrowing options remaining to them, start judging their peers' differing choices with reactions ranging from envy to contempt. ...Friends who seemed pretty much indistinguishable from you in your 20s make different choices about family or career, and after a decade or two these initial differences yield such radically divergent trajectories that when you get together again you can only regard each other's lives with bemused incomprehension.

Those familiar with causal inference will recognize this as stemming from the Fundamental Problem of Causal Inference: we cannot observe, for one individual, both their response to treatment and their response to control. The article is an elegant look at how we grow to worry about those mysterious missing potential outcomes--the paths we didn't choose--and how we use our friends' lives to impute those missing outcomes. Kreider goes on to make this point exactly, with a beautiful quote from a novel:

The problem is, we only get one chance at this, with no do-overs. Life is, in effect, a non-repeatable experiment with no control. In his novel about marriage, "Light Years," James Salter writes: "For whatever we do, even whatever we do not do prevents us from doing its opposite. Acts demolish their alternatives, that is the paradox." Watching our peers' lives is the closest we can come to a glimpse of the parallel universes in which we didn't ruin that relationship years ago, or got that job we applied for, or got on that plane after all. It's tempting to read other people's lives as cautionary fables or repudiations of our own.

Perhaps the only response is that, while so close to us in so many respects, friends may be poor matches for gauging these kinds of effects. In any case, "Acts demolish their alternatives, that is the paradox" is the best description of the problem of causal inference that I have seen.

Posted by Matt Blackwell at 4:19 PM

October 13, 2009

An on "Bayesian Propensity Score Estimation"

We hope you can join us at the Applied Statistics workshop this Wednesday, October 14th at 12 noon, when we will be happy to have Weihua An, a graduate student in the Sociology Department here at Harvard. Weihua will be presenting "Bayesian Propensity Score Estimators: Simulations and Applications." He has provided the following abstract:

Despite their popularity, conventional propensity score estimators (PSEs) do not take into account the estimation uncertainties in the propensity score into causal inference. This paper develops Bayesian propensity score estimators (BPSEs) to model the joint likelihood of both the outcome and the propensity score in one step, which naturally incorporate such uncertainties into causal inference. Simulations show that PSEs treating estimated propensity scores as if they were known will overestimate the variation in treatment effects and result in overly conservative inference, whereas BPSEs will provide corrected variance estimation and valid inference. Compared to other direct adjustment methods (e.g., Abadie and Imbens 2009), BPSEs are guaranteed to provide positive variance estimation, more reliable in small samples, and more flexible to contain complex propensity score models. To illustrate the proposed methods, BPSEs are applied to evaluating a job training program.

The workshop will be in room K354 of CGIS, 1737 Cambridge St. The workshop starts at noon and usually wraps up around 1:30. There will be a light lunch. We hope you can make it.

Posted by Matt Blackwell at 12:53 AM

October 9, 2009

Tom Coburn can backward induce

We are a few days late to comment on the story of Senator Tom Coburn's amendment to the Commerce, Justice and Science Appropriations Bill to cut all National Science Foundation funding for the political science program and any of its missions. Choice quote (of which there are many): "...it is difficult, even for the most creative scientist, to link NSF's political science findings to the advancement of cures to cancer or any other disease." Snap.

This has received attention from the social science community and others. Even Paul Krugman, mentioned in Coburn's press release as an example of (wasteful? political?) NSF funding, has something to say about it. There's no need to rehash the arguments here, which ever-so-nicely point out that Senator Coburn doesn't really know what he's talking about nor do his arguments make a whole lot of sense.

Regardless of the arguments, I just wanted to put a graph up to put all of this in perspective. In the 111th Congress, Coburn has had very little success with his amendments:
Seven of the rejections are instances when Coburn's amendment was tabled without discussion. Most of the rejections have been of proposed budget cuts or bans on funds for certain projects. And this is just in this year. Out of all the roll call votes on Coburn-sponsored amendments in the Senate over his tenure, only 8 out of 68 have actually passed.

I understand trying to tackle his critiques, as they track with an internal debate already in the discipline. But I think it may be a tad knee-jerk to start letter-writing campaigns to our Senators. Tom Coburn knows that putting out no-win amendments is a great way to take positions in the Senate without committing to anything. Minority amendments are a costless signal of the blandest kind--even a political scientist can see that.

Posted by Matt Blackwell at 12:21 PM

October 6, 2009

Criminal tricks and sugary treats

Just in time for Halloween comes a study from the British Journal of Psychiatry by Moore, Carter and van Goozen that uses data from the British Cohort Study to estimate the effect of daily candy intake on adult violent behavior.

They find that 10-year-olds who ate candy daily were much more likely to be convicted of a violent crime by age 34 than those who did not eat candy daily. They cite this as evidence that childhood diet has an effect on adult behavior. One of their hypothesized mechanisms is that using candy as a reward for children (e.g. for behavior modification) inhibits the child's ability to delay gratification. And there is evidence that children who have problems with delayed gratification tend to score lower on a host of measures, including the SATs (see also: the marshmallow studies).

The longitudinal data gives them leverage. For instance, the authors are able to control for parenting style at age 5 along with other variables, such as various scales of behavior problems or mental abilities at age 5 (some of these were discarded in the final analysis because of their variable selection rules). These ease my main concern that "problem children" might lead to a certain type of parenting and also indicate a propensity for violent adult behavior. Their controls help to eliminate this possibility (though, I will say that I am not familiar with this literature and they use fairly complicated scales to measure these concepts).

Strangely, at least to me, they do not seem to control for parental income or socio-economic class. I have a few ideas as to why this might matter. First, candy is relatively cheap compared to a good diet, thus poorer families might be forced to choose the cheaper option when feeding their children. Second, financial pressures lead to time pressures, which could force parents to take shortcuts--feeding their children junk food because it is quick or using it to induce behavior because it is easy. Thus, parental income may matter greatly for candy intake and it also may increase propensity to commit violent crimes. I am not certain this is true, but it seems plausible and unmentioned in the paper. Even if the finding is not causal, however, it is still interesting.

Posted by Matt Blackwell at 1:48 PM

October 5, 2009

Robins on "Optimal Treatment Regimes"

Please join us this Wednesday, October 7th at the Applied Statistics workshop when we will be happy to have Jamie Robins, the Mitchell L. and Robin LaFoley Dong Professor of Epidemiology here at Harvard, who will be presenting on "Estimation of Optimal Treatment Strategies from Observational Data with Dynamic Marginal Structural Models." Jamie has passed along a related paper with the following abstract:

We review recent developments in the estimation of an optimal treatment strategy or regime from longitudinal data collected in an observational study. We also propose novel methods for using the data obtained from an observational database in one health-care system to determine the optimal treatment regime for biologically similar subjects in a second health-care system when, for cultural, logistical, or financial reasons, the two health-care systems differ (and will continue to differ) in the frequency of, and reasons for, both laboratory tests and physician visits. Finally, we propose a novel method for estimating the optimal timing of expensive and/or painful diagnostic or prognostic tests. Diagnostic or prognostic tests are only useful in so far as they help a physician to determine the optimal dosing strategy, by providing information on both the current health state and the prognosis of a patient because, in contrast to drug therapies, these tests have no direct causal effect on disease progression. Our new method explicitly incorporates this no direct effect restriction.

A copy of the paper is also available.

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.

Posted by Matt Blackwell at 11:31 AM

October 1, 2009

Repeal Power Laws

A group of students from the Machine Learning department at Carnegie Mellon took to the streets last week to protest at the G20 summit in Pittsburgh. I am afraid that their issues were not being taken seriously inside the summit. There's a first hand account and a photo set on flickr. I can't decide if my favorite is "Repeal Power Laws" or "Safer Data Mining".



Posted by Matt Blackwell at 3:35 PM

September 29, 2009

Athey on "Sponsored Search Advertising Auctions"

Please join us at the Applied Statistics workshop this Wednesday, Sept 30th when we will be delighted to have the distinguished Susan Athey, Professor of Economics here at Harvard, presenting on "A Structural Model of Equilibrium and Uncertainty in Sponsored Search Advertising Auctions" (joint work with Denis Nekipelov). Susan has passed along the following abstract:

Sponsored links that appear beside internet search results on the major search engines are sold using real-time auctions, where advertisers place standing bids that are entered in an auction each time a user types in a search query. The ranking of advertisements and the prices paid depend on advertiser bids as well as "quality scores" that are assigned for each advertisement and user query. Existing models assume that bids are customized for a single user query and the associated quality scores; however, in practice that is impossible, as queries arrive more quickly than advertisers can change their bids, and advertisers cannot perfectly predict changes in quality scores. This paper develops a new model where bids apply to many user queries, while the quality scores and the set of competing advertisements may vary from query to query. In contrast to existing models that ignore uncertainty, which produce multiplicity of equilibria, we provide sufficient conditions for existence and uniqueness of equilibria, and we provide evidence that these conditions are satisfied empirically. We show that the necessary conditions for equilibrium bids can be expressed as an ordinary differential equation.
We then propose a structural econometric model. With sufficient uncertainty in the environment, the valuations are point-identified, otherwise, we propose a bounds approach. We develop an estimator for bidder valuations, which we show is consistent and asymptotically normal. We provide Monte Carlo analysis to assess the small sample properties of the estimator. We also develop a tractable computational approach to calculate counterfactual equilibria of the auctions.
Finally, we apply the model to historical data for several keywords. We show that our model yields lower implied valuations and bidder profits than approaches that ignore uncertainty. We find that bidders have substantial strategic incentives to reduce their expressed demand in order to reduce the unit prices they pay in the auctions, and in addition, these incentives are asymmetric across bidders, leading to inefficient allocation. We show that for the keywords we study, the auction mechanism used in practice is not only strictly less efficient than a Vickrey auction, but it also raises less revenue.

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.

Posted by Matt Blackwell at 10:47 AM

September 23, 2009

The placebo effect is growing?

Wired has a fascinating article about the placebo effect and how the pharmaceutical companies deal with it. Not only is there evidence that the placebo effect is growing (some drugs approved in the 80s and 90s would struggle to pass the FDA now), but it turns out there may be significant geographic differences in the strength of the effect:

Assumption number one was that if a trial were managed correctly, a medication would perform as well or badly in a Phoenix hospital as in a Bangalore clinic. Potter discovered, however, that geographic location alone could determine whether a drug bested placebo or crossed the futility boundary. By the late '90s, for example, the classic antianxiety drug diazepam (also known as Valium) was still beating placebo in France and Belgium. But when the drug was tested in the US, it was likely to fail. Conversely, Prozac performed better in America than it did in western Europe and South Africa. It was an unsettling prospect: FDA approval could hinge on where the company chose to conduct a trial.

I'm not sure how you separate out the geographic confounding of the drug response versus the geographic confounding of the placebo response when looking at differences between the two, but it is interesting nonetheless.

(via kottke)

UPDATE: I just wanted to clarify why I thought this article was interesting so that folks do not think that I believe all the analysis contained in the article. The "effect" of the placebo treatment is clearly nonsensical, as effects always need to be about comparisons. What is identified from a clinical trial is the difference between the placebo response and the treatment response. My interpretation of the article (which is different from the author's interpretation) is that there is a lot of variation in that difference, both over time and over geography within the same drug. Since I have not read the academic articles that inform the article, I'm not sure if this variation is about what we would expect or not given sampling variation, but the possibility of a systematic relationship is intriguing.

As Kevin notes in the comments below, there are some that are criticizing the article. It took a bit of searching (not that simple!), but I found a good response:


The author of the response claims that variation in the placebo response is simply sampling variance.
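To get a feel for how much the drug-minus-placebo difference can move around from sampling variation alone, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the function name simulate_trial, the normal outcome model, the arm size of 100, and the true difference of 0.5 are hypothetical and are not taken from the Wired article or the response to it.

```python
import random
from statistics import mean, stdev


def simulate_trial(rng, n_per_arm, placebo_mean, drug_mean, sd):
    """Simulate one two-arm trial and return the drug-minus-placebo difference in means."""
    placebo = [rng.gauss(placebo_mean, sd) for _ in range(n_per_arm)]
    drug = [rng.gauss(drug_mean, sd) for _ in range(n_per_arm)]
    return mean(drug) - mean(placebo)


# Hypothetical setup: the true difference is fixed at 0.5 in every trial,
# so any spread in the estimated differences is pure sampling variation.
rng = random.Random(0)
diffs = [simulate_trial(rng, 100, 1.0, 1.5, 1.0) for _ in range(200)]

print("average estimated difference:", round(mean(diffs), 3))
print("spread (sd) across trials:", round(stdev(diffs), 3))
print("smallest and largest:", round(min(diffs), 3), round(max(diffs), 3))
```

Even with an identical true difference in every simulated trial, the estimates scatter noticeably, which is the benchmark one would want before reading variation over time or geography as systematic.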

Posted by Matt Blackwell at 10:04 AM

September 21, 2009

Van Alstyne on "Network Structure and Information Advantage"

Please join us this Wednesday, September 23rd at the Applied Statistics Workshop when we will be fortunate to have Marshall Van Alstyne presenting "Network Structure and Information Advantage: The Diversity--Bandwidth Tradeoff." Marshall is an Associate Professor at Boston University in the Department of Management Information Systems as well as Research Associate at MIT's Center for E-Business. Marshall passed along the following abstract:

To get novel information, we propose that actors in brokerage positions face a tradeoff between network diversity and communication channel bandwidth. As the structural diversity of a network increases, the bandwidth of communication channels in that network decreases, creating countervailing effects on the receipt of novel information. This argument is based on the observation that diverse networks are typically made up of weaker ties, characterized by narrower communication channels across which less diverse information is likely to flow. The diversity-bandwidth tradeoff is moderated by (a) the degree to which topics are uniformly or heterogeneously distributed over the alters in a broker's network, (b) the dimensionality of the information in a broker's network (whether the total number of topics communicated by alters is large or small) and (c) the rate at which the information possessed by a broker's contacts refreshes or changes over time. We test this theory by combining social network and performance data with direct observation of information content flowing through email channels at a medium sized executive recruiting firm. These analyses unpack the mechanisms that enable information advantages in networks and serve as a 'proof-of-concept' for using email content data to analyze relationships among information flows, networks, and social capital.

A copy of the paper is also available.

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.

Posted by Matt Blackwell at 10:26 AM

September 15, 2009

Goodrich on "Bringing Rank-Minimization Back In"

Please join us tomorrow, September 16th when we are excited to have Ben Goodrich (Government/Social Policy) presenting "Bringing Rank-Minimization Back In: An Estimator of the Number of Inputs to a Data-Generating Process," for which Ben has provided the following abstract:

This paper derives and implements an algorithm to infer the number of inputs to a data-generating process from the outputs. Previous work dating back to the 1930s proves that this inference can be made in theory, but the practical difficulties have been too daunting to overcome. These obstacles can be avoided by looking at the problem from a different perspective, utilizing some insights from the study of economic inequality, and relying on modern computer technology.

Now that there is a computational algorithm that can estimate the number of variables that generated observed outcomes, the scope for applications is quite large. Examples are given showing its use for evaluating the reliability of measures of theoretical concepts, empirically testing formal models, verifying whether there is an omitted variable in a regression, checking whether proposed explanatory variables are measured without error, evaluating the completeness of multiple imputation models for missing data, and facilitating the construction of matched pairs in randomized experiments. The algorithm is used to test the main hypothesis in Esping-Andersen (1990), which has been influential in the political economy literature, namely that various welfare-state outcomes are a function of only three underlying variables.

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm.

We hope you can make it.

Posted by Matt Blackwell at 10:47 AM

September 8, 2009

Grimmer on "Quantitative Discovery from Qualitative Information"

Please join us tomorrow, September 9th for our first workshop of the year when we are happy to have Justin Grimmer presenting joint work with Gary King entitled "Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology."

Justin and Gary have provided the following abstract for their paper:

Many people attempt to discover useful information by reading large quantities of unstructured text, but because of known human limitations even experts are ill-suited to succeed at this task. This difficulty has inspired the creation of numerous automated cluster analysis methods to aid discovery. We address two problems that plague this literature. First, the optimal use of any one of these methods requires that it be applied only to a specific substantive area, but the best area for each method is rarely discussed and usually unknowable ex ante. We tackle this problem with mathematical, statistical, and visualization tools that define a search space built from the solutions to all previously proposed cluster analysis methods (and any qualitative approaches one has time to include) and enable a user to explore it and quickly identify useful information. Second, in part because of the nature of unsupervised learning problems, cluster analysis methods are not routinely evaluated in ways that make them vulnerable to being proven suboptimal or less than useful in specific data types. We therefore propose new experimental designs for evaluating these methods. With such evaluation designs, we demonstrate that our computer-assisted approach facilitates more efficient and insightful discovery of useful information than either expert human coders using qualitative or quantitative approaches or existing automated methods. We (will) make available an easy-to-use software package that implements all our suggestions.

The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm.

Posted by Matt Blackwell at 12:00 PM

August 20, 2009

The changing nature of R resources

There was a time that the only place to find R help was through the R-help listserv. But things have changed pretty drastically in just the last year or so as R has gained users from all different disciplines. I wanted to just point out a few resources that I have found useful over the last few months.

The #rstats hashtag on Twitter has a good following and a number of consistent contributors. If you already use Twitter, this is a great way to hear about interesting new applications of R or the growing number of R tutorials and meetups (Los Angeles and New York have already had a few well attended meetups).

Partially born from the #rstats group is the R tag on StackOverflow, a website dedicated to asking and answering programming questions. The R questions have only recently started to appear on StackOverflow, but if it takes off, it might be a smarter way to match up R users who need help and R experts who can help. The site has voting on answers so that unhelpful or repetitive answers will be weeded out. And since all of this is on one website, searching through the questions is quite a bit easier than trying to track down an R-help thread from 2004. Exciting stuff.

Posted by Matt Blackwell at 1:55 PM

May 13, 2009

Natural Languages

The social sciences have long embraced the idea of text-as-data, but in recent years, increasing numbers of quantitative researchers are investigating how to have computers find answers to questions in texts. This task might appear easy at the outset (as it apparently did to early researchers in machine translation), but, as we know, natural languages are incredibly complicated. In most of the applications in social science, analysts end up making a "bag of words" assumption--the relevant parts of a document are the actual words, not their order (this is not an unreasonable assumption, especially given the questions being asked).
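As a small illustration of what the "bag of words" assumption throws away, here is a minimal sketch using only the Python standard library. The function name bag_of_words and the two example sentences are made up for this illustration; real applications typically add steps such as stemming and stop-word removal.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a document to word counts, discarding word order entirely."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Two hypothetical sentences with opposite meanings but identical words.
a = bag_of_words("The senator opposed the bill.")
b = bag_of_words("The bill opposed the senator.")

print(a)
print("same bag of words:", a == b)
```

The two sentences collapse to the same counts, which is exactly the information the assumption gives up. For questions about topics or overall content that loss is often tolerable.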

When I see applications of natural language processing (NLP) in the social sciences, I typically think very quickly about its future. Computers are making strides at being able to understand, in some sense, what they are reading. Two recent articles, however, give a good overview of the challenges that NLP faces. First, John Seabrook of the New Yorker had an article last summer, Hello, Hal, which states the problem clearly:

The first attempts at speech recognition were made in the nineteen-fifties and sixties, when the A.I. pioneers tried to simulate the way the human mind apprehends language. But where do you start? Even a simple concept like "yes" might be expressed in dozens of different ways--including "yes," "ya," "yup," "yeah," "yeayuh," "yeppers," "yessirree," "aye, aye," "mmmhmm," "uh-huh," "sure," "totally," "certainly," "indeed," "affirmative," "fine," "definitely," "you bet," "you betcha," "no problemo," and "okeydoke"--and what's the rule in that?

The article is mostly about speech recognition, but it definitely hits the main points about why human-generated language is so tricky. The second article, in the New York Times recently, is a short story about Watson, the computer that IBM is creating to compete on Jeopardy! IBM is trying to push the field of Question Answering quite a bit forward with this challenge. The goal is to create a computer that you can ask a natural language question and get the correct answer. A quick story in the article indicates that they may have a bit to go:

In a demonstration match here at the I.B.M. laboratory against two researchers recently, Watson appeared to be both aggressive and competent, but also made the occasional puzzling blunder.

For example, given the statement, "Bordered by Syria and Israel, this small country is only 135 miles long and 35 miles wide," Watson beat its human competitors by quickly answering, "What is Lebanon?"

Moments later, however, the program stumbled when it decided it had high confidence that a "sheet" was a fruit.

This whole Watson enterprise makes me wonder if there are applications for this kind of technology within the social sciences. Would this only be useful as a research aid, or are there empirical discoveries to be made with this? I suppose it comes down to this: if a computer could answer your question, what would you ask?

Posted by Matt Blackwell at 9:43 AM

March 25, 2009

How to teach methods

Over on the polmeth mailing list there is a small discussion brewing about how to teach undergraduate methods classes. Much of the discussion is on how to manage the balance between computation and statistics. A few posters are using R as their main data analysis tool, which provoked others to comment that this might push a class too far away from its original intent: to learn research methods (although one teacher of R indicated that a bigger problem was the relative inability to handle .zip files). This got me thinking about how research methods, computing and statistics fit into the current education framework.

As a gross and unfair generalization, much of college is about learning how to take a set of skills and use them to make effective and persuasive arguments. In a literature class, for instance, one might use the skills of reading and writing to critically engage a text. In mathematics, one might take the "skill" of logic and use it to derive a proof.

The issue with introductory methods classes is that many undergraduates come into school without a key skill: computing. It is becoming increasingly important to have proficient computing skills in order to make cogent arguments with data. I wonder if it is time to rethink how we teach computing at lower levels of education to adequately prepare students for the modern workplace. There is often emphasis on using computers to teach students, but I think it will become increasingly important to teach computers to students. This way courses on research methods can focus on how to combine computing and statistics in order to answer interesting questions. We could spend more time matching tools to questions and less time simply explaining the tool.

Of course, my argument reeks of passing the buck. A broader question is this: where do data analysis and computing fit in the education model? Is this a more fundamental skill that we should build up in children earlier? Is it perfectly fine where it is, being taught in college?

Posted by Matt Blackwell at 3:08 PM

March 11, 2009

Differences-in-Differences in the Rubin Causal Model

At today's Applied Statistics Workshop, Dan Hopkins gave a talk on contextual effects on political views in the United States and United Kingdom. Dan presented evidence that national political discussions increase the salience of local context for opinion formation. Namely, those who live in areas of high immigrant populations tend to react more strongly to changes in the national discussion of immigration than others. The data and analysis are interesting, but the talk's derailment interested me slightly more.

The derailment involved Dan's choice of method, a version of the difference-in-differences (DID) estimator, and how to represent it in the Rubin Causal Model. Putting this model in terms of the usual counterfactual framework is slightly nuanced, but not impossible.

The typical setup for a DID estimator is that there are two groups G = {0,1} and two time periods T={0,1}. Between time 0 and time 1, some policy is applied to group 1 and not applied to group 0. What we are interested in is the effect of that policy. For instance, if Y is the outcome in time 1 and Y(1) is the potential outcome (in time 1) in the counterfactual world where we forced the policy to be implemented, then we can define a possible quantity of interest: the average treatment effect on the treated (ATT): E[Y(1) - Y(0) | G = 1].

We could proceed from here by simply making an ignorability assumption about the treatment assignment. Unfortunately, policies are often not randomly assigned to the groups and the groups may differ in ways that affect the outcome. For instance, an example from the Wooldridge textbook is the effect of the placement of a trash processing facility on house prices. The two groups in this case are "houses close to the facility" and "houses far from the facility" and the policy is the facility's placement. It would be borderline insane to imagine city planners randomly assigning the location of the facility, and these two groups will differ in ways that are very related to house prices (I don't think I have seen too many newly minted trash dumps in rich neighborhoods). Thus, we cannot simply use the observed data from the control group to make the counterfactual inference.

What we can do, however, is look at how changes in the dependent variable occur for the two groups and use these changes to identify the model. For instance, if we assume that X is the outcome in period 0, then the DID identifying assumption is

E[Y(0) - X(0) | G = 1] = E[Y(0) - X(0) | G = 0],

which is simply saying that the change in potential outcomes under control is the same for both groups. Or, that group 1 would have followed the same "path" as group 0 if they had not received treatment. With this assumption in hand, we can identify the ATT as the typical DID estimator

E[Y(1) - Y(0) | G = 1] = (E[Y | G = 1] - E[X | G = 1]) - (E[Y | G = 0] - E[X | G = 0]).

The proof is short and can be found in Abadie (2005); Athey & Imbens (2006) show it as well (these papers also go into considerable depth on how to extend these simple schemes).
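The sample analogue of the DID expression above is just a difference of two average changes. Here is a minimal sketch in Python; the function name did_estimate and the toy house-price numbers are hypothetical, and the code assumes each unit is observed in both periods, with x the period-0 outcome and y the period-1 outcome.

```python
from statistics import mean


def did_estimate(records):
    """Difference-in-differences from (g, x, y) tuples.

    g is the group indicator (1 = treated, 0 = control),
    x is the outcome in period 0 and y is the outcome in period 1.
    """
    def avg_change(group):
        changes = [y - x for g, x, y in records if g == group]
        if not changes:
            raise ValueError(f"no observations for group {group}")
        return mean(changes)

    # (E[Y|G=1] - E[X|G=1]) - (E[Y|G=0] - E[X|G=0])
    return avg_change(1) - avg_change(0)


# Hypothetical toy data: house prices (in thousands) before and after.
houses = [
    (1, 100.0, 104.0),  # close to the facility
    (1, 120.0, 122.0),
    (0, 200.0, 210.0),  # far from the facility
    (0, 180.0, 192.0),
]

att_hat = did_estimate(houses)
print("DID estimate of the ATT:", att_hat)
```

In this toy example the treated houses gained 3 on average while the control houses gained 11, so the estimate is -8. It only has a causal reading if the identifying assumption holds, that is, if the treated houses would also have gained 11 without the facility.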

Two issues always arise for me when I see DID estimators. First is the incredibly difficult task of arguing that the policy is the only thing that changed between time 0 and time 1 with respect to the two groups. That is, perhaps the city also placed a freeway through the part of town where the trash processing facility was built at the same time. The DID estimator would not be able to differentiate effects. Thus, it is up to the practitioner to argue that all other changes in the period are orthogonal to the two groups. Second, I have very little insight about how identification or estimands change as we move from a simple non-parametric world to a highly parametric world (where most applied researchers live). If and how do inferences change when we move away from simple conditional expectations?

Posted by Matt Blackwell at 2:13 PM

February 25, 2009

Missingness Maps and Cross Country Data

I've been doing some work on diagnostics for missing data issues and one that I have found particularly useful and enlightening has been what I've been calling a "missingness map." In the last few days, I used it on some World Bank data I downloaded to see what missingness looks like in a typical comparative political economy dataset.


View image

The rows (y-axis) here are country-years and the columns (x-axis) are variables. We draw a red square where the country-year-variable cell is missing and a light green square where the cell is observed. We can see immediately that a whole set of variables in the middle columns are almost always unobserved. These are variables measuring income inequality and they are known to have extremely poor coverage. This plot very quickly shows us how listwise deletion will affect our analyzed sample and how the patterns of missingness occur in our data. For example, in these data, it seems that if GDP is missing, then many of the other variables, such as imports and exports, are also missing. I think this is a neat way to get a quick, broad view of missingness.
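For readers who want to try the idea without any plotting tools, here is a rough text-only sketch in Python. It is not the Amelia implementation; the function names, the variable names, and the toy records are hypothetical, with `None` standing in for a missing cell.

```python
def missingness_map(rows, variables, missing="x", observed="."):
    """Return one text line per row, with one character per variable."""
    return [
        "".join(missing if row.get(v) is None else observed for v in variables)
        for row in rows
    ]


def listwise_share(rows, variables):
    """Share of rows with no missing cells, i.e. rows that survive listwise deletion."""
    complete = sum(all(row.get(v) is not None for v in variables) for row in rows)
    return complete / len(rows)


# Hypothetical country-year records; None marks a missing cell.
toy = [
    {"country": "A", "year": 1980, "gdp": None, "imports": None, "gini": None},
    {"country": "A", "year": 1990, "gdp": 2.1, "imports": 0.4, "gini": None},
    {"country": "B", "year": 1980, "gdp": 1.3, "imports": 0.2, "gini": None},
    {"country": "B", "year": 1990, "gdp": 1.9, "imports": 0.3, "gini": 41.0},
]
variables = ["gdp", "imports", "gini"]

for line in missingness_map(toy, variables):
    print(line)
print(listwise_share(toy, variables))
```

Each printed line is one country-year and each character is one variable, so a column of `x` marks plays the role of the red band in the plot. In this toy example only one of the four rows is complete, so listwise deletion would keep a quarter of the data.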

(Another map and some questions after the jump...)

We can also change the ordering of the rows to give a better sense of missingness. For the World Bank data, it is wise to re-sort the data by time and see how missingness changes over time.


View image

A clear pattern emerges that the World Bank has better and better data as we move forward in time (the map becomes more "clear"). This is not surprising, but it is an important point when, say, deciding the population under study in a comparative study. Clearly, listwise deletion will radically change the sample we analyze (the answers will be biased toward more recent data, at the very least). The standard statistical advice of imputation or data augmentation is tricky as well here because we need to choose what to impute. Should we carry forth with imputation given that income inequality measures seem to be completely unavailable before 1985? If we remove observations before this, how do we qualify our findings?
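The time pattern can also be summarized numerically. The sketch below, again in Python with made-up records and hypothetical names, groups cells by year and reports the fraction missing in each year, in chronological order.

```python
from collections import defaultdict


def missing_share_by_year(rows, variables):
    """Map each year to the fraction of its cells that are missing."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        for v in variables:
            total[row["year"]] += 1
            if row.get(v) is None:
                missing[row["year"]] += 1
    # Sorting the years mirrors re-sorting the rows of the map by time.
    return {year: missing[year] / total[year] for year in sorted(total)}


# Hypothetical country-year records; None marks a missing cell.
records = [
    {"country": "B", "year": 1990, "gdp": 1.9, "gini": 41.0},
    {"country": "A", "year": 1980, "gdp": None, "gini": None},
    {"country": "A", "year": 1990, "gdp": 2.1, "gini": None},
    {"country": "B", "year": 1980, "gdp": 1.3, "gini": None},
]
shares = missing_share_by_year(records, ["gdp", "gini"])
for year, share in shares.items():
    print(year, share)
```

A falling share across years is the numeric counterpart of the map becoming more "clear". It says nothing about whether the earlier years should be imputed or dropped; that choice still has to be argued on substantive grounds.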

Any input on the missingness map would be amazing, as I am trying to add it as a diagnostic to a new version of Amelia. What would make these plots better?

Posted by Matt Blackwell at 2:58 PM

February 3, 2009

What is Japan doing at 2:04pm?

You can now answer that question and so many more. The Japanese Statistics Bureau conducts a survey every five years called the "Survey on Time Use and Leisure Activities" where they give people journals to record their activities throughout the day. Thus, they have a survey of what people in Japan are doing at any given time of the day. This is fun data in and of itself, but it was made downright addictive by Jonathan Soma, who created a slick Stream Graph based on the data. (via kottke)

There are actually three Stream Graphs: one for the various activities, another for how the current activity differs between sexes and a final one for how the current activity breaks down by economic status. Thus, the view contains not only information about daily routines, but also how those routines vary across sex and economic status. For instance, gardening tends to happen in the afternoon and evening at around equal intensity and is fairly evenly distributed between men and women. Household upkeep, on the other hand, is done mostly by women and mostly in the morning. This visualization is so compelling, I think, because it allows for deep exploration of rich and interesting data (to be honest, though, I find the economic status categories a little strange and not incredibly useful).

Two points come to mind when seeing this. First, it would be fascinating to see how these would look across countries, even if it were just one other country. The category of this survey on the website for the Japanese Bureau of Statistics is "culture." Seeing the charts actually makes me wonder how different this culture is from other countries. Soma does point out, though, that Japanese men are rather interested in "productive sports," which is perhaps unique to the island.

Second, I think that Stream Graphs might be useful for other time-based data types. Long-term survey projects, such as the General Social Survey, track respondent spending priorities. It seems straightforward to use a Stream Graph to capture how priorities shift over time. Other implemented Stream Graphs are the NYT box-office returns data and Lee Byron's last.fm playlist data. This graph type seems best suited for showing how different categories change over time and how rapidly they grow and how quickly they shrink. They also seem to require some knowledge of Processing. There are still some open questions here: What other types of social science data might these charts be useful for? Should we incorporate uncertainty, and if so, how? (Soma warns that the Japan data is rather slim on the number of respondents.)
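To give a sense of what sits underneath such a chart, here is a small Python sketch of the stacking step, assuming the simplest possible rule: layers are stacked in order around a baseline centered on zero. The published Stream Graphs may well use a different baseline rule, and the categories and numbers below are invented rather than taken from the General Social Survey.

```python
def stream_layers(series):
    """Stack category series around a centered baseline.

    series maps a category name to a list of non-negative values, one per
    time point. Returns a dict mapping each category to a list of
    (lower, upper) bounds, stacked in insertion order.
    """
    names = list(series)
    n_points = len(series[names[0]])
    bounds = {name: [] for name in names}
    for t in range(n_points):
        total = sum(series[name][t] for name in names)
        lower = -total / 2  # centered baseline: half of the total sits below zero
        for name in names:
            upper = lower + series[name][t]
            bounds[name].append((lower, upper))
            lower = upper
    return bounds


# Hypothetical spending-priority counts at three survey waves.
toy = {"health": [2, 4, 6], "defense": [4, 4, 2], "education": [2, 2, 2]}
layers = stream_layers(toy)
for name, band in layers.items():
    print(name, band)
```

Each (lower, upper) pair is the vertical extent of one category's band at one time point, so the thickness of a band is the category's value and the bands together stay symmetric around zero. Drawing and smoothing the bands is a separate step that this sketch leaves out.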

Also: October 18th is Statistics Day in Japan. There are posters. And a slogan: "Statistical Surveys Owe You and You Owe Statistical Data"!

Posted by Matt Blackwell at 5:37 PM

November 19, 2008

Election Wrap-up: Ballot Design

I like the noise of democracy.
--James Buchanan

There has been quite a bit of popular and scholarly interest in the mechanics of voting over the last decade, especially after the 2000 Florida Presidential election threw the concepts of butterfly ballots, residual votes and chads into the spotlight. The recount of the U.S. Senate race in Minnesota between Norm Coleman and Al Franken has brought the voting-error fun right on back. Minnesota Public Radio has compiled a list of challenged ballots for you to judge (via kottke). You can even use the Minnesota state statutes governing voter's intent. I think the write-in for "Lizard People" is one of the best.

It is refreshing to see that in spite of all of the attention toward electronic voting problems, the old paper method can still make a mess. Things have changed a bit since the blanket ballots of the nineteenth century, but ballot design still has quite a few problems. The most obvious case is the butterfly ballot of Palm Beach County in 2000, which almost certainly changed the outcome of the presidential election (see Wand et al. (2001)). Laurin Frisina, Michael Herron, James Honaker, and Jeff Lewis recently published an article in the Election Law Journal about undervoting in Florida's 13th Congressional District, a phenomenon they attribute to poor (electronic) ballot design. Other examples abound.

The good folks at AIGA put together an interactive guide for designing ballots and the problems with current designs. A lot of these suggestions are really spot on and would help to solve a lot of the errors in the Minnesota ballots. Especially important are the "if you make a mistake..." guidelines. This was posted at the New York Times in late August, which seems to me to be plenty of time for registrars to get these issues worked out. On the other hand, some of the Minnesota ballot problems do seem to transcend clear design. Depressingly, this probably brings a smile to the faces of anti-plebeian elites.

If you are a sucker, like me, for images of old ballots, you can find plenty of old California ballots at the voting technology project. Melanie Goodrich put this together. The real gem of this collection is the Regular Cactus Ticket of 1888.

Posted by Matt Blackwell at 2:10 PM