Authors' Committee

Chair:

Matt Blackwell (Gov)

Members:

Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship


28 April 2013

App Stats: Roberts, Stewart, and Tingley on "Topic models for open ended survey responses with applications to experiments"

We hope you can join us this Wednesday, May 1, 2013 for the Applied Statistics Workshop. Molly Roberts, Brandon Stewart, and Dustin Tingley, all from the Department of Government at Harvard University, will give a presentation entitled "Topic models for open ended survey responses with applications to experiments". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"Topic models for open ended survey responses with applications to experiments"
Molly Roberts, Brandon Stewart, and Dustin Tingley
Government Department, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, May 1st, 2013 12.00 pm

Abstract:

Despite broad use of surveys and survey experiments by political science, the vast majority of survey analysis deals with responses to options along a scale or from pre-established categories. Yet, in most areas of life individuals communicate either by writing or by speaking, a fact reflected in earlier debates about open and closed ended survey questions. Despite good reasons to collect and analyze open ended data, it is relatively rare in the discipline and almost exclusively done through a process involving human coding of survey responses. We present an alternative, semi-automated approach, the Structural Topic Model (STM) (Roberts et al. 2013), that draws on recent developments in machine learning based analysis of textual data. A crucial contribution of the method is that it incorporates information about the text, such as the author's gender, country of origin, treatment status, or when something was written. This paper focuses on how the STM is extremely helpful for descriptive, exploratory, or inferential purposes for survey researchers and experimentalists. The STM makes analyzing open ended responses easier, more revealing, and capable of being used to estimate treatment effects. We illustrate these innovations with several experiments.

Posted by Konstantin Kashin at 11:25 PM | Comments (2)

22 April 2013

App Stats: Vadhan on "Privacy Tools for Sharing Research Data"

We hope you can join us this Wednesday, April 24, 2013 for the Applied Statistics Workshop. Salil Vadhan, Professor of Computer Science and Applied Mathematics from the School of Engineering & Applied Sciences at Harvard University, will give a presentation entitled "Privacy Tools for Sharing Research Data". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"Privacy Tools for Sharing Research Data"
Salil Vadhan
School of Engineering & Applied Sciences, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, April 24th, 2013 12.00 pm

Abstract:

I will give an overview of a large, new multidisciplinary project at Harvard on "Privacy Tools for Sharing Research Data." The project is a collaborative effort between the Center for Research on Computation and Society, the Institute for Quantitative Social Science, and the Berkman Center for Internet and Society, and is funded as a Frontier grant in the NSF Secure and Trustworthy Cyberspace Program, building on seed funding from Google. The goal of the project is to help enable the collection, analysis, and sharing of personal data for research in social science and other fields while providing privacy for individual subjects. Bringing together computer science, social science, statistics, and law, we seek to refine and develop definitions and measures of privacy and data utility, and design an array of technological, legal, and policy tools for social scientists to use when dealing with sensitive data. These tools will be tested and deployed at the Harvard Institute for Quantitative Social Science's Dataverse Network, an open-source digital repository that offers the largest catalogue of social science datasets in the world. In addition to contributing to research infrastructure for social scientists around the world, the ideas developed in the project may benefit society more broadly as it grapples with data privacy issues in many other domains, including public health and electronic commerce.

Posted by Konstantin Kashin at 12:14 AM | Comments (3)

15 April 2013

Guest Post by Patrick Lam on "Estimating Individual Causal Effects"

Last week, I gave the applied statistics talk at IQSS on some of my research on estimating individual causal effects. Since there was some interest from folks who could not attend, I thought I would give a brief overview of my argument and research.

In the majority of empirical research, the quantity of interest is likely to be some type of average treatment effect, estimated either through a regression model or some other clever research design. For example, we often run a regression of an outcome Y on some treatment W and covariates X and interpret the beta coefficient on W as the "effect" of W on Y, given assumptions of ignorability of treatment assignment and no interference across units. While this average treatment effect, or ATE (and its fancier cousins ATT, ATC, CATE, LATE, etc.), is the easiest causal quantity to estimate, I argue that an ATE is not a very useful or interpretable quantity. Define an individual causal effect (ICE) as Y_i(1) - Y_i(0) for any individual i. An ATE is simply the average of all the individual causal effects in the data or in some larger population: E[Y(1) - Y(0)]. An ATE is not the effect for any specific individual or group of individuals. It is not even the effect for the average individual. However, we often have an implicit tendency to treat the ATE as THE EFFECT for any individual, which is only true if we make the usually unreasonable assumption of constant treatment effects. In short, the ATE is a one-number summary that applies to exactly no individual of interest.

To see this in a trivial and simple example, suppose we have a female birth control pill that in reality prevents pregnancy for every woman who takes it. Now suppose that we didn't know that, but we wanted to test how effective the pill was. So we randomly assign the pills to an evenly split sample of men and women. Our results would suggest that the pill was effective in preventing pregnancy approximately 50% of the time. We would then conclude based on the data that the pill is only effective half the time and thus is basically useless as a contraceptive. However, it is trivially obvious that the 50% result is derived from a 100% success rate for women and a 0% success rate for men. The 50% result is not the success rate for any individual, and estimating the ATE masked important treatment effect heterogeneity.
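
To make the arithmetic concrete, here is a minimal simulation of that hypothetical experiment. This is my own sketch, not something from the talk, and all variable names are invented:

    import numpy as np

    # Minimal simulation of the hypothetical pill experiment above.
    rng = np.random.default_rng(0)
    n = 10_000
    female = rng.integers(0, 2, size=n).astype(bool)    # roughly half women, half men
    treated = rng.integers(0, 2, size=n).astype(bool)   # pill assigned at random

    # Potential outcomes (1 = pregnancy prevented): the pill works for every
    # woman (ICE = 1) and for no man (ICE = 0).
    y1 = female.astype(int)
    y0 = np.zeros(n, dtype=int)
    y_obs = np.where(treated, y1, y0)

    ate_hat = y_obs[treated].mean() - y_obs[~treated].mean()
    print(f"Estimated ATE: {ate_hat:.2f}")               # about 0.5 -- "half the time"

    # The 0.5 averages over two very different groups:
    for label, grp in [("women", female), ("men", ~female)]:
        diff = y_obs[treated & grp].mean() - y_obs[~treated & grp].mean()
        print(f"Effect among {label}: {diff:.2f}")       # 1.0 for women, 0.0 for men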

One way to account for this heterogeneity is by estimating the conditional average treatment effect (CATE). In this example, we would condition on gender and estimate an average treatment effect for men and one for women. This requires leveraging additional information and defining a variable to condition upon. This is a top-down approach in which we subset the data in some way and then estimate an ATE. Of course, the example here is trivial, but in most empirical research, it may not be obvious which variables to condition on. Furthermore, the CATE still assumes a constant treatment effect for all individuals within the same covariate strata.

I argue for a different bottom-up approach in which we try to estimate each of the individual causal effects directly. The benefits of directly estimating the ICEs are that

1) they directly estimate the actual quantities of interest, such as an effect for a certain individual or group of individuals;
2) they allow for discovery of treatment effect heterogeneity through graphical and exploratory approaches;
3) they bridge the gap between quantitative and qualitative research by allowing for small-n estimands in a large-n framework; and
4) any other causal quantity, such as an ATE, can be calculated directly from the ICEs, so estimating ICEs is a more flexible approach.

Of course, the main problem with estimating ICEs is that they are not identified in the data, so the data, strictly speaking, give no information about the likelihood of any particular value for any ICE.

To estimate the ICEs, I introduce a broad framework that leverages the usual causal inference assumptions of treatment assignment ignorability and SUTVA and uses existing matching methods coupled with a Bayesian framework to give hints and uncertainty intervals for the ICEs. The Bayesian approach allows for prior qualitative information to be incorporated and also sidesteps the identification issue by defining a posterior over the ICEs. None of the methods used are new, and many date back several decades. But I argue that we can put these existing methods together in a novel way to estimate quantities that are much more important and relevant to researchers.

The basic idea of the estimation process is to impute the missing potential outcomes for each individual. Once the outcomes are imputed, the ICEs can be calculated in a straightforward manner. The matching algorithms define pools of observations that we can use to help with the imputation, and the Bayesian framework gives us uncertainty for the imputations that incorporates both the uncertainty from the matching algorithms and the usual estimation uncertainty. The idea of Bayesian imputation of missing potential outcomes dates back at least to Rubin (1978), and Don has actually told me a few times that the imputation idea in general goes back much further than that, at least to Neyman. The matching idea and the algorithms used also date back at least to Don's work in the 1970s.
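
To fix ideas, here is a rough, self-contained sketch of that general recipe in Python. It is my own drastic simplification, not Patrick's actual model: k-nearest-neighbor matching on covariates stands in for whatever matching algorithm is used, a textbook normal model with a noninformative prior stands in for the full Bayesian machinery, and the function name and toy data are invented for illustration.

    import numpy as np

    def impute_ices(X, W, Y, k=10, n_draws=1000, seed=0):
        """For each treated unit, impute its missing Y_i(0) from the k nearest
        controls (Euclidean distance on X) using a normal model with draws of
        the pool's mean and variance, and return posterior draws of the ICE."""
        rng = np.random.default_rng(seed)
        treated_idx = np.where(W == 1)[0]
        controls = np.where(W == 0)[0]
        ice_draws = np.empty((len(treated_idx), n_draws))
        for row, i in enumerate(treated_idx):
            # Matching step: pool the k controls closest to unit i in covariate space.
            dist = np.linalg.norm(X[controls] - X[i], axis=1)
            pool = Y[controls[np.argsort(dist)[:k]]]
            # Bayesian step (noninformative prior): draw the pool's variance,
            # then its mean, then a posterior-predictive draw of Y_i(0).
            m, s2 = pool.mean(), pool.var(ddof=1)
            sigma2 = s2 * (k - 1) / rng.chisquare(k - 1, size=n_draws)
            mu = rng.normal(m, np.sqrt(sigma2 / k))
            y0_draws = rng.normal(mu, np.sqrt(sigma2))
            ice_draws[row] = Y[i] - y0_draws   # ICE draw = observed Y_i(1) - imputed Y_i(0)
        return treated_idx, ice_draws

    # Toy data: one covariate, true individual effect equal to 2 * X_i.
    rng = np.random.default_rng(1)
    n = 500
    X = rng.normal(size=(n, 1))
    W = rng.integers(0, 2, size=n)
    Y = X[:, 0] + 2 * X[:, 0] * W + rng.normal(scale=0.5, size=n)

    idx, draws = impute_ices(X, W, Y)
    print("Posterior mean ICE for the first treated unit:", draws[0].mean())
    print("True ICE for that unit:", 2 * X[idx[0], 0])
    print("Implied average effect among the treated:", draws.mean())

The matching pool gives a set of plausible control outcomes for each treated unit, and the posterior-predictive draws turn the point imputation into a distribution, which is where the uncertainty intervals for each ICE come from.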

In my talk, I introduced a (hopefully coherent) framework that laid out the assumptions and a model to estimate the ICEs. I also conducted many simulations to test the ability of the model to recover the ICEs and tested several matching specifications. The results suggest that the model does a fairly good job of recovering the ICEs, although the uncertainty intervals can be quite wide. Nevertheless, they give us hints about plausible ranges of values for the ICEs, and aggregating the ICEs to estimate average effects produces results nearly identical to traditional methods. One noteworthy conclusion from the simulations is that regression imputation, in which we impute with the predicted values from a standard linear regression, generally produces good average results but very poor calibration for individual results. Therefore, one takeaway is that we can use ICEs to estimate both individual and average estimands, whereas with ATEs we can only estimate average estimands with any accuracy; attempts to get at individual-level estimates through ATEs are likely to be incorrect. The last part of my talk uses an existing example from economics and politics on monitoring corruption to demonstrate the flexibility of the approach. I adapt ICE estimation to both binary and continuous treatments and to one-stage and two-stage IV-type approaches.
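
One stylized way to see the flavor of that regression-imputation result (again my own toy example, not one of the simulations from the talk) is to impute every missing potential outcome with the fitted value from a single linear regression when the true effects are heterogeneous: the average comes out right, but the individual-level estimates are dominated by noise.

    import numpy as np

    # Stylized illustration: regression imputation recovers the average effect
    # well, but individual-level effect estimates are very noisy.
    rng = np.random.default_rng(2)
    n = 2_000
    X = rng.normal(size=(n, 2))
    W = rng.integers(0, 2, size=n)
    tau = 1.0 + 0.5 * rng.normal(size=n)                 # heterogeneous true ICEs
    Y = X @ np.array([1.0, -1.0]) + tau * W + rng.normal(scale=1.0, size=n)

    # Fit one linear regression of Y on (1, W, X) and impute each unit's
    # missing potential outcome with its fitted value.
    D = np.column_stack([np.ones(n), W, X])
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    D1 = np.column_stack([np.ones(n), np.ones(n), X])    # everyone treated
    D0 = np.column_stack([np.ones(n), np.zeros(n), X])   # everyone untreated
    y1_hat = np.where(W == 1, Y, D1 @ beta)
    y0_hat = np.where(W == 0, Y, D0 @ beta)
    ice_hat = y1_hat - y0_hat

    print(f"Mean of true ICEs:       {tau.mean():.2f}")
    print(f"Mean of estimated ICEs:  {ice_hat.mean():.2f}")   # close to the truth
    rmse = np.sqrt(np.mean((ice_hat - tau) ** 2))
    print(f"RMSE for individual ICEs: {rmse:.2f}")            # about the outcome noise SD
    print(f"Corr(estimated, true ICE): {np.corrcoef(ice_hat, tau)[0, 1]:.2f}")  # low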

For more information and copies of the presentation slides and a rough draft of a paper describing the general model and framework, please see http://www.patricklam.org/research.html.

Posted by Konstantin Kashin at 10:54 PM | Comments (1)

App Stats: Pakes on "Moment Inequalities for Semiparametric Multinomial Choice with Fixed Effects"

We hope you can join us this Wednesday, April 17, 2013 for the Applied Statistics Workshop. Ariel Pakes, Professor of Economics from the Department of Economics at Harvard University, will give a presentation entitled "Moment Inequalities for Semiparametric Multinomial Choice with Fixed Effects". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"Moment Inequalities for Semiparametric Multinomial Choice with Fixed Effects"
Ariel Pakes
Department of Economics, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, April 17th, 2013 12.00 pm

Abstract:

We propose a new approach to identification for multinomial choice models with a group (or panel) structure. We take a standard random utility model of choice, where the utility for each choice is additively separable in a choice-specific fixed effect, a disturbance, and an index function of covariates and parameters. Observations in the same group are assumed to share the same fixed effects. Examples of this structure include: (i) Chamberlain's (1980) conditional likelihood estimator for panel data problems with choice-specific fixed effects and i.i.d. logistic disturbances, and (ii) models of product demand where markets are the grouping device, the within-group observations are consumers, and the choice-specific fixed effects represent product-level unobservables.

We place no restriction on the variance-covariance of the disturbance vector across choices. The only restriction on the disturbances is a group homogeneity assumption. The main cost of the semiparametric flexibility in our model is that the conditional moment inequalities will, in general, only partially identify the index function parameters. The advantages are that it: (i) is non-parametric in the joint distribution of the disturbance vector across choices, (ii) allows for incidental choice-specific effects whose cardinality can grow with sample size, and (iii) can be extended to allow for certain types of endogeneity.

Posted by Konstantin Kashin at 10:10 AM | Comments (0)

8 April 2013

App Stats: Lam on "Estimating Individual Causal Effects"

We hope you can join us this Wednesday, April 10, 2013 for the Applied Statistics Workshop. Patrick Lam, a Ph.D. candidate from the Department of Government at Harvard University, will give a presentation entitled "Estimating Individual Causal Effects". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"Estimating Individual Causal Effects"
Patrick Lam
Government Department, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, April 10th, 2013 12.00 pm

Abstract:

The literature on causal inference has focused primarily on estimating average treatment effects, which aggregate over many individual effects. However, this aggregation often misses treatment effect heterogeneity, which may be of extreme importance. In addition, researchers often estimate average effects but their real quantity of interest is individual effects. In this paper, I develop methods to estimate individual causal effects based on commonly used matching procedures. I show that predictive mean matching performs the best in imputing missing potential outcomes to estimate the individual effects. I then demonstrate the flexibility of estimating individual causal effects and how they can be used to explore questions of interest, recover any other causal quantity, and be adapted to more complicated data structures. I conclude with empirical examples from political science.

Posted by Konstantin Kashin at 12:41 AM | Comments (3)

1 April 2013

App Stats: Killewald on "His Gain, Her Pain? The Motherhood Penalty and the Fatherhood Premium within Coresidential Couples"

We hope you can join us this Wednesday, April 3, 2013 for the Applied Statistics Workshop. Sasha Killewald, Assistant Professor of Sociology at Harvard University, will give a presentation entitled "His Gain, Her Pain? The Motherhood Penalty and the Fatherhood Premium within Coresidential Couples". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"His Gain, Her Pain? The Motherhood Penalty and the Fatherhood Premium within Coresidential Couples"
Sasha Killewald
Department of Sociology, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, April 3rd, 2013 12.00 pm

Abstract:

Prior research on the association between parenthood and wages has focused on the individual level, documenting a substantial motherhood wage penalty and a smaller fatherhood premium. The majority of births, however, occur to coresidential couples, yet we know little about the within-couple association between the motherhood penalty and the fatherhood premium. Specialization suggests that women who experience the largest motherhood penalty will tend to be partnered with fathers with the largest premium. However, it is also possible that some couples are better able to defray the wage costs of parenthood for both parents. We bring a dyad perspective to the study of the interaction between parenthood and wages and use random-coefficients models to answer the following questions: 1) What is the average association between the motherhood penalty and the fatherhood premium within couples? 2) How does assortative mating on the basis of race, education, and post-parenthood specialization in paid and unpaid labor time contribute to this association?

Posted by Konstantin Kashin at 10:59 AM | Comments (0)

25 March 2013

App Stats: Fowler and Hall on "Do Legislators Cater to the Priorities of Their Constituents?"

We hope you can join us this Wednesday, March 27, 2013 for the Applied Statistics Workshop. Anthony Fowler and Andrew B. Hall, Ph.D. Candidates from the Department of Government at Harvard University, will give a presentation entitled "Do Legislators Cater to the Priorities of Their Constituents?". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"Do Legislators Cater to the Priorities of Their Constituents?"
Anthony Fowler and Andy Hall
Government Department, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, March 27th, 2013 12.00 pm

Abstract:

Republican and Democratic legislators vote differently on a large number of bills even when representing constituents of identical preferences. Because constituencies care about some issues more than others, representatives may give short shrift to the district's preferences on some topics while carefully mirroring them on others. The more a district cares about an issue, the more loyally we should see its legislators voting. As a consequence, we should expect the partisan gap in representation -- the difference in voting behavior between a Democrat and a Republican representing the same constituents -- to shrink on issues of greater concern to the district. We test this hypothesis in eight issue areas: agriculture, civil rights, defense, education, energy, public transportation, senior citizens' issues, and welfare. Contrary to expectation, we find little evidence that representational quality improves when constituents have strong personal interests. Across all issues examined, the representational gap between the parties is massive and does not shrink meaningfully in especially-interested districts.

Posted by Konstantin Kashin at 10:34 AM

11 March 2013

App Stats: Chamberlain on "Predictive Effects of Teachers and Schools on Test Scores, College Attendance, and Earnings"

We hope you can join us this Wednesday, March 13, 2013 for the Applied Statistics Workshop. Gary Chamberlain, Louis Berkman Professor of Economics from the Department of Economics at Harvard University, will give a presentation entitled "Predictive Effects of Teachers and Schools on Test Scores, College Attendance, and Earnings". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"Predictive Effects of Teachers and Schools on Test Scores, College Attendance, and Earnings"
Gary Chamberlain
Department of Economics, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, March 13, 2013 12.00 pm

Abstract:

I study predictive effects of teachers and schools on test scores in fourth through eighth grade and on outcomes later in life such as college attendance and earnings. The predictive effects have the following form: predict the fraction of a classroom attending college at age 20 given the test score for a different classroom in the same school with the same teacher, and given the test score for a classroom in the same school with a different teacher. I would like to have predictive effects that condition on averages over many classrooms, with and without the same teacher. I set up a factor model which, under certain assumptions, makes this feasible. Administrative school district data in combination with tax data were used to calculate estimates and do inference.

Posted by Konstantin Kashin at 4:14 AM

4 March 2013

App Stats: Goodman on "Seeing More in Data"

We hope you can join us this Wednesday, March 6, 2013 for the Applied Statistics Workshop. Alyssa Goodman, a Professor of Astronomy from the Harvard-Smithsonian Center for Astrophysics at Harvard University, will give a presentation entitled "Seeing More in Data". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"Seeing More in Data"
Alyssa A. Goodman
Harvard-Smithsonian Center for Astrophysics
CGIS K354 (1737 Cambridge St.)
Wednesday, March 6th, 2013 12.00 pm

Abstract:

Some scientists still think that good data visualization is only necessary when presenting work to "the public." In truth, thinking hard about how to learn the most from any data set should always involve some form of graph, map, chart, or other visual statistical display. This talk will demonstrate how visualization techniques that include so-called "linked views" offer new insights to researchers visualizing large and/or diverse data sets. In particular, the talk will highlight a few high-dimensional visualization examples where ideas about linked views first put forth by John Tukey are extended beyond two-dimensional displays and point clouds. Examples will be principally drawn from astronomy and medical imaging, and software highlighted will include the Universe Information System known as "WorldWide Telescope" (worldwidetelescope.org) and a new python-based linked-view system called "Glue" (glueviz.org).

Posted by Konstantin Kashin at 1:43 AM

26 February 2013

App Stats: Mozaffarian on "Estimating the Global Impact of Poor Dietary Habits on Chronic Diseases"

We hope you can join us this Wednesday, February 27, 2013 for the Applied Statistics Workshop. Dariush Mozaffarian, Associate Professor in the Department of Epidemiology at the Harvard School of Public Health, will give a presentation entitled "Estimating the Global Impact of Poor Dietary Habits on Chronic Diseases". A light lunch will be served at 12 pm and the talk will begin at 12.15.

"Estimating the Global Impact of Poor Dietary Habits on Chronic Diseases"
Dariush Mozaffarian
Department of Epidemiology, Harvard School of Public Health
CGIS K354 (1737 Cambridge St.)
Wednesday, February 27, 2013, 12.00pm

Abstract:

Nearly every nation in the world is undergoing rapid epidemiologic transition toward noncommunicable chronic diseases (NCDs) including cardiovascular disease (CVD), obesity, diabetes, and cancers. Numerous organizations including the United Nations, World Health Organization, US Centers for Disease Control and Prevention, and other national and international organizations have emphasized the importance of dietary habits as a key risk factor for NCDs. Yet, the burdens of suboptimal dietary habits on NCDs globally, as well as heterogeneity in these burdens by region, country, age, and sex, are not established. Quantification of these burdens has been limited by inadequate or absent data on dietary habits in many nations, not only for each country as a whole, but also for age- and sex-specific strata. As part of our work in the 2010 Global Burden of Diseases Nutrition and Chronic Diseases Group, we systematically identified and obtained data on national and subnational individual-level surveys of dietary consumption worldwide; and used a Bayesian hierarchical model to evaluate and account for differences in comparability, assessment methods, representativeness, and missingness. We also quantified effects of dietary habits on NCDs, including differences by age, in new meta-analyses. We compiled additional data to quantify the alternative optimal distribution of key dietary risk factors, and the numbers of cause-specific deaths by country, age, and sex. Using this compilation of global data, we used comparative risk assessment to quantify the impacts of current dietary habits on NCDs in each nation around the world. The case of sugar-sweetened beverages (SSBs) and CVD, adiposity-related cancers, and diabetes will be presented as an example of our newest findings.

Posted by Konstantin Kashin at 12:43 AM