30 November 2009
We hope you can join us this Wednesday, December 2nd for the final Applied Statistics Workshop of the term, when we will have Adam Glynn (Department of Government) presenting his talk entitled "What Can We Learn with Statistical Truth Serum?" Adam has provided the following abstract:
Due to the inherent sensitivity of many survey questions, a number of researchers have adopted indirect questioning techniques in order to minimize bias due to dishonest or evasive responses. Recently, one such technique, known as the list experiment (and also known as the item count technique or the unmatched count technique), has become increasingly popular due to its feasibility in online surveys. In this talk, I will present results from two studies that utilize list experiments and discuss the implications of these results for the design and analysis of future studies. In particular, these studies demonstrate that, when the key assumptions hold, standard practice ignores relevant information available in the data, and when the key assumptions do not hold, standard practice will not detect some detectable violations of these assumptions.
The workshop will begin at 12 noon with a light lunch and wrap up by 1:30. We meet in room K354 of CGIS Knafel (1737 Cambridge St). We hope you can make it.
A paper just published in PNAS finds that armed conflict in Africa in recent decades has been more likely in hotter years, and projects that warming in the next twenty years will result in roughly 54% more conflicts and almost 400,000 more battle deaths. This is an important paper and it will probably attract significant attention from the media and policymakers. I think it's a good paper too -- seems fairly solid in the empirics, nice presentation, and admirably forthright about the limitations of the study. I'll explain a bit about what the paper does and what questions it leaves open.
To establish the historical connection between temperature and conflict in Africa, the authors conduct a panel regression with country-level fixed effects, meaning that they are examining whether conflict is more likely in a given country in unusually hot years. Their main model also includes country time trends, so it seems that they are not merely capturing the fact that the 1990s had more conflict than the 1980s due to the end of the Cold War and also happened to be hotter due to the overall warming trend. In a supplement that I was not able to access, they show that the correlation between temperature and conflict is robust to a number of other specifications. So the pattern of more conflict in especially-hot years seems fairly robust. (Arguments to the contrary very welcome.) They then link this model up to a climate model to produce predictions of conflict in the next twenty years, under the assumption that the relationship between temperature and conflict will remain the same in the future.
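The fixed-effects logic here is easy to sketch on simulated data. Below is a generic numpy illustration of the within estimator (demeaning by country sweeps out country-level confounders); the variable names and effect sizes are invented, and this is only a sketch of the general approach, not the authors' actual specification:

```python
import numpy as np

rng = np.random.default_rng(0)
n_countries, n_years = 40, 20
country = np.repeat(np.arange(n_countries), n_years)

# Simulated panel: a country-level effect confounds the temperature-conflict link
alpha = rng.normal(0, 1, n_countries)                  # country fixed effects
temp = rng.normal(25, 2, n_countries * n_years) + alpha[country]
beta_true = 0.5                                        # true within-country effect
conflict = beta_true * temp + alpha[country] + rng.normal(0, 1, temp.size)

def demean_by(x, g):
    """Subtract group means: the 'within' transformation."""
    means = np.bincount(g, weights=x) / np.bincount(g)
    return x - means[g]

# Demeaning by country removes alpha, isolating unusually-hot-year variation
t_dm = demean_by(temp, country)
c_dm = demean_by(conflict, country)
beta_hat = (t_dm @ c_dm) / (t_dm @ t_dm)
print(round(beta_hat, 2))
```

In this simulation the pooled regression without demeaning would be biased upward, since hot countries also have higher baseline conflict by construction; the within estimator recovers the true coefficient.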
Given that hot years saw more conflict in the 1980s and 1990s, should we expect a hotter Africa to have more conflict in the future? To some extent this depends on why hot years saw more conflict in the past. The authors note that hotter temps can depress agricultural productivity by evaporating more water and speeding up crop development, and they view this as the most likely channel by which hot temperatures lead to more conflict. They also note that hot weather has been shown in other settings to increase violent crime and make people less productive, and they admit that they can't rule out these channels in favor of the agricultural productivity story. If it's a matter of agricultural productivity, then adjusting farming techniques or improving social safety nets could avoid some of the projected conflict; if the main issue is that especially hot weather makes people rash and violent then there may not be much to do other than reverse global warming itself (and possibly provide air conditioning to potential insurgents). Overall the "policy implication" of the paper seems to be that changes that should happen anyway are now more urgent.
To be more confident about the paper's projection I'd like to see more detail about the mechanism -- even just anecdotal evidence. I'd also like to hear about whether politics in Africa seems to have changed in any way that would make a model of weather and conflict from the 1980s and 1990s less applicable now and in the future.
28 November 2009
Slightly off-topic insights from Adam Gopnik:
All this is true, and yet the real surprise of the cookbook, as of the constitution, is that it sometimes makes something better in the space between what's promised and what's made...Between the rule and the meal falls the ritual, and the real ritual of the recipe is like the ritual of the law; the reason the judge sits high up, in a robe, is not that it makes a difference to the case but that it makes a difference to the clients. The recipe is, in this way, our richest instance of the force and the power of abstract rules.
There's a research agenda somewhere in those sentences, I believe. Rules lead to rituals and yet rules are simply codified rituals. A small point, perhaps a bit obvious, yet it speaks more broadly to social science research. It also highlights where qualitative scholars get it right: looking for correlations between rules (or structure?) and outcomes often averages out the most intriguing part of the story.
(hat tip, MR)
Judea Pearl describes his new article Causal inference in statistics: An Overview as "a recent submission to Statistics Surveys which condenses everything I know about causality in only 40 pages." That seemed like a bold claim, but after reading it I'm sold. I don't come from Pearl's "camp" per se, but I found this a really impressive overview of his approach to causation. His overtures to folks like me who use the potential outcomes framework were much appreciated, although it is clear throughout that there is still intense debate on some of the issues. The bottom line: if you've ever wondered what the structural equation modeling approach to causal inference is all about, this is your one-stop, must-read introduction (and an insightful, engaging, and thorough one at that).
26 November 2009
I went to law school before I ended up as a graduate student, so I read with some interest a recent essay by Vanderbilt Law Professor Herwig Schlunk entitled "Mamas, don't let your babies grow up to be...lawyers" (an online version is at the Wall Street Journal's Law Blog).
Maybe the title gives it all away, but the gist is that a legal education might not always pay off. While I wholeheartedly agree with this, I'm less enthusiastic about the author's methodology. The author essentially constructs three hypothetical law students: "Also Ran," a legal slacker who attends a lower-ranked law school; "Solid Performer," a middling kind of person who attends a middling kind of law school; and "Hot Prospect," a high-flying and well-placed law student. The essay then more or less "follows" them through their legal "careers" to see if their discounted expected gains in salary match what they "paid" in terms of opportunity costs, tuition, and interest on their student loans. (I know, I'm using a lot of air quotes here.) Unsurprisingly, a legal education isn't a very good investment for any of the individuals.
What's interesting about the paper is that it's essentially an exercise in counterfactuals -- what Also Ran would have earned after going to law school, what Hot Prospect would have made had she not gone, etc., etc. To that extent, it's very fun to think about. But, on the flip side, that's kind of what it is -- a thought experiment. Maybe an interesting extension would be to do an empirical causal analysis -- maybe matching pre-law undergraduates along a slew of covariates and then seeing how the "treatment" of law school affects or does not affect their earnings. I'd certainly find that a lot more persuasive (although I imagine that the kind of data that you'd need to pull this off would be well-nigh impossible to collect).
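That matching design can be sketched in a few lines. Everything below -- the covariates, the selection rule, the 0.1 "effect" of law school on log earnings -- is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Hypothetical covariates: college GPA and a standardized LSAT score
gpa = rng.normal(3.2, 0.4, n)
lsat = rng.normal(0, 1, n)

# Selection into law school depends on the covariates (selection on observables)
p_law = 1 / (1 + np.exp(-(lsat + 2 * (gpa - 3.2))))
law = rng.random(n) < p_law

# Simulated log earnings: true effect of law school set to 0.1
earn = 10 + 0.3 * gpa + 0.2 * lsat + 0.1 * law + rng.normal(0, 0.3, n)

# One-nearest-neighbor matching with replacement: for each law student,
# find the closest non-law student in standardized covariate space
X = np.column_stack([gpa / gpa.std(), lsat / lsat.std()])
treated, control = np.where(law)[0], np.where(~law)[0]
dist = ((X[treated, None, :] - X[None, control, :]) ** 2).sum(axis=2)
matches = control[dist.argmin(axis=1)]

att = (earn[treated] - earn[matches]).mean()   # effect of law school on the treated
print(round(att, 2))
```

The naive treated-minus-control difference in means would overstate the effect here, because the students who select into law school have higher GPAs and LSAT scores; matching on those covariates recovers something close to the truth.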
Last Thursday, I posted about the recent government recommendations regarding breast cancer screening in women ages 40-49. At least one of you wrote me to say that one of my calculations might have been slightly off (it was), and so I did some more investigation on this issue, as well as on new recommendations on cervical pap smears. (Sorry -- it took me a few days to get around to all of this!)
To back up a second, here's what the controversial new recommendations (made by the US Preventive Services Task Force) say:
So all of this got me thinking that this could be a straightforward application of the "rare diseases" example of Bayes' Rule (which many people see in their first probability course). I did some (more) digging around in one of the government reports, and here's how the probabilities break down:
Now bear with me while I go through the mechanics of Bayes' Rule. For women in their 40s, here are the pertinent probabilities:
P(cancer) = 1/69
P(no cancer) = 1-1/69 = 68/69
P(positive|no cancer) = 97.8/1000
P(negative|no cancer) = 1 - 97.8/1000 = 902.2/1000
P(positive|cancer) = 1 - 1/1000 = 999/1000
And for women in their 50s:
P(cancer) = 1/38
P(no cancer) = 1-1/38 = 37/38
P(positive|no cancer) = 86.6/1000
P(negative|no cancer) = 1 - 86.6/1000 = 913.4/1000
P(positive|cancer) = 1 - 1.1/1000 = 998.9/1000
The probability we are interested in is the probability of cancer given that a woman has tested positive, P(cancer|positive). Using Bayes' Rule:
P(cancer|positive) = P(positive|cancer)*P(cancer)/P(positive)

where the denominator expands as

P(positive) = P(positive|no cancer)*P(no cancer) + P(positive|cancer)*P(cancer)
We now have all of the moving parts. Let's first look at a woman in her 40s:
P(cancer|positive) = (999/1000*1/69)/(97.8/1000*68/69+999/1000*1/69), which works out to about 0.13, or 13%.
and for a woman in her 50s:
P(cancer|positive) = (998.9/1000*1/38)/(86.6/1000*37/38+998.9/1000*1/38), which works out to about 0.24, or 24%.
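These two calculations are mechanical enough to check in a few lines of Python:

```python
# Posterior probability of cancer given a positive mammogram, via Bayes' Rule
def p_cancer_given_positive(p_cancer, p_pos_no_cancer, p_pos_cancer):
    p_no_cancer = 1 - p_cancer
    numerator = p_pos_cancer * p_cancer
    denominator = numerator + p_pos_no_cancer * p_no_cancer
    return numerator / denominator

forties = p_cancer_given_positive(1 / 69, 97.8 / 1000, 999 / 1000)
fifties = p_cancer_given_positive(1 / 38, 86.6 / 1000, 998.9 / 1000)
print(round(forties, 3), round(fifties, 3))  # roughly 0.13 for the 40s, 0.24 for the 50s
```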
All this very simple analysis suggests is that mammograms do appear to be a less reliable test for younger women. Whether these recommendations make sense is another matter. Insurance companies might use these recommendations as an excuse to deny coverage for women with higher than average risk. In addition, as some of you noted, a 13% risk is nothing to sneeze at, and it's much higher than the 1/69 rate for women in their 40s (though comparable to a woman's lifetime 12% risk). Lastly, I also refer folks to Andrew Thomas's post, where he discusses the metrics used by the task force and notes that the confidence interval for women in their 40s lies completely within the confidence interval for women in their 50s.
I also did some very brief investigation regarding the new cervical cancer guidelines. For those of you unfamiliar with this story, the American College of Obstetricians and Gynecologists recently issued recommendations that women up to the age of 21 no longer receive pap tests and that older women receive paps less often -- also advice contrary to what women have been told for decades.
It was much harder to pinpoint the false positive and false negative rates involved with pap tests (a lot of medical jargon, different levels of detection, and human and non-human error made things confusing). I did manage to find this article in the NEJM. The researchers there looked at women ages 30 to 69 (a different subgroup, unfortunately, from the under 21 group), but they do report that the sensitivity of Pap testing was 55.4% and the specificity was 96.8%. This corresponds to a false negative rate somewhere around 44.6% and a false positive rate somewhere around 3.2%. (Other references I've seen elsewhere hint that the false negative rate could be anywhere from 15 to 40%, depending on the quality of the lab and the cells collected.)
The other thing to note is that cervical cancer is very rare in young women and, unlike other forms of cancer, it grows relatively slowly. According to the New York Times, 1-2 cases occur per 1,000,000 girls ages 15 to 19. This, combined with the high false negative rates, resulted in the ACOG recommendations.
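Plugging the NEJM sensitivity and specificity together with the NYT prevalence figure into Bayes' Rule makes the point starkly. This is a back-of-the-envelope sketch: it ignores the slow-growth issue entirely, and applying the sensitivity/specificity measured in 30-to-69-year-olds to teenagers is itself an extrapolation:

```python
# Positive predictive value of a Pap test for girls ages 15-19
sensitivity = 0.554                  # NEJM figure, measured in ages 30-69
specificity = 0.968                  # NEJM figure, measured in ages 30-69
prevalence = 1.5 / 1_000_000         # midpoint of the 1-2 per million NYT figure

ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)
print(f"{ppv:.6f}")  # on the order of a few in 100,000
```

In other words, virtually every positive result in this age group would be a false positive, which helps explain the ACOG recommendation.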
My sense is that the ACOG recommendations are on more solid footing, but if people have comments, I'd be keen to hear them.
21 November 2009
Network methods and methods for causal inference are popular areas of research in social sciences. Often they are considered separately due to a fundamental difference in their basic assumptions. Network methods assume that individual units are interdependent, that one network member's actions have consequences for other members of the network. Methods for causal inference, in contrast, often rest on the Stable Unit Treatment Value Assumption (SUTVA). SUTVA requires that the response of a particular unit depends only on the treatment to which he himself was assigned, not the treatments of others around him. It is a useful assumption, but as with all assumptions, there are circumstances in which it is not credible. What can be done in these circumstances?
When researchers suspect that there may be spillover between units in different treatment groups, they can change their unit of analysis. Students assigned to attend a tutoring program to improve their grades might interact with other students in their school who were not assigned to the tutoring program and influence the grades of these control students. To enable causal inference, the analysis might be completed at the school level rather than the individual level. SUTVA would then require no interference across schools, a more plausible assumption than no interference across students. However, this approach is somewhat unsatisfactory. It generally entails a sharp reduction in sample size. More importantly, it changes the question that we can answer: no longer can we learn about the performance of individual students, we can only learn about the performance of schools.
I have not come across a more satisfactory statistical solution for circumstances in which SUTVA is violated. In an interesting new paper, Manski provides some bounds on treatment effects in the presence of social interactions. Unfortunately, these bounds are often uninformative, since when SUTVA is violated random assignment to treatment arms does not identify treatment effects. Sinclair suggests using multi-level experiments to empirically identify spillover effects. This approach (which relies on multiple rounds of randomization to test if treatment effects are overidentified, as we would expect if there were no spillovers) is appealing, as the process of diffusion within networks is of great scientific interest. However, it does not help identify treatment effects when spillovers are present. Neither can we simply assume that effects estimated under SUTVA represent upper bounds on the true effects, because it is possible that interference across units intensifies the treatment effects rather than diluting them. Manski's paper seems like a useful foray into an open area of research. Let me know of other work on methods for causal inference in network-like situations where interference across units is likely.
17 November 2009
I have been toying around with dynamic panel models from the econometrics literature and I have hit my head up against a key set of assertions. First, a quick setup. The idea with these models is that we have a set of units which we measure at different points in time. For instance, perhaps we survey a group of people multiple times in the course of an election and ask them how they are going to vote, whether they plan to vote, how they rate the candidates, and so on. We might then want to know how these answers vary over time or with certain covariates.
Here is a typical model:

y_it = gamma*y_i,t-1 + x_it'beta + alpha_i + epsilon_it
There are two typical features of these models that seem relevant. First, most include a lagged dependent variable (LDV) to account for persistence in the responses. If I was going to vote for McCain the last time you called, I'll probably still want to do that this time. Makes sense. Second, we include a unit-specific effect, alpha, to account for all other relevant factors. Dynamic panel models tend to identify their effects with a simple differencing by running the following model:
y_it - y_i,t-1 = gamma*(y_i,t-1 - y_i,t-2) + (x_it - x_i,t-1)'beta + (epsilon_it - epsilon_i,t-1)

This eliminates the unit-specific effect by the differencing, but our parameters remain, ready to be estimated. I should note that there are some identification issues left to solve and the differences between estimators in this field mostly have to do with how to instrument for the differenced LDV.
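To make the differencing-plus-instrumenting strategy concrete, here is a numpy sketch on simulated data with no covariates. It instruments the differenced LDV with the second lag of the level (the Anderson-Hsiao idea); it is an illustration of the logic, not a faithful implementation of any particular published estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 5000, 10
gamma = 0.5                                      # true LDV coefficient

alpha = rng.normal(0, 1, n)                      # unit-specific effects
y = np.zeros((n, T))
y[:, 0] = alpha + rng.normal(0, 1, n)
for t in range(1, T):
    y[:, t] = gamma * y[:, t - 1] + alpha + rng.normal(0, 1, n)

# First differences remove alpha, but the differenced lag is correlated
# with the differenced error, so OLS on the differences is inconsistent
dy_t = (y[:, 2:] - y[:, 1:-1]).ravel()           # Delta y_t for t = 2..T-1
dy_lag = (y[:, 1:-1] - y[:, :-2]).ravel()        # Delta y_{t-1}
z = (y[:, :-2]).ravel()                          # instrument: the level y_{t-2}

gamma_ols = (dy_lag @ dy_t) / (dy_lag @ dy_lag)  # badly biased
gamma_iv = (z @ dy_t) / (z @ dy_lag)             # consistent
print(round(gamma_ols, 2), round(gamma_iv, 2))
```

The level y_{t-2} works as an instrument because it is uncorrelated with the differenced error epsilon_it - epsilon_i,t-1 but still predicts the differenced lag.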
Reading these models, I have two questions. One, is there a reason to expect that we need both an LDV and a unit-specific effect? This means that we expect that there is a shock to a unit's dependent variable that is constant across periods. I find this a strange assumption. I understand a unit-specific shock to the initial level and then using the LDV thereafter, but in every period?
Two, the entire identification strategy here is based on the additivity of the model, correct? If we were to draw a directed acyclic graph of these models, it would be trivially obvious that we could never identify this model nonparametrically. I understand that we sometimes need to use models to identify effects, but should these identifications depend so heavily on the functional form? It seems that this problem is tied up in the first. We are allowing for the unit-specific effect as a way to free the model of unnecessary assumptions, yet this forces our hand into making different, perhaps stronger assumptions to get identification.
Please clear up my confusion in the comments if you are more in the know.
16 November 2009
Please join us at the Applied Statistics workshop this Wednesday, November 18th at 12 noon when we will be happy to have Jim Greiner of the Harvard Law School presenting on "Exit Polling and Racial Bloc Voting: Combining Individual-Level and R x C Ecological Data." Jim has provided a companion paper with the following abstract:
Despite its shortcomings, cross-level or ecological inference remains a necessary part of many areas of quantitative inference, including in United States voting rights litigation. Ecological inference suffers from a lack of identification that, most agree, is best addressed by incorporating individual-level data into the model. In this paper, we test the limits of such an incorporation by attempting it in the context of drawing inferences about racial voting patterns using a combination of an exit poll and precinct-level ecological data; accurate information about racial voting patterns is needed to trigger voting rights laws that can determine the composition of United States legislative bodies. Specifically, we extend and study a hybrid model that addresses two-way tables of arbitrary dimension. We apply the hybrid model to an exit poll we administered in the City of Boston in 2008. Using the resulting data as well as simulation, we compare the performance of a pure ecological estimator, pure survey estimators using various sampling schemes, and our hybrid. We conclude that the hybrid estimator offers substantial benefits by enabling substantive inferences about voting patterns not practicably available without its use.
The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.
12 November 2009
Today I'm going to talk about a particular problem from my own research and will outline a method for choosing variances in generalized linear models (GLMs), but I am also asking a question.
The standard setup of GLMs is (roughly) the following. One hypothesizes that the conditional mean of the outcome variable (y), E[y|x], can be linked to a linear predictor x'b, or:

mu(E[y|x]) = x'b
The function μ is referred to as the link function. Common choices for μ include the identity and the log link. One common question is why one would choose to use a GLM with, for example, a log link instead of estimating via OLS the regression model:

ln(y) = x'b + e
There are two principal objections to the OLS method. First, in the presence of heteroskedasticity it is difficult (though possible) to transform predicted values of ln(y) into predicted values of y. Second, the OLS method throws out any data coming from observations with y=0.
Unfortunately, the choice of a link function is comparatively easy (in my view) compared with the next step of choosing an appropriate function for the variance of y given x, which must be prespecified in most GLMs*. In my work I have focused on choosing variance functions that are proportional to some power of the mean:

Var(y|x) = c * E[y|x]^k
The trick, then, is to choose the correct power with various powers of the mean corresponding to Poisson (k=1), Gamma (k=2), and Wald (k=3), for example. In health econometrics this can be accomplished by using a modified Park test (due to Manning and Mullahy). In this procedure one first computes tentative parameter estimates for a GLM based on one's prior beliefs about the appropriate variance function (I typically use Gamma-like regressions for this). The linear predictors from the tentative regression can be used to get raw-scale residuals by applying the inverse link function. The modified Park test is to then regress the squared raw-scale residuals on a constant and the linear predictor in a GLM with a log link and the coefficient on the linear predictor then indicates which variance structure is most appropriate.
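Here is a rough numpy-only sketch of this procedure on simulated Gamma-like data. Two simplifications relative to the text, both mine for self-containedness: the tentative fit is OLS of log(y) on x (a stand-in for an actual Gamma GLM), and the final step regresses log squared residuals on the log fitted mean by OLS rather than running a log-link GLM:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
x = rng.normal(0, 1, n)
mu = np.exp(1.0 + 0.5 * x)            # true conditional mean, log link
shape = 5.0
y = rng.gamma(shape, mu / shape)      # Var(y|x) = mu**2 / shape, so true k = 2

# Tentative fit: OLS of log(y) on x, then invert the log link
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
yhat = np.exp(X @ b)                  # raw-scale fitted values

# Park-style step: the slope of log(residual^2) on log(yhat) estimates k
resid2 = (y - yhat) ** 2
Z = np.column_stack([np.ones(n), np.log(yhat)])
k_hat = np.linalg.lstsq(Z, np.log(resid2), rcond=None)[0][1]
print(round(k_hat, 1))                # should land near k = 2 (Gamma)
```

An estimated k near 1 would instead point toward Poisson-like variance, and k near 3 toward Wald-type variance, per the mapping above.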
Now for the question. In health utilization data one often has a large number of zeros; for example, less than 10% of my sample uses mental health services in any given year. While GLMs are typically well behaved, in the presence of so many zeros this need not be the case. One common practice is then to use a "two part" model in which one uses an initial probit or logit regression to estimate the probability of any utilization and then estimates the second stage GLM model among users only. My question relates to the appropriate sample to use for the modified Park test -- users or everybody? It turns out that in this case it matters, since when I look at everyone I get evidence in support of Gamma-like regressions (i.e. k=2 in my Park test), but when I only consider users in the Park test I get estimates of k=2.6, or so, which is more consistent with Wald-type variances.
My strong suspicion is that the latter approach is more appropriate since the GLM is only estimated among users, but I've hunted in the literature and found no specific advice on this point and many examples that seem to indicate that the test should be done on everybody.
* One exception is the Extended Estimating Equations method proposed by Basu and Rathouz (implemented as pglm in Stata).
There was a lot of press on the 1,000+ page length of the House health care bill, H.R. 3962. That got me thinking... didn't we hear the same thing about the stimulus bill and the Patriot Act? Aren't most "controversial" bills also very long?
It would make sense. Controversial bills require a lot more ink -- pork, special cases, exceptions -- to reel in support. Uncontroversial bills can be written succinctly and pass as is.
To assess this I scraped bills from OpenCongress, which maintains the full text, voting results and amendment history of House and Senate Resolutions. You can even comment on specific portions of bills. There's already a bunch of neat comments on potential loopholes in H.R. 3962.
I downloaded the text and voting results for all 152 House resolutions passed by the 111th House. A boxplot of page length against support appears below. Each page length group represents roughly 20% of House resolutions. The plot shows the suspected trend, that longer bills have less support. One-page bills almost always pass unanimously!
11 November 2009
Brandon Stewart pointed me to an interesting blog post by Andrew Gelman that touches on the issue of explaining the "causes of effects." The basic point is that "why" questions are difficult to answer in a potential outcomes framework but often we really care about them. Some folks in political science have gone so far as to argue that researchers using "qualitative" methods are more inclined (and better able) to tackle these "why" questions than their "quantitative" colleagues who mostly focus on "effects of causes."
This has been on my mind lately -- as part of a class in the statistics department, I've had several conversations with Don Rubin about how retrospective "case-control" studies might fit into the potential outcomes framework. The goal of the medical researchers that execute these studies is usually a "why" question: why did an outbreak of rare disease X occur, which genes might cause breast cancer, etc. Case-control studies and their variants are great for searching over a number of possible causes and pulling out the ones that have strong associations with the outcome, but they aren't so great for estimating treatment effects. Rubin suggests that the proper way to proceed is probably to first use a case-control study to search over a number of possible causes and then estimate treatment effects for the most likely causes using a different sampling method (matched sampling for situations where the research has to be observational, experimentation when it's possible). It seems like this already happens to some extent in biostatistics and epidemiology and it also happens informally in political science.
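The "searching for strong associations" step in a case-control study reduces to computing odds ratios from 2x2 tables of exposure by case status. A toy sketch, with counts invented purely for illustration:

```python
# Odds ratio from a 2x2 case-control table (all counts are made up)
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    return (exp_cases / unexp_cases) / (exp_controls / unexp_controls)

candidates = {
    # exposure: (exposed cases, unexposed cases, exposed controls, unexposed controls)
    "well water": (30, 70, 10, 90),
    "raw milk": (12, 88, 11, 89),
}
for name, counts in candidates.items():
    print(name, round(odds_ratio(*counts), 2))
```

An odds ratio well above 1 (well water, in this made-up example) flags a candidate cause worth a follow-up study; it is an association, not yet a treatment effect, which is exactly the distinction Rubin draws.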
I think this formulation suggests that answering a "why" question requires both "causes of effects" and "effects of causes" approaches; we need to search over a number of possible causes to identify likely causes, but we also need to test the effectiveness of each likely cause before we can say much about the causal effect. We probably still can't answer questions like "what caused World War I" but maybe this gets us somewhere with more tractable types of "why" questions.
7 November 2009
A friend recently pointed me to a 2007 New Republic article in which the author, Noam Scheiber, argues that the "Freakonomics" phenomenon is lamentable because it represents a trend toward research in which clever identification strategies are prized over attempts to answer what Scheiber calls "truly deep questions." Although two years and the publication date of a second Levitt and Dubner book have since passed, the article caught my attention because I have been considering a related issue of late. We are all well aware of how difficult it is to make causal inferences in the social sciences, so it is not surprising that researchers are drawn to settings in which some source of exogenous variation allows for identification of the influence of a specific causal factor. In fact, progress on those "truly deep questions" depends in part on this type of work. However, focus on clean identification has some potentially negative implications. Scheiber names one: answering questions of peripheral interest. A second, which is of greater concern for me, is concentrating on population subgroups that may or may not be of scientific interest in and of themselves and that, in either case, are unable to provide direct insights into broader population dynamics.
Thanks to Imbens and Angrist, we know that even when it is not possible to identify the population average effect of a "treatment" (i.e., causal factor of interest) on a given outcome, it is often possible to identify a "local average treatment effect," that is, the average effect of a treatment for the subpopulation whose treatment status is affected by changes in the exogenous regressor. This subpopulation is composed of so-called "compliers," who will take the treatment when assigned to take it and will not when they are not. Sometimes this subpopulation is of scientific or policy interest (for example, we may be interested in knowing the effect of additional schooling on earnings for those students who might drop out of high school but for compulsory education laws). Oftentimes, it is not. In contrast, the broader population and the portion of the population that receives treatment are almost always of interest. These groups are certainly policy-relevant (it would be misleading to project the effect of a drug on public health based only on the drug's effect amongst those who were induced to take the drug) and they are needed to generate "stylized facts" that help us organize our understanding of the social world. (Also, these groups can be observed whereas compliers are not a generally identified subpopulation.)
Unfortunately, when treatment effects are heterogeneous, the identified local average effect does not provide direct information about the wider population. This is problematic since treatment effects are likely to be heterogeneous in social science applications. In fact, this heterogeneity is one of the reasons why identifying causal effects is so difficult (individuals' self-selection into a treatment status based in part on anticipated treatment effects induces endogeneity problems).
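A quick simulation shows the gap. The compliance shares and effect sizes below are made up; the point is only that the Wald/IV estimator recovers the compliers' average effect, not the population average, when effects differ across groups:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Compliance types: compliers take treatment iff assigned; the others ignore assignment
types = rng.choice(["complier", "always", "never"], size=n, p=[0.4, 0.3, 0.3])
z = rng.random(n) < 0.5                                   # random assignment
d = np.where(types == "always", True,
             np.where(types == "never", False, z))        # realized treatment

# Heterogeneous effects: compliers happen to benefit least
effect = np.where(types == "complier", 1.0,
                  np.where(types == "always", 3.0, 2.0))
y = rng.normal(0, 1, n) + effect * d

# Wald estimator: intent-to-treat effect divided by the compliance difference
late_hat = (y[z].mean() - y[~z].mean()) / (d[z].mean() - d[~z].mean())
ate = effect.mean()                                       # population average effect
print(round(late_hat, 1), round(ate, 1))
```

Here the IV estimate converges to the compliers' effect of 1.0 while the population average effect is 1.9, so extrapolating the local estimate to the whole population would be badly misleading.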
A number of demographers have discussed the problem of extrapolating local average treatment effect estimates to the broader population. Greg Duncan, in his presidential address to the Population Association of America, stated that although causal inference is "often facilitated by eschewing full population representation in favor of an examination of an exceedingly small but strategically selected portion of a general population with the 'right kind' of variation in the key independent variable of interest.... a population-based understanding of causal effects should be our principal goal." Robert Moffitt writes that although "some type of implicit weighting is needed" to help us understand how to trade off internal and external validity, "this problem has not really been addressed in the applied research community." Some researchers have suggested using bounds for average treatment effects that are not point-identified (for example, Manski). Of course, the usefulness of bounding techniques depends on the tightness of the bounds, which in turn depends on what assumptions we are willing to impose -- and it is exactly scholars' discomfort with prevailing assumptions (e.g., lack of correlation between the error and the treatment indicator) that drove the current focus on non-representative population subgroups. It seems to me that there is still work to be done to connect subpopulation causal estimates to broader population trends. I would be interested to hear of work in this area that you think is promising.
3 November 2009
I hope you can join us at the Applied Statistics Workshop this Wednesday, November 4th, when we will be happy to have Edo Airoldi, Assistant Professor in the Department of Statistics here at Harvard. Edo will be presenting a talk entitled "A statistical perspective on complex networks" for which he has provided the following abstract:
Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of science, as many scientific inquiries involve collections of measurements on pairs of objects. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. In this talk, I will review a few ideas that are central to this burgeoning literature. I will emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. I will conclude by describing open problems and challenges for machine learning and statistics.