30 April 2007
The final session of the Applied Statistics workshop will be held this week. We will present a talk by Adam Glynn, assistant professor of political science at Harvard. Professor Glynn received his Ph.D. in statistics from the University of Washington. His research and teaching interests include political methodology, inference for combined aggregate and individual level data, causal inference, and sampling design. His current research involves optimal sampling design conditional on aggregate data and the use of aggregate data for the reduction of estimation error.
Professor Glynn will present a talk entitled "Alleviating Ecological Bias in Generalized Linear Models with Optimal Subsample Design." A background paper is posted on the course website. The presentation will be at noon on Wednesday, May 2 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided.
25 April 2007
Tomorrow afternoon, the Harvard-MIT Positive Political Economy seminar will present a talk by Robert Erikson, professor of political science at Columbia University, entitled "Are Political Markets Really Superior to Polls as Election Predictors?" The seminar will meet on Thursday, April 26 at 4:30 in room N354 at CGIS North (this is also the room where the Applied Statistics Workshop meets on Wednesdays). An abstract follows on the jump:
Election markets have been praised for their ability to forecast election outcomes, and to forecast better than trial-heat polls. This paper challenges that optimistic assessment of election markets, based on an analysis of Iowa Electronic Market (IEM) data from presidential elections between 1988 and 2004. We argue that it is inappropriate to naively compare market forecasts of an election outcome with exact poll results on the day prices are recorded: market prices reflect forecasts of what will happen on Election Day, whereas trial-heat polls register preferences on the day of the poll. We then show that when poll leads are properly discounted, poll-based forecasts outperform vote-share market prices. Moreover, we show that win-projections based on the polls dominate prices from winner-take-all markets. Traders in these markets generally see more uncertainty ahead in the campaign than the polling numbers warrant; in effect, they overestimate the role of election campaigns. Reasons for the performance of the IEM election markets are considered in concluding sections.
24 April 2007
Several units on campus are sponsoring a lecture series by Michael Stein, professor of statistics and director of the Center for Integrating Statistical and Environmental Science at the University of Chicago. He will be talking about issues in space-time statistical modeling. There will be three lectures from April 25-27, but the lectures are at different times and locations; click on the links for an abstract of the lecture:
Models and Diagnostics for Spatial and Spatial-Temporal Processes
Wednesday, April 25: 3:30-5:00
HSPH Kresge G1
(simulcast to CGIS N031)
Models and Diagnostics for Spatial and Spatial-Temporal Processes
Thursday, April 26: 3:30-5:00
HSPH Kresge G2
(simulcast to CGIS N031)
Statistical Processes on a Global Scale
Friday, April 27: 11:00-12:00
(simulcast to HSPH Kresge G3)
23 April 2007
The American Economic Association has announced that this year's John Bates Clark Medal has been awarded to Susan Athey, professor of economics here at Harvard. The Clark Medal is awarded every other year to an American economist under the age of 40 who has made a significant contribution to economic thought. Previous winners include Kenneth Arrow, Dale Jorgenson, James Heckman, Jerry Hausman, and (most recently) Daron Acemoglu. Professor Athey stands out in one respect, however; she is the first woman to be awarded the Clark Medal (and about time, too!). For more information, see the AEA announcement or coverage in the Harvard Crimson.
This week, the Applied Statistics Workshop will present a talk by John Campbell, the Morton L. and Carole S. Olshan Professor of Economics at Harvard University. Professor Campbell received his Ph.D. from Yale University and served on the faculty at Princeton before coming to Harvard in 1994. He is the author or editor of four books, and he has published widely in journals in economics and finance, including the American Economic Review, Econometrica, and the Quarterly Journal of Economics. He recently served as the president of the American Finance Association.
Professor Campbell will present a talk entitled "Fight or Flight: Portfolio Rebalancing By Individual Investors." The talk is based on joint work with Laurent E. Calvet and Paolo Sodini; their paper is available from the course website. The presentation will be at noon on Wednesday, April 25 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided. An abstract of the talk follows on the jump:
Fight Or Flight? Portfolio Rebalancing By Individual Investors Laurent E. Calvet, John Y. Campbell and Paolo Sodini
This paper investigates the dynamics of individual portfolios in a unique dataset containing the disaggregated wealth and income of all households in Sweden. Between 1999 and 2002, stock market participation slightly increased but the average share of risky assets in the financial portfolio of participants fell moderately, implying little aggregate rebalancing in response to the decline in risky asset prices during this period. We show that these aggregate results conceal strong household-level evidence of active rebalancing, which on average offsets about one half of idiosyncratic passive variations in the risky asset share. Sophisticated households with greater education, wealth, and income, and holding better diversified portfolios, tend to rebalance more aggressively. We also study the decisions to enter and exit risky financial markets. More sophisticated households are more likely to enter, and less likely to exit. Portfolio characteristics and performance also influence exit decisions. Households with poorly diversified portfolios and poor returns on their mutual funds are more likely to exit; however, consistent with the literature on the disposition effect, households with poor returns on their directly held stocks are less likely to exit.
19 April 2007
Since Jim's post has brought us back to the SUTVA problem, here is another situation to consider. Let's say that I am interested in the effect of starting order on the performance of athletes in some competition. For the sake of argument, let's say cycling. We might conjecture that starting in the first position in a pack of cyclists conveys some advantage, since the leader can stay out of trouble in the back of the pack. On the other hand, there might be an advantage to starting in a lower position so that the cyclist can take advantage of the draft behind the leaders.
It is pretty clear what we would like to estimate in this case. If X is the starting position from 1 to n, and Y is the length of time that it takes the athlete to complete the race, then the most intuitive quantity for the causal effect of starting first instead of second is E[Y_i|X_i=1] - E[Y_i|X_i=2], etc. We still have the fundamental problem of causal inference in that we only observe one of the potential outcomes, but average treatment effects also make sense in this case, defining the ATE as E[Y|X=1] - E[Y|X=2]. Moreover, there is a clear manipulation involved (I can make you start first or I can make you start second), and such a manipulation would be easy to implement using a physical randomization to ensure balance on covariates in expectation. Indeed, this procedure is used in several sports; one example is the keirin race in cycling, which is a paced sprint competition among 6-9 riders.
So far, so good, but there is a problem...
It is pretty clear that we have a SUTVA violation here. The issue is not that if Cyclist A is assigned to start in position 2, then Cyclist B must be assigned to some other position; SUTVA (as I understand it) doesn't require that it be possible for all subjects to be assigned to all values of the treatment. The problem is that the potential outcome for Cyclist A starting in position 2 may depend on whether Cyclist B is assigned to position 1 and Cyclist C to position 3, or vice versa. What if B is a strong cyclist who likes to lead from the front, enabling A to draft for most of the race, while C is a weak starter who invariably falls to the back of the pack? In that case, E[Y_A | X_A = 2, X_B = 1, X_C = 3] will not equal E[Y_A | X_A = 2, X_B = 3, X_C = 1]. In other words, in this case there is interference between units. So the non-interference aspect of SUTVA is violated, and therefore E[Y|X=1] - E[Y|X=2] isn't a Rubin causal effect. Bummer.
On the other hand, if we are able to run this race over and over again with the same cyclists, we are in a sense going to average over all of the assignment vectors. If we then take the observed data and plot E[Y|X = x], we are going to get a relationship in the data that is purely a function of the manipulation that we carried out. How should we think about this quantity? I would think that a reasonably informed lay person would interpret the difference in race times in a causal manner, but what, precisely, are we estimating and how should we talk about it? I'd love to hear any suggestions, particularly since it relates to a project that I've been working on (and might have more to say about in a few weeks).
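To make this concrete, here is a toy simulation of the thought experiment above, with entirely made-up race times and a hypothetical drafting rule (none of this comes from real cycling data). Each rider's time depends on the full assignment vector, so re-running the randomized race many times recovers an observed E[Y|X=1] - E[Y|X=2] that averages over assignment vectors rather than holding them fixed:

```python
import random

# Hypothetical sketch: three cyclists (A, B, C) whose race time in a given
# start position depends on who starts where -- the interference structure
# described above. All numbers are invented for illustration.

def race_time(rider, assignment):
    """Race time of `rider` given the full assignment {rider: position}."""
    base = {"A": 60.0, "B": 58.0, "C": 62.0}[rider]
    pos = assignment[rider]
    time = base + 0.5 * (pos - 1)  # small penalty for starting further back
    # Interference: A drafts well whenever B starts directly ahead of A,
    # so A's potential outcome depends on B's assignment, not just A's.
    if rider == "A" and assignment["B"] == pos - 1:
        time -= 1.0
    return time

def simulate(n_races=100_000, seed=0):
    """Average observed times in positions 1 and 2 over random assignments."""
    rng = random.Random(seed)
    totals = {1: 0.0, 2: 0.0}
    counts = {1: 0, 2: 0}
    for _ in range(n_races):
        positions = [1, 2, 3]
        rng.shuffle(positions)  # physical randomization of start order
        assignment = dict(zip("ABC", positions))
        for rider in "ABC":
            p = assignment[rider]
            if p in totals:
                totals[p] += race_time(rider, assignment)
                counts[p] += 1
    return {p: totals[p] / counts[p] for p in totals}

means = simulate()
# The observed difference mixes unit-level effects with the distribution of
# other riders' assignments -- it is not a fixed-assignment causal contrast.
print(means[1] - means[2])
```

Under these made-up numbers the difference works out to roughly -1/3 of a second, but its interpretation is exactly the puzzle raised in the next paragraph: it is an average over assignment vectors, not a Rubin causal effect for any fixed configuration of the other riders.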
18 April 2007
Around a month ago, I blogged about the dangers of using appellate case outcomes as datapoints. The basic idea is that most models or inference structures assume some kind of independence among the units, perhaps independence given covariates (in which case the residuals are assumed to be i.i.d.), or perhaps the "Stable Unit Treatment Value Assumption" in the causal inference context. When applied to appellate cases in the United States legal system, these analyses assume away precedent. The instincts I developed as a practicing litigator tell me not to believe a study that assumes away precedent.
One solution to this problem previously proposed in the causal inference literature is to match "treated" and "control" appellate cases that are very close in time to each other (whatever "treated" and "control" are here). After a conversation I had with Mike Kellermann a week or so ago, I think this cure may be worse than the disease. The idea behind comparing cases very close in time to one another is that the general state of the law (in part defined by precedent) for the two cases will be similar. That's right, but recent developments in the law are more on the minds of judges.
Suppose Case A got treatment, and Case B got control. If the matching algorithm has worked, Case A and Case B will be similar in all ways except the treatment. If Case A and Case B are also close in time to one another, how plausible is it the judges who decide both will decide them without regard to each other?
17 April 2007
The Economist and Time Magazine recently published interesting articles on a new type of twins. Apparently some twins are neither identical nor fraternal but "semi-identical": one twin is male and the other "intersex" (both male and female). You can read a short discussion of the biology in the articles, which also note that it's unknown how common this type of twin is. More to worry about for believers in twin studies (for other problems, see this earlier post).
16 April 2007
This week, the Applied Statistics Workshop will present a talk by Skyler Cranmer, a Ph.D. candidate in the Department of Political Science at the University of California - Davis and a visiting scholar at IQSS. He earned a BA in Criminal Justice and an MA in International Relations before starting the program at Davis. His research interests in political methodology include statistical computing, missing data problems, and formal theory.
Skyler will present a talk entitled "Hot Deck Imputation for Discrete Data." The paper is available from the course website. The presentation will be at noon on Wednesday, April 18 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided. An abstract follows on the jump:
Hot Deck Imputation for Discrete Data
Skyler J. Cranmer
In this paper, I develop a technique for imputing missing observations in discrete data. The technique used is a variant of hot deck imputation called fractional hot deck imputation. Because the imputed value is a draw from the conditional distribution of the variable with the missing observation, the discrete nature of the variable is maintained as its missing values are imputed. I introduce a discrete weighting system to the fractional hot deck imputation method. I weight imputed values by the fraction of the original weight of the missing element assigned to the value of the donor observation based on its degree of affinity with the incomplete observation and am thus able to make confidence statements about imputed results; hot decking in the past has been limited by the inability to make such confidence statements.
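For readers unfamiliar with the baseline technique, here is a minimal sketch of plain (non-fractional) hot deck imputation for a discrete variable: a missing value is filled in by drawing from observed "donor" records that match the incomplete record on a covariate. The data, function names, and matching rule are all invented for illustration, and the paper's fractional weighting and affinity measure are not implemented here:

```python
import random

# Minimal sketch of basic hot deck imputation for a discrete variable.
# Because the imputed value is drawn from observed donor values, the
# discrete nature of the variable is preserved -- the property the
# abstract emphasizes.

def hot_deck_impute(records, target, match_on, seed=0):
    """Impute None values of `target` by drawing from matching donors."""
    rng = random.Random(seed)
    completed = []
    for rec in records:
        rec = dict(rec)
        if rec[target] is None:
            donors = [r[target] for r in records
                      if r[target] is not None and r[match_on] == rec[match_on]]
            if donors:
                rec[target] = rng.choice(donors)  # draw keeps the variable discrete
        completed.append(rec)
    return completed

# Toy data: impute a missing "vote" from donors in the same region.
data = [
    {"region": "N", "vote": "yes"},
    {"region": "N", "vote": "yes"},
    {"region": "S", "vote": "no"},
    {"region": "N", "vote": None},  # to be imputed from region-N donors
]
imputed = hot_deck_impute(data, target="vote", match_on="region")
```

In a fractional scheme like the one in the paper, the missing element's weight would instead be split across several donor values, which is what permits the confidence statements the abstract describes.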
11 April 2007
I've posted before about the various ways that today's mass media interact badly with the cognitive heuristics people use, in ways that create apparently irrational behavior. Spending a fair amount of time recently standing in long security lines at airports crystallized another one for me.
The availability heuristic describes people's tendency to judge emotionally salient or memorable events as more probable than events that aren't, even when the latter are actually statistically more likely. One classic place you see this is in estimates of the risk of dying in a terrorist attack: even though the odds of dying this way are exceedingly low (if you live in most countries, at least), we tend to spend far more resources, proportionally, fighting terror than dealing with more prosaic dangers like automobile accidents or poverty. There might be other valid reasons for the disproportionate spending: terrorism is part of a web of other foreign-policy issues that we need to focus on for long-term benefits; people don't want to sacrifice the freedoms (like more restrictive speed limits) that would be necessary to make cars safer; and it's not very clear how to solve some problems (like poverty) at all. I really don't want to get into those debates. The point is just that most everyone would agree that, in all of those cases, at least part of the reason for the disproportionate attention is that dying in a terrorist attack is much more vivid and sensational than dying an early death from the accumulated woes of living in poverty. And there's plenty of actual research showing that the availability heuristic plays a role in many aspects of prediction.
There's been a lot of debate about whether this heuristic is necessarily irrational. Evolutionarily speaking, it might make a lot of sense to pay more attention to the more salient information. To steal an example from Gerd Gigerenzer, if you live on the banks of a river and for 1000 days there have been no crocodile sightings there, but yesterday there was, you'd be well-advised to disregard the "overall statistics" and keep your kids from playing near the river today. It's a bit of a just-so story, but a sensible one, from which we might infer two possible morals: (a) as Steven Pinker pointed out, since events have causal structure, it might make sense to pay more attention to more recent ones (which tend to be more salient); and (b) it also might make sense to pay more attention to emotionally vivid ones, which give a good indication of the "costs" of being wrong.
However, I think the problem is that when the information comes from mass media, neither of these reasons applies as well. Why? If your information doesn't come from mass media, you can assume, to a good approximation, that the events you hear about are statistically representative of the events you are likely to encounter. If your information comes from mass media, you cannot assume this. Mass media report events from all over the world with the same vividness and impact as if they had happened in the next town over. And while it might be rational to worry a lot about crime if there are consistently shootings in your neighborhood, it doesn't make as much sense to worry because of multiple shootings in cities hundreds of miles away. Similarly, because mass media report on news, that is, statistically rare occurrences, it is easy to get the dual impression that (a) rare events are less rare than they actually are, and (b) there is a "recent trend" that needs to be paid attention to.
In other words, while it might be rational to keep your kids in if there were crocodile attacks at the nearby river yesterday, it's pretty irrational to keep them in if there were attacks at the river a hundred miles away. Our "thinking" brains know this, but if we see those attacks as rapidly and as vividly as if they were right here -- i.e., if we watch them on the nightly news -- then it's very hard to listen to the thinking brain... even if you know about the dangers. And cable TV news, with its constant repetition, makes this even harder.
The source of the problem is the sampling structure of mass media, but it's of course far worse if the medium makes the message more emotional and vivid. So there's probably much less of a problem if you get most of your news from written sources, especially multiple different ones, than from TV news. That's what I would guess, at least, though I don't know if anyone has actually done the research.
10 April 2007
I was recently involved in a discussion among fellow grad students about what determines which statistical software package people use to analyze their data. For example, this recent market survey lists 44 products from 31 vendors, and it does not even include packages like R that many people around Harvard seem to use. Another survey, conducted by Alan Zaslavsky, lists 15 packages while "just" looking at the available software for the analysis of surveys with complex sample designs. So how do people pick their packages given the plethora of options? Obviously, many factors go into this decision (departmental teaching, ease of use, type of methods used, and so on). One particularly interesting factor in our discussion was academic discipline. It seems to be the case that different packages are popular in different disciplines, but exactly how usage patterns vary across fields remains unclear. Does any systematic data exist on this issue? For example, how many political scientists use R compared to other programs? What about statisticians, economists, sociologists, etc.? Any information would be highly appreciated.
9 April 2007
This week, the Applied Statistics Workshop will present a talk by Gary King, the David Florence Professor of Government at Harvard and the Director of the Institute for Quantitative Social Science. He has published over 100 articles, and his work has appeared in journals in public health, law, sociology, and statistics, as well as in every major journal in political science. He is the author or co-author of seven books, many of which are standards in their field. His research has been recognized with numerous awards, and he is one of the most cited authors in political science. He is also the faculty convenor of this blog.
Professor King will present a talk entitled "How to Read 100 Million Blogs (and How to Classify Deaths without Physicians)." The talk is based on two papers, one co-authored with Dan Hopkins and the other with Ying Lu. The presentation will be at noon on Wednesday, April 11 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided. An abstract of the talk and links to the papers follow on the jump:
How to Read 100 Million Blogs (and How to Classify Deaths without Physicians)
Gary King
We develop a new method of computerized content analysis that gives approximately unbiased and statistically consistent estimates of quantities of theoretical interest to social scientists. With a small subset of documents hand coded into investigator-chosen categories, our approach can give accurate estimates of the proportion of text documents in each category in a larger population. The hand coded subset need not be a random sample, and may differ in dramatic but specific ways from the population. Previous methods require random samples, which are often infeasible in social science text analysis applications; they also attempt to maximize the percent of individual documents correctly classified, a criterion which leaves open the possibility of substantial estimation bias for the aggregate proportions of interest. We also correct, apparently for the first time, for the far less-than-perfect levels of inter-coder reliability that typically characterize human attempts to classify documents, an approach that will normally outperform even population hand coding when that is feasible. We illustrate the effectiveness of this approach by tracking the daily opinions of millions of people about candidates for the 2008 presidential nominations in online blogs, data we introduce and make available with this article. We demonstrate the broad applicability of our approach through additional evaluations in a variety of available corpora from other areas, including large databases of movie reviews and university web sites. We also offer easy-to-use software that implements all methods described.
The methods for a key part of this paper build on King and Lu (2007), which the talk will also briefly cover. That paper offers a new method of estimating cause-specific mortality in areas without medical death certification from "verbal autopsy data" (symptom questionnaires given to caregivers). The method turned out to give estimates considerably better than the existing approaches, which include expensive and unreliable physician reviews (where three physicians spend 20 minutes with the answers to the symptom questions from each deceased person to decide on the cause of death), expert rule-based algorithms, and model-dependent parametric statistical models.
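The aggregate-proportion idea behind both papers can be gestured at with a deliberately tiny example. If S is an observable summary of a document (here, a single binary feature) and D its true category, then P(S) = Σ_D P(S|D)P(D). Estimate P(S|D) from a hand-coded subset, observe P(S) in the full population, and solve for the category proportions P(D) without classifying any individual document. The numbers below are invented, and the real method works with the joint distribution of many word-stem features rather than one:

```python
# Two-category illustration of solving for aggregate proportions:
#   P(S) = P(S|D=0) * p0 + P(S|D=1) * (1 - p0)
# Everything here is a made-up toy, not the authors' actual estimator.

def solve_two_categories(p_s_given_d0, p_s_given_d1, p_s):
    """Solve p_s = p_s_given_d0 * p0 + p_s_given_d1 * (1 - p0) for p0."""
    p0 = (p_s - p_s_given_d1) / (p_s_given_d0 - p_s_given_d1)
    return p0, 1 - p0

# Hand-coded subset: the feature appears in 80% of category-0 documents
# and 20% of category-1 documents. Full population: it appears in 35%.
p0, p1 = solve_two_categories(0.8, 0.2, 0.35)
# p0 comes out to roughly 0.25 and p1 to roughly 0.75: three quarters of
# documents are in category 1, even though no document was classified.
print(p0, p1)
```

Note that nothing in this sketch requires the hand-coded subset to be a random sample of the population; what must carry over is the conditional distribution of the feature given the category, which is the key assumption the paper discusses.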
Copies of the two papers are available at:
It looks like those of us who would like more sophisticated reporting of statistical results in major media outlets have an ally in Byron Calame, the public editor for the New York Times. We've blogged before about his concerns about the Times' coverage of statistical data. This week, he's taking on the ubiquitous Nielsen television ratings, ratings generated from surveys yet never reported with uncertainty estimates. The best paragraph from the piece:
Why not at least tell readers that Nielsen didn’t provide the margin of error for its “estimates”? I put that question to Bruce Headlam, the editor in charge of the Monday business section, where charts of Nielsen’s audience data appear weekly. “If we run a large disclaimer saying, in effect, this company is withholding a critical piece of information, I imagine many readers would simply turn the page,” he wrote in an e-mail.
Imagine that; readers might want their news to be, well, news!
6 April 2007
We've talked a lot on the blog about good ways of visualizing data. For something a little lighter this Friday, here is one of the more unusual visualizations that I've come across: a time series of real housing prices represented as a roller coaster, which you can 'ride'. It isn't perfect; they need a little ticker that shows you what year you are in, but it is a neat idea. It would be fun to do something similar with presidential approval.
(hat tip: Big Picture)
4 April 2007
With a coauthor, I am involved in a project that in part attempts to assess the effect of assigning judge A versus judge B on outcomes at the trial level in criminal cases. I've begun a literature search on this, and it seems that most attention thus far has focused on the sentencing stage (particularly relating to the controversy over the federal sentencing guidelines), and that few authors have used what one might call modern or cutting-edge causal inference thinking. Can anyone out there help here? Am I missing important studies?
(Feel free to email me off-blog if you'd prefer.)
The Cambridge Colloquium on Complexity and Social Networks is sponsoring a talk tomorrow that may be of some interest to readers of this blog. Details below:
"Taking Person, Place, and Time Seriously in Infectious Disease Epidemiology and
Devon D. Brewer, University of Washington
Thursday, April 5, 2007
12:00 - 1:30 p.m.
CGIS North, 1737 Cambridge Street, Room N262
Abstract: Social scientists and field epidemiologists have long appreciated the role of social networks in diffusion processes. The cardinal goal of descriptive epidemiology is to examine "person, place, and time" in relation to the occurrence of disease or other health events. In the last 20 years, most infectious disease epidemiologists have moved away from the field epidemiologist's understanding of transmission as embedded in contact structures and shaped by temporal and locational factors. Instead, infectious disease epidemiologists have employed research designs that are best suited to studying non-infectious chronic diseases but unable to provide meaningful insight on transmission processes. A comprehensive and contextualized infectious disease epidemiology requires assessment of person (contact structure and individual characteristics), place, and time, together with measurement of specific behaviors, physical settings/fomites, and the molecular biology of pathogens, infected persons, and susceptible persons. In this presentation, I highlight examples of research that include multiple elements of this standard. From this overview, I show in particular how the main routes of HIV transmission in poor countries remain unknown as a consequence of inappropriate design in epidemiologic research. In addition, these examples highlight how diffusion research in the social sciences might be improved with greater attention to temporal and locational factors.
Devon D. Brewer, Ph.D., Director, has broad training and experience in the social and health sciences. Much of his past research has focused on social networks, research methods and design, memory and cognition, drug abuse, violence, crime, sexual behavior, and infectious disease (including sexually transmitted diseases, HIV, and hepatitis C). He earned his bachelor's degree in anthropology from the University of Washington and his doctorate in social science from the University of California, Irvine. Prior to founding Interdisciplinary Scientific Research, Dr. Brewer held research positions at the University of Washington, an administrative position with Public Health-Seattle and King County, and teaching positions at the University of Washington, Pacific Lutheran University, and Tulane University. He has been a principal investigator on federal research grants and authored/co-authored more than 60 scientific publications.
3 April 2007
Here is some inspiration on how to present data to a non-expert audience: www.gapminder.org. The goal of the site is "to make sense of the world by having fun with statistics," by making publicly available but highly complex data understandable to the general public. Their reasoning is that even the best data won't make any difference unless you can communicate it well to a large audience. And they do a fantastic job of just that.
There are two neat things on this site. First is the Trendalyzer, an interactive tool for visualizing data. The software takes boring statistical tables and juices them up in an interactive fashion. For example you can watch the world income distribution evolve over time, and single out particular regions and countries to get a better sense of what’s driving the trends. It also shows how aggregates can be deceiving within regions and countries. Many of the pre-designed presentations are on human development, but you can pick your own indicators. I saw this in a lecture on income inequalities, and it was a major hit. The software has been acquired by Google which apparently wants to add features and make it freely available.
The second interesting item is a presentation by Hans Rosling, the founder of Gapminder, at the TED 2007 conference (Technology Entertainment Design, which aims to gather inspiring minds). He debunks "myths about the developing world" using the Trendalyzer and plenty of personal animation. He does such a great job of engaging his audience that many a workshop presenter could learn from watching him. He's more like a sports commentator than an academic, jumping up and down in front of the screen and still getting his message across.
2 April 2007
This week, the Applied Statistics Workshop will present a talk by Richard Berk, professor of criminology and statistics at the University of Pennsylvania. Professor Berk received his Ph.D. from Johns Hopkins University and served on the faculties of Northwestern, UC-Santa Barbara and UCLA before moving to Penn in 2006. He has published widely in journals in statistics and criminology. His research focuses on the application of statistical methods to questions arising in the criminal justice system. One of his current projects is the development and application of statistical learning procedures to anticipate failures on probation or parole and to forecast crime “hot spots” a week in advance.
Professor Berk will present a talk entitled "Counting the Homeless in Los Angeles County," which is based on joint work with Brian Kriegler and Donald Ylvisaker. Their paper is available through the workshop website. The presentation will be at noon on Wednesday, April 4 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided. An abstract of the paper follows on the jump:
Counting the Homeless in Los Angeles County
Richard Berk
Department of Criminology
University of Pennsylvania
Over the past two decades, a variety of methods have been used to count the homeless in large metropolitan areas. In this paper, we report on a recent effort to count the homeless in Los Angeles County. A number of complications are discussed, including the need to impute homeless counts to areas of the County not sampled and to take into account the relative costs of underestimates and overestimates of the number of homeless individuals. We conclude that despite their imperfections, the estimated counts provided useful and credible information to the stakeholders involved. Of course, not all stakeholders agreed.
Joint work with Brian Kriegler and Donald Ylvisaker.