31 March 2008
Here is a neat application of simulation from this weekend's New York Times. The authors, a graduate student and professor at Cornell, simulated the entire history of Major League Baseball 10,000 times to see just how "mythic" Joe DiMaggio’s 56-game hitting streak really is. They find that 56-game streaks are not at all unusual, and furthermore that Joe DiMaggio wasn't even the most likely to set the record!
For those who are interested in doing some simulations of their own, my guess is that the authors used the Lahman Baseball Database, which is freely available online. Perhaps in some future post I'll take a look at some simulations of other baseball records. Any suggestions for what to look at?
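For a flavor of what such a simulation involves, here is a minimal Python sketch with made-up parameters (a hypothetical .325 hitter with four at-bats per game over 154-game seasons). The authors instead used each player's actual historical statistics, year by year, so treat this only as an illustration of the mechanics:

```python
import random

def longest_streak(rng, p_hit_ab=0.325, at_bats=4, games=154):
    """Longest hitting streak in one simulated 154-game season.

    p_hit_ab is a hypothetical per-at-bat hit probability; the actual
    study used each player's real historical statistics instead.
    """
    best = cur = 0
    for _ in range(games):
        # the streak continues if at least one at-bat in the game is a hit
        hit_in_game = any(rng.random() < p_hit_ab for _ in range(at_bats))
        cur = cur + 1 if hit_in_game else 0
        best = max(best, cur)
    return best

def simulate(n_seasons=10000, seed=42):
    """Record streak across n_seasons independently simulated seasons."""
    rng = random.Random(seed)
    return max(longest_streak(rng) for _ in range(n_seasons))
```

Running `simulate()` over many simulated "histories" gives a distribution for the record streak, which is exactly how one can ask whether a 56-game streak is mythic or merely expected.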
Please join us this Wednesday when Nicholas Christakis--Professor, Department of Sociology (Harvard University) and Medical Sociology (Harvard Medical School)--will present "Eat Drink and Be Merry: The Spread of Health Phenomena In Social Networks". Nicholas provided the following abstract:
Our work has involved the quantitative investigation of whether and how various health-related phenomena might spread from person to person. For example, we explored the nature and extent of person-to-person spread of obesity. We developed a densely interconnected network of 12,067 people assessed repeatedly from 1971 to 2003. We used longitudinal statistical models and network-scientific methods to examine whether weight gain in one person was associated with weight gain in friends, siblings, spouses, and neighbors. Discernible clusters of obese persons were present in the network at all time points, and the clusters extended three people deep. These clusters were not solely due to selective formation of social ties. A friend becoming obese in a given time interval increased a person's chances of becoming obese by 57% (95% CI: 6%-123%). Among pairs of adult siblings, one becoming obese increased the chance that the other became obese by 40% (21%-60%). Among spouses, one becoming obese increased the likelihood that the other became obese by 37% (7%-73%). Among those working in small firms, a co-worker becoming obese increased a person's chances of becoming obese by 41% (17-59%). Immediate neighbors did not exhibit these effects. We have also conducted similar investigations of other health behaviors, such as smoking, drinking, exercising, and the receipt of health screening, and of other health phenomena, such as happiness and depression. Various aspects of our findings suggest that the spread of social norms may partly underlie inter-personal health effects. Our findings have implications for clinical and public health interventions, and for cost-effectiveness assessments of preventive and therapeutic interventions. They also lay a new foundation for public health by providing a rationale for the claim that health is not just an individual, but also a collective, phenomenon.
Nicholas also provided a link to his paper here
The applied statistics workshop meets in room N354 in CGIS-Knafel (1737 Cambridge St.). A light lunch will be served at 12 noon with the presentation beginning around 1215. Please contact me with any questions.
28 March 2008
A friend just referred me to Processing, a powerful language for visualizing data:
Processing is an open source programming language and environment for people who want to program images, animation, and interactions. It is used by students, artists, designers, researchers, and hobbyists for learning, prototyping, and production. It is created to teach fundamentals of computer programming within a visual context and to serve as a software sketchbook and professional production tool. Processing is developed by artists and designers as an alternative to proprietary software tools in the same domain.
Their exhibition shows some very impressive results. For example, I liked the visualization of the London Tube map by travel time. I lived in Russell Square once, so this evoked pleasant memories:
If you can spare a minute, also take a look at the other exhibited pieces. Most are art rather than statistics. For chess friends I especially recommend the piece called "Thinking Machine 4" by Martin Wattenberg, who gave a talk at the IQSS applied stats workshop in the fall. Enjoy!
27 March 2008
Recently I read an article by Erin Leahey on how statistical significance testing, the 0.05 cut-off value, and the three-star system became legitimized and dominant in mainstream sociology. According to Erin, one star stands for p<=.05, two stars for p<=.01, and three stars for p<=.001 (though I had thought the conventional cut-offs were something like .10, .05, and .01). Erin attributes the first use of the .05 significance level to R. A. Fisher's 1935 book, Design of Experiments. She notes that other forms of significance testing besides the .05 test were already popular in the 1930s, when close to 40 percent of articles published in ASR and AJS applied some form of significance testing procedure. Based on the articles she sampled from ASR and AJS, Erin shows that the popularity of statistical significance testing and the 0.05 cut-off roughly followed an "S" shape: usage rose from the 1930s to 1950, declined until 1970, and has revived since then. Currently, around 80 percent of articles published in ASR and AJS employ both practices. The three-star system emerged in the 1950s but became popular only after 1970; slightly more than 40 percent of articles published in these two top sociological journals now use it.
So what accounts for the diffusion of such practices? Erin offers several arguments. For example, she argues that institutional factors such as investment in research and computing, graduate training, an institution's academic status, and journal editors' individual preferences were among the most important factors in the diffusion of these practices. Interestingly, she finds that graduating from Harvard had a significant negative "effect" on adopting these statistical practices. :-)
Of course, as with almost all research, Erin's study has some minor drawbacks. For example, her sample is drawn only from the top two sociological journals, so the generalizability of her findings could be limited. But overall, it is a fun read. And if you are interested in a more historical account of how statistical practices were introduced to and became legitimized in the social sciences in general, Camic and Xie (1994) is a very good start.
Leahey, Erin. 2005. "Alphas and Asterisks: The Development of Statistical Significance Testing Standards in Sociology." Social Forces 84:1-24.
Camic, Charles, and Yu Xie. 1994. “The Statistical Turn in American Social Science: Columbia University, 1890-1915.” American Sociological Review 59:773-805.
26 March 2008
I'm guessing many of the readers of this blog will get a kick out of this article, which I received in a Research Methodology class I am taking with Prof. Richard Hackman of the Psychology Department.
A joint project by Andy Eggers and Jens Hainmueller, two long-time contributors to this blog, is the basis of a piece in The Guardian this Monday. Check out the article "How election paid off for postwar Tory MPs" and the paper "MPs For Sale? Estimating Returns to Office in Post-War British Politics". Congrats to Andy and Jens!
A few weeks ago I attended a talk by David Card, a Berkeley economist currently on leave here at Harvard. Card's talk was on a new paper, written with Carlos Dobkin and Nicole Maestas, entitled "Does Medicare Save Lives?"
In the paper, Card and his coauthors analyze data on over 400,000 hospital emergency room encounters in California for "non-deferrable" admissions, defined as conditions for which admission rates do not vary by day of the week.
Given the rather strict age cutoff for Medicare* eligibility (which, with a few exceptions, starts in the month one turns 65), and the fact that using non-deferrable ER admissions helps ensure that individuals within a narrow age band have similar underlying health, the authors are able to employ a regression discontinuity design to estimate the effect on mortality of becoming eligible to receive benefits under Medicare. Strikingly, their principal finding is that Medicare eligibility reduces mortality among their study cohort by 20 percent. That is a huge result!
Card mentioned in his talk that he and his coauthors were fairly surprised by the magnitude of this finding. So, what could explain this large decrease in mortality?
As the authors note, the magnitude is too large to be explained by the added health benefit of gaining coverage for the 8 percent of their sample that was previously uninsured. Moreover, the drop in mortality is also seen among individuals who had other coverage prior to Medicare. As an alternative explanation, the authors suggest that the result may be driven by improved "insurance generosity" of gaining Medicare coverage at age 65. That is, if a typical insurance policy for a non-Medicare eligible near-elderly citizen contains a lot of restrictions or administrative hurdles, then the more generous coverage and fewer restrictions provided by Medicare may result in more timely delivery of care, thus reducing mortality.
Here's one mechanism through which I think this explanation could be working. One question I raised during the talk was whether they had any data on the mode of arrival to the ER (unfortunately, they don't). Several years ago I actually worked as an Emergency Medical Technician for an ambulance service in rural Tennessee, and one of the most striking things about working in prehospital care is that the vast majority of ambulance calls are for Medicare recipients. Now, in part this is the result of the obvious fact that folks on Medicare are, on average, in poorer health than everyone else. But, I wonder whether the generous coverage of prehospital care under Medicare causes beneficiaries to call the ambulance, and thus receive earlier medical intervention, more than they would under a standard insurance policy (under which coverage for ambulances is more variable). Given the enormous clinical impact of early intervention on mortality, particularly for conditions such as heart attacks and strokes (which likely make up a good portion of the ER sample used here), this fact could help explain much of the drop in mortality.
In any case, I think the Card paper is a neat example of the use of a regression discontinuity design. The major downside to these designs, however, is that they gobble up the effective sample size, since identification essentially comes from individuals who are assumed to be randomly distributed around the cutoff point. So even with 400,000 observations, it's tough for the authors to really drill down to see which specific health events are showing the biggest declines in mortality.
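To see the regression discontinuity logic in miniature, here is a Python sketch on simulated data. The mortality model, the assumed 20 percent eligibility effect, and the simple difference in means within a narrow bandwidth are all inventions for illustration, not the authors' actual data or specification (they use a richer local regression approach):

```python
import random

def died(age, rng):
    """Hypothetical mortality model: risk rises with age, with an assumed
    20% drop at the age-65 Medicare eligibility cutoff."""
    risk = 0.05 + 0.004 * (age - 60)
    if age >= 65:
        risk *= 0.8
    return rng.random() < risk

def rd_estimate(n=200_000, bandwidth=2.0, seed=0):
    """Difference in mean mortality just above vs. just below age 65.

    A simple difference in means within a narrow bandwidth, not the
    authors' actual estimator. A negative estimate means that
    eligibility lowers mortality.
    """
    rng = random.Random(seed)
    below, above = [], []
    for _ in range(n):
        age = rng.uniform(60, 70)  # simulated ER patients near the cutoff
        if 65 - bandwidth <= age < 65:
            below.append(died(age, rng))
        elif 65 <= age < 65 + bandwidth:
            above.append(died(age, rng))
    return sum(above) / len(above) - sum(below) / len(below)
```

Note also why the design eats sample size: only the draws that land inside the bandwidth contribute to the estimate at all.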
*For those not familiar, Medicare is a social health insurance program provided to elderly U.S. citizens; it's sometimes confused with Medicaid, which is an insurance program for very low-income families.
20 March 2008
We're lucky to have two contested Presidential primaries. One of my favorite habits is to look at cross-tabs of candidate preferences by party and county. Here's an example of an Iowa cross-tab, showing the number of Iowa counties by Republican winner and Democratic winner:
We can visualize cross-tabs using mosaic plots as in "Visualizing Categorical Data." I did it for nine primary states in the image below. The green represents Obama counties, the orange Hillary counties and the purple Edwards counties. Across the columns are the Republican candidates: McCain, Romney, Huckabee. Across the rows, Obama, Hillary and Edwards. Check it out here. If you instead prefer an inverted version, with Republicans across the rows and Democrats across the columns (this makes it easier to compare the Democrats), check it out here.
The conclusions are similar across most states: Huckabee and Edwards are clearly the most complementary candidates. They shared counties wherever Edwards was in play (Iowa, Florida); after that, Huckabee shared Clinton counties. In Missouri every single county he won was a Clinton county! So Huckabee and Clinton are somewhat complementary as well. Neither McCain nor Romney is particularly complementary with any Democrat (see California, where McCain and Romney split the Hillary-Obama counties), though both did better in Obama counties when Huckabee was in play.
One distracting feature of the plots above is that counties aren't uniformly populous. Obama won Missouri by winning only six counties. An alternative interpretation is to view this as an ecological inference problem, in which we are trying to determine the population totals in each of the cross-tab cells. This isn't perfectly accurate, since Edwards voters don't actually also vote for Huckabee. But it does provide a nice framework for scaling the mosaic plot by population size, and making it look generally less degenerate. I did that using Ryan Moore's eiPack and got this.
Yesterday I went to Professor Stanley Lieberson's class, Issues in the Interpretation of Empirical Evidence. We discussed a paper by Stan and Glenn Fuguitt titled "Correlation of Ratios or Difference Scores Having Common Terms." The basic argument is that although ratios and difference scores are often used as dependent variables in regression analysis, if an independent variable shares a common term with the dependent variable, the estimated coefficients can be severely biased due to the spurious correlation introduced by that common term (whether it appears in the numerator or the denominator). For example, if the dependent variable is of the form X/Z while an independent variable is something like Y/Z, Z, or Z/X, the estimated coefficient can appear statistically significant even when X, Y, and Z are mutually independent.
For some concrete examples: criminologists often use the crime rate (crimes divided by city population) as the dependent variable while also using city population as an independent variable; organizational researchers study the relationship between the relative size of an organization's administration and the organization's absolute size; and economists often regress GDP per capita on the population growth rate, or even on population size itself. According to Stan and Fuguitt, all of these examples will yield spurious coefficients, since the dependent and independent variables share common terms. They trace this finding back to a paper by Karl Pearson in 1897, in which Pearson showed rigorously where the spurious correlation comes from and derived an approximate formula for the correlation of ratios.
We were asked to run an experiment demonstrating this spurious correlation. We generated three sets of random integers (X, Y, Z) ranging from 1 to 99, computed the pairwise correlation matrix, and found no significant correlation between any pair. But we did find a significant correlation between Y/X and X, and when we regressed Y/X on X, the coefficient was significant too. So by manipulations like division or subtraction, we artificially built a significant correlation between two originally uncorrelated random variables.
Why not try the following in Stata to see whether the above claims are overstated?
set obs 50
gen x = 1 + int(99*runiform())  // random integers from 1 to 99
gen y = 1 + int(99*runiform())  // (in Stata versions before 10.1, use uniform())
gen z = 1 + int(99*runiform())
pwcorr x y z, sig
gen ydx = y/x
pwcorr x ydx, sig
reg ydx x
gen xdz = x/z
gen ydz = y/z
pwcorr xdz ydz, sig
reg xdz ydz
gen zdy = z/y
pwcorr xdz zdy, sig
reg xdz zdy
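The same experiment is easy to replicate outside Stata. Here is a quick Python version, with n raised to 2,000 so the pattern is unmistakable:

```python
import math
import random

def pearson_r(a, b):
    """Plain Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

rng = random.Random(2008)
n = 2000
x = [rng.randint(1, 99) for _ in range(n)]
y = [rng.randint(1, 99) for _ in range(n)]

# x and y are independent, so their correlation is near zero
r_xy = pearson_r(x, y)
# but y/x shares the common term x with x itself, inducing a strong
# negative correlation even though no real relationship exists
r_ratio = pearson_r(x, [yi / xi for xi, yi in zip(x, y)])
```

On a run like this, r_xy hovers near zero while r_ratio is strongly negative, which is exactly the Pearson/Lieberson-Fuguitt point: the "relationship" is manufactured entirely by the shared term.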
Are you convinced by now? If not, please go read the source paper below (or just write back and say what is wrong with Stan and Fuguitt's argument). If yes, the question becomes what we should do about the spurious correlation. Should we just use the original forms of the variables? Should we re-specify the Solow model? But what if our research interest really is the ratio or the difference?
Lieberson, Stanley, and Glenn Fuguitt. 1974. "Correlation of Ratios or Difference Scores Having Common Terms." In Sociological Methodology 1973-1974, edited by Herbert Costner. San Francisco: Jossey-Bass.
18 March 2008
In a conversation with Kevin Quinn this week I was reminded of a fascinating lecture given at Google in 2006 by Luis von Ahn, an assistant professor in computer science at Carnegie Mellon. Von Ahn gives a very entertaining and thought-provoking talk on ingenious ways to apply human intelligence and judgment on a large scale to fairly small problems that computers still struggle with.
(Or watch video on Google video.)
Von Ahn devises games that produce data, the best-known example being the ESP Game, which Google acquired and developed as Google Image Labeler. In the game, you are paired with another (anonymous) player and shown an image. Each of you feverishly types in words describing the image (e.g., "Spitzer", "politician", "scandal", "prostitution"); you get points and move to the next image when you and your partner agree on a label. The game is fun, even addictive, and of course Google gets a big, free payoff -- a set of validated keywords for each image.
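The core matching rule is simple enough to sketch. The label streams below are invented for illustration; a validated keyword is simply the first label both players have typed:

```python
def first_agreed_label(stream_a, stream_b):
    """Return the first label both players have typed, ESP-game style.

    Labels are consumed in arrival order; the round ends as soon as any
    label has appeared in both players' streams.
    """
    seen_a, seen_b = set(), set()
    for a, b in zip(stream_a, stream_b):
        seen_a.add(a)
        seen_b.add(b)
        # compare each newest label against everything the partner has typed
        if a in seen_b:
            return a
        if b in seen_a:
            return b
    return None  # no agreement before the streams ran out

# two hypothetical players labeling the same news photo
player1 = ["spitzer", "politician", "suit", "scandal"]
player2 = ["governor", "scandal", "press", "politician"]
```

The validation comes from independence: two strangers who converge on the same word are unlikely to both be wrong, which is what makes the resulting keywords trustworthy training data.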
I'm curious about how these approaches can be applied to coding problems in social science. A lot of recent interesting work has involved developing machine learning techniques to teach computers to label text, but there are clearly cases where language is just too subtle and complex to accurately extract meaning, and we need real people to read the text and make judgments. Mostly we hire RAs or do it ourselves; could we devise games instead?
17 March 2008
Please join us this Wednesday as we welcome Kenneth Hill--Harvard School of Public Health, Department of Population and International Health--who will present his research "Global Health and Global Goals: Do Targets Make a Difference?" Kenneth provided the following paper as background for his presentation:
The applied statistics workshop meets in room N-354 in CGIS-Knafel, 1737 Cambridge st. The workshop begins at 12 noon with a light lunch, with presentations usually beginning around 1215.
Please contact me with any questions.
11 March 2008
While the Democratic nomination contest drags on (and on and on...; Tom Hanks declared himself bored with the race last week), attention is turning to hypothetical general election matchups between Hillary Clinton or Barack Obama and John McCain. Mystery Pollster has a post up reporting on state-by-state hypothetical matchup numbers obtained from surveys of 600 registered voters in each state conducted by Survey USA. There is some debate about the quality of the data (Survey USA uses Interactive Voice Response to conduct its surveys, there is no likely voter screen, etc.). But we have what we have.
At this point, the results are primarily of interest to the extent that they speak to the "electability" question on the Democratic side; who is more likely to beat McCain? MP goes through the results state by state, classifying each state into Strong McCain, Lean McCain, Toss-up, etc. From this you can calculate the number of electoral votes in each category, which provides some information but isn't exactly what we're interested in.
This problem is a natural one for the application of some simple, naive Bayesian ideas. If we throw on some flat priors, make all sorts of unreasonably strong independence assumptions, and assume that the results were derived from simple random sampling, we can quickly get posterior distributions for the support for each candidate in each state and can calculate estimates of the probability of victory. From there, it is easy to calculate the posterior distribution of the number of electoral votes for each candidate and find posterior probabilities that Obama beats McCain, Clinton beats McCain, or the probability that Obama would receive more electoral votes than Clinton.
While I was sitting around at lunch yesterday, I ran a very quick analysis using the reported SurveyUSA marginals. Essentially, I took samples from 50 independent Dirichlet posteriors for both hypothetical matchups, assuming a flat prior and multinomial sampling density (to allow for undecideds); to avoid dealing with the posterior predictive distributions, I'm just going to assume that all registered voters will vote so I can just compare posterior proportions. When you run this, you obtain estimates (conditional on the data and, most importantly, the model) that the probability of an Obama victory over McCain is about 88% and the probability of a Clinton victory is about 72%. There is a roughly 70% posterior probability that Obama would win more electoral votes than Clinton.
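Here is a rough Python sketch of this kind of calculation. The poll counts, states, and electoral votes below are invented for illustration, not the actual SurveyUSA numbers, and the model makes the same naive assumptions as above (flat Dirichlet priors, independent states, registered voters all vote):

```python
import random

# hypothetical (dem, mccain, undecided) counts out of 600 respondents,
# plus electoral votes per state -- illustrative numbers only
polls = {
    "OH": (288, 270, 42, 20),
    "VA": (276, 294, 30, 13),
    "CO": (300, 276, 24, 9),
}

def sample_dirichlet(alphas, rng):
    """One Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def prob_dem_ev_majority(polls, n_sims=5000, seed=0):
    """Posterior probability that the Democrat wins a majority of these
    electoral votes, under flat Dirichlet(1,1,1) priors per state."""
    rng = random.Random(seed)
    total_ev = sum(ev for *_, ev in polls.values())
    wins = 0
    for _ in range(n_sims):
        dem_ev = 0
        for dem, rep, und, ev in polls.values():
            p_dem, p_rep, _ = sample_dirichlet((1 + dem, 1 + rep, 1 + und), rng)
            if p_dem > p_rep:  # compare posterior proportions, ignoring undecideds
                dem_ev += ev
        if dem_ev > total_ev / 2:
            wins += 1
    return wins / n_sims
```

Running the same machinery for both hypothetical matchups, and comparing the two candidates' simulated electoral vote totals draw by draw, gives the head-to-head "electability" probabilities reported above.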
As I mentioned, this is an extremely naive Bayesian approach. There are a lot of ways that one could make the model better: adding additional sources of uncertainty, allowing for correlations between the states, using historical information to inform priors, and imposing a hierarchical structure to shrink outlying estimates toward the grand mean. One place to start would be by modeling the pairs of responses to the two hypothetical matchup questions. Any of these things, however, is going to be much easier to do in a Bayesian framework, since calculating posterior distributions of functions of the model parameters is extremely easy.
10 March 2008
This Wednesday we are excited to welcome Andy Eggers and Jens Hainmueller, Government Department, Harvard University, who will present "MPs for Sale? Estimating Returns to Office in Post-War British Politics". Andy and Jens provided the following abstract:
While the role of money in policymaking is a central question in political economy research, surprisingly little attention has been given to the rents politicians actually make from politics. Using an original dataset on the size of British politicians' estates, we find that gaining a seat in the House of Commons had a large effect on personal wealth: Conservative Party MPs died with almost twice as much money, on average, as very similar Parliamentary candidates who were defeated. We find no financial benefits for candidates from the Labour party. We argue that Conservative MPs profited from office in a lax regulatory environment by using their political positions to obtain outside work as directors, consultants, and lobbyists, both while in office and after retirement. Our results are consistent with anecdotal evidence on MPs' outside financial dealings but suggest that the magnitude of influence peddling was larger than has been
The paper is available here:
The applied statistics workshop meets in room N-354 in CGIS-Knafel, 1737 Cambridge st. The workshop begins at 12 noon with a light lunch, with presentations usually beginning around 1215.
Please contact me with any questions.
5 March 2008
The dramatic increase in cases of autism in children over the past few years has been in the news again in recent days. Most notably, presumptive Republican presidential nominee John McCain said at a recent stop, "there’s strong evidence that indicates that it’s got to do with a preservative in vaccines." Which would be fine if such strong evidence existed; unfortunately, that is a mischaracterization of the current state of the literature to say the least. McCain has since backed away from his initial comments (see this article in yesterday's New York Times), but the debate prompted by his comments will undoubtedly continue.
By coincidence, the Robert Wood Johnson program at Harvard is sponsoring a talk tomorrow on this topic. Professor Peter Bearman (chair of the Statistics Department at Columbia) will be speaking on "Early Thoughts on the Autism Epidemic." Professor Bearman is currently leading a project on the social determinants of autism. The talk is in N262 on the second floor of the Knafel Building at CGIS from 11:00 to 12:30.
3 March 2008
Please join us this Wednesday as we welcome Joseph Blitzstein, Department of Statistics, Harvard University, who will present 'In and Out of Network Sampling'.
Joe provided the following abstract for his talk:
In recent years it has become extremely common to need to work with network data, in applications such as the study of social networks, protein interaction networks, and the Internet. This has required the development of new generative models such as exponential random graph models and power law models. Yet it is usually prohibitively expensive to observe or work with the full network, so sampling within the network is generally required.

Various approaches to network sampling, such as respondent-driven sampling, have been proposed. But when will the generative model mesh well with the sampling scheme? This question is crucial for reliable inference about networks, yet it is seldom addressed and much remains unknown. We will discuss generating random networks and sampling within a network, and their interactions. Based on joint work with Ben Olding.
The workshop will begin at 12 noon with a light lunch and the presentation will begin at 1215. The workshop is held in room N354, CGIS-Knafel, 1737 Cambridge St.