30 November 2007
IQSS is sponsoring a conference next Friday on the emerging area of computational social science. Below is the announcement:
The Conference on Computational Social Science (part of the Eric M. Mindich Conference series)
Friday, December 7, 2007
Center for Government and International Studies South, Tsai Auditorium (Room S010)
1730 Cambridge Street, Cambridge, MA
The development of enormous computational power and the capacity to collect enormous amounts of data has proven transformational in a number of scientific fields. The emergence of a computational social science has been slower than in the sciences. However, the combination of the still exponentially increasing computational power with a massive increase in the capturing of data about human behavior makes the emergence of a field of computational social science desirable, but not inevitable. The creation of a field of computational social science poses enormous challenges, but offers enormous promise to achieve the public good. The hope is that we can produce an understanding of the global network on which many global problems exist: SARS and infectious disease, global warming, strife due to cultural collisions, and the livability of our cities. That is, can sensing our society lead to a sensible society?
To solve these problems will require trading off privacy versus convenience, individual freedom versus societal benefit, and our sense of individuality versus group identity. How will we decide what the sensible society will look like? This conference brings together the wide array of individuals who are working in this emerging research area to discuss how we might address these global challenges, and to evaluate the potential emergence of a field of "computational social science."
Registration is required; more information is available here.
28 November 2007
In political science, as in many other branches of social science, more attention is being paid to the genetic bases of political behavior (I won't say effects, because that opens a whole other barrel of worms). As I was looking around for an overview of some of the statistical issues involved, I came across a couple of blog posts by Cosma Shalizi at Carnegie Mellon that were both informative and amusing. An excerpt:
When we take our favorite population of organisms (e.g., last year's residents of the Morewood Gardens dorm at CMU), and measure the value of our favorite quantitative trait for each organism (e.g., their present zip code), we get a certain distribution of this trait:
(Note to our institutional review board: No undergraduates had their DNA sequenced in the writing of this essay.)
If we are limited to the tools of early 20th century statistics (in particular, if we are the great R. A. Fisher, and so simultaneously forging those tools while helping to found evolutionary genetics), we summarize the distribution with a mean and a variance. We can inquire as to where the variance in the population comes from. In particular, assuming the organisms are not all clones, it is reasonable to suppose that some of the variation goes along with differences in genes. The fraction of variance which does so is, roughly speaking, the "heritability" of the trait.
The most basic sort of analysis of variance (see also: Fisher) would make this conceptually simple, though practically unsuccessful. Simply take all the organisms in the population, and group them by their genotypes. For each group of genetically identical organisms, compute the average value of the trait. Compare the variance of these within-genotype averages (that is, the across-genotype variance) to the total population variance; this is the fraction of variation associated with genotypes. In most mammalian populations, where clones (identical twins, triplets, ...) are rare and every organism otherwise has a unique genotype, this would tell you that almost all of the variance of any trait is associated with genetic differences. On such an analysis, almost all of the variance in zip codes in my example would be "due to" genetic differences, and the same would be true of telephone numbers, social security numbers, etc.
To see why, look at my table again. With one exception (the twins who live in 15213 and 48104), in this population changing zip code means changing your genotype. The vast majority (81%) of the variance in zip codes is between genotypes, not within them. With real human data, a quarter of the people wouldn't be twins living apart, and the proportion of variance in zip codes "due to" genotype would be even higher.
Naively, then, on this analysis we would say that the "heritability" of zip code, the fraction of its variance which goes along with genetic variations, is 81%. It is crucial to be clear on what this means, which is merely and exactly this: in this population, if we take a random group of genetically identical people, the variance within that group should be 19% (=100-81) of the total variance in the population.
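The naive calculation described in the excerpt can be sketched in a few lines of Python. The table from the original post is not reproduced here, so the genotype labels and zip codes below are made-up stand-ins, and the resulting share will not match the 81% quoted above. The sketch groups people by genotype and splits the total variance of the trait into a between-genotype part and a within-genotype part.

```python
from collections import defaultdict

# Hypothetical (genotype, zip code) pairs; NOT the table from the excerpt.
# Genotype "A" stands for a pair of identical twins living apart.
people = [
    ("A", 15213), ("A", 48104),
    ("B", 15213), ("C", 94110), ("D", 10027),
    ("E", 60637), ("F", 2139), ("G", 15217),
]

def decompose_variance(rows):
    """Return (between, within, total) population variances of the trait."""
    values = [z for _, z in rows]
    n = len(values)
    grand_mean = sum(values) / n
    total = sum((z - grand_mean) ** 2 for z in values) / n

    # Group trait values by genotype.
    groups = defaultdict(list)
    for genotype, z in rows:
        groups[genotype].append(z)

    between = 0.0
    within = 0.0
    for zs in groups.values():
        group_mean = sum(zs) / len(zs)
        # Spread of the genotype averages around the grand mean.
        between += len(zs) * (group_mean - grand_mean) ** 2
        # Spread of individuals around their own genotype average.
        within += sum((z - group_mean) ** 2 for z in zs)
    return between / n, within / n, total

between, within, total = decompose_variance(people)
naive_heritability = between / total
print(f"between-genotype share of variance: {naive_heritability:.2f}")
print(f"within-genotype share of variance:  {within / total:.2f}")
```

Because almost every genotype here is unique, nearly all of the variance lands in the between-genotype part, which is exactly the point of the zip code example: the share says nothing about genes causing zip codes.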
26 November 2007
The Applied Statistics Workshop reconvenes this Wednesday, 11/28, for Esther Duflo, the Abdul Latif Jameel Professor of Poverty Alleviation and Development Economics, who will present her work on "Improving School Quality". Esther provided the following link describing the project:
As a reminder, the workshop begins with a light lunch at 12 noon. We are located in room N354, CGIS Knafel (1737 Cambridge St.).
21 November 2007
I have been studying the family trees of 20 successful African-Americans, people in fields ranging from entertainment and sports (Oprah Winfrey, the track star Jackie Joyner-Kersee) to space travel and medicine (the astronaut Mae Jemison and Ben Carson, a pediatric neurosurgeon). And I’ve seen an astonishing pattern: 15 of the 20 descend from at least one line of former slaves who managed to obtain property by 1920 — a time when only 25 percent of all African-American families owned property.
The question is, how astonishing is the pattern Gates points out?
Whether we should be impressed that 15 of 20 successful African-Americans had landowning ancestors depends a lot on what we assume about patterns of intermarriage among landowning and non-landowning African-Americans. Let's consider two extreme possibilities. First, assume that landowners never married non-landowners. In that case, one's grandparents would have been either all landowners or all non-landowners (more precisely, all from landowning families or all from non-landowning families). Given that 25% of all African-American families were landowners in 1920, 25% of African Americans of Oprah's generation would have had landowning grandparents (assuming that landowners and non-landowners had equally sized families, and also assuming that African-Americans had only African-American grandparents). The fact that 75% of Gates's successful African-Americans had landowning grandparents would indeed be remarkable, supporting his claim that having landowning ancestors helped them succeed.
At the other extreme, assume that landowners intermarried perfectly freely with non-landowners. In that case, there is an almost 70% chance that a randomly-selected person of Oprah's generation would have at least one landowning grandparent. (To see this: given that 3/4 of all grandparents were not landowners, the probability that a randomly selected person would have no landowning grandparents is (3/4)^4 = 31.6%.) If that were true, we would be quite likely to see as many as 15 people with landowning ancestors in Gates's sample even if landowning had nothing to do with success (p-value = .19). The pattern he observes would thus provide only very weak evidence for his argument about the importance of landowning.
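The arithmetic in the two extreme cases can be checked with a short binomial calculation. This is a minimal sketch under the assumptions stated in the text, plus two of my own: that each of the four grandparents is independently a landowner with probability 1/4, and that the 20 people in the sample are independent draws. The function name `upper_tail` is just an illustrative helper.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

n, observed = 20, 15
p_owner = 0.25  # share of landowning families in 1920, from the quote

# Extreme 1: no intermarriage, so a person has landowning grandparents
# with the same probability as the population share.
p_no_mixing = p_owner

# Extreme 2: free intermarriage, so a person has at least one landowning
# grandparent unless all four are non-landowners.
p_free_mixing = 1 - (1 - p_owner) ** 4

for label, p in [("no intermarriage", p_no_mixing),
                 ("free intermarriage", p_free_mixing)]:
    print(f"{label}: P(ancestor) = {p:.3f}, "
          f"P(X >= {observed}) = {upper_tail(n, observed, p):.6f}, "
          f"P(X > {observed}) = {upper_tail(n, observed + 1, p):.6f}")
```

Under these assumptions the first case gives a vanishingly small tail probability. In the second case the strict tail P(X > 15) comes out near .19, which matches the figure quoted above, while the inclusive tail P(X >= 15) is larger, roughly .36. Either way the qualitative conclusion is the same: with free intermarriage, 15 of 20 is not a surprising count.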
Gates's book appears typical of a genre of case study that focuses on remarkable people (or companies, or countries) in order to determine how they became that way. The problem with these studies (as pointed out in KKV, among many other places) is that they assume too much about the characteristics of unremarkable people (or companies, or countries). In the above example, Gates implicitly assumes that far fewer than 75% of unsuccessful African-Americans had a landowning ancestor in 1920. Instead of relying on this (possibly erroneous) assumption, he could have explicitly compared the family histories of his sample of remarkable African-Americans with those of another sample of unremarkable African-Americans. (This research design is known as "case-control" in epidemiology.) I doubt it would help with book sales to include a few chapters about thoroughly unfamous people, but it would make his arguments more convincing.
19 November 2007
There was a good non-technical article by Adam Liptak in the New York Times this weekend reviewing the renewed debate about the supposed deterrent effect of capital punishment (The web version of the article linked to seven different academic articles; many thanks to the editorial staff). I've blogged about this before (here) and tend to agree with those who say that there just isn't enough information in the data. In that context, I particularly liked the quote from Justin Wolfers at the end of the article:
Professor Wolfers said the answer to the question of whether the death penalty deterred was “not unknowable in the abstract,” given enough data.
“If I was allowed 1,000 executions and 1,000 exonerations, and I was allowed to do it in a random, focused way,” he said, “I could probably give you an answer.”
16 November 2007
In general, my impression is that cutting-edge research in social science rarely makes the leap from academic interest to media coverage to popular culture, but there are always some studies that capture the public's attention. One such study was the recent article by Nicholas Christakis from Harvard and James Fowler from UCSD ("The Spread of Obesity in a Large Social Network over 32 Years"). They find evidence that clusters of obese individuals are present and that they do not appear to be driven entirely by selection effects. This received widespread media attention, and was picked up by the writers of Boston Legal. At the end of this promo, we see how the character Denny Crane (played by William Shatner, a man who does not appear to push away from the table all that often himself) interprets the results of this study:
15 November 2007
From Andrew Gelman, I saw a link to an interesting "art exhibit" that's actually all about statistics and language. In some ways it reminded me of this other art exhibit that's actually all about statistics -- in this case, the meaning of some of the very large numbers we read about all the time, but find difficult to grasp on an intuitive level.
Both are worth checking out online. And if you live somewhere that you can visit either, lucky you!
14 November 2007
There was an interesting article this weekend in the Washington Post reviewing research on the relationship between the time at which adolescents become sexually active and subsequent anti-social behaviors ("Study debunks theory on teen sex, delinquency", Nov. 11, 2007). To sum up, existing research shows a strong and stable correlation between early "sexual debut" and delinquency later in life. This has often been interpreted as a causal relationship by policy advocates, despite the obvious potential for confounding. It seems clear that unobserved characteristics - thrill-seeking, risk-taking preferences (or even a simple lack of adult supervision) - would encourage both early sexual activity and delinquency.
The WaPo article contrasts these existing results with a new study by researchers at the University of Virginia, who look at differences in the timing of sexual debut and delinquency among pairs of twins. As the Post reports, "Other things being equal, a more probing study has found, youngsters who have consensual sex in their early-teen or even preteen years are, if anything, less likely to engage in delinquent behavior later on." This is a fairly accurate and measured interpretation of the results of the paper, which is worth commending since we often give the media a hard time on this blog for over-selling the results of scientific papers. (Now if we can just get them to link to the papers from their website, as the New York Times does fairly regularly.)
The authors of the twin study are somewhat more ambitious in their claims. Here is the abstract to the paper:
Rethinking Timing of First Sex and Delinquency
K. Paige Harden, Jane Mendle, Jennifer E. Hill, Eric Turkheimer and Robert E. Emery
(1) Department of Psychology, University of Virginia, Charlottesville, VA 22904-4400, USA
Abstract The relation between timing of first sex and later delinquency was examined using a genetically informed sample of 534 same-sex twin pairs from the National Longitudinal Study of Adolescent Health, who were assessed at three time points over a 7-year interval. Genetic and environmental differences between families were found to account for the association between earlier age at first sex and increases in delinquency. After controlling for these genetic and environmental confounds using a quasi-experimental design, earlier age at first sex predicted lower levels of delinquency in early adulthood. The current study is contrasted with previous research with non-genetically informative samples, including Armour and Haynie (2007, Journal of Youth and Adolescence, 36, 141–152). Results suggest a more nuanced perspective on the meaning and consequences of adolescent sexuality than is commonly put forth in the literature.
The current study suggests that there may be positive functions for early initiation of sexual activity, in that the co-twin with earlier age at first sex demonstrated lower levels of delinquency in early adulthood.
Twin studies have been quite influential in a number of areas, and they have many benefits; they allow for balance on genetic and common environmental characteristics that would be exceedingly difficult to achieve in a typical observational study. Moreover, studies comparing identical and fraternal twins at least offer the possibility of teasing out the effects of genetic and environmental factors. At the same time, as twin studies move from biomedical to behavioral questions, there are some issues that deserve further consideration.
The first of these problems is selection within the sets of twins. If there is one thing that we know, it is that sex involves selection. Moreover, not to put too fine a point on it, but one of the parties involved in that selection process is choosing between twins (in many cases, identical twins!). The fact that the non-twin partner chose one twin over the other suggests that unobserved differences between the twins play an important role. The authors allude to this problem and describe their results as "quasi-causal", but they may be underestimating the importance of these "uncontrolled confounds" given the non-random character of the assignment process. Focusing on twins to achieve balance on genetic and shared environmental characteristics may end up increasing the overall bias of the estimates by increasing the imbalance in the unobserved unit-specific characteristics.
The second, and in my opinion more interesting, problem with the study is that it doesn't take into account the interaction within each set of twins. In effect, the researchers are conflating two treatments: the timing of each subject's sexual debut and the timing of the sexual debut for each subject's twin. This suggests an interference problem, because the delinquency outcome for subjects may depend on whether they became active before or after their twin did. One could easily imagine a scenario in which one twin becomes active and the other twin responds by acting out due to frustrations of one sort or another. In that case, it isn't so much that earlier sexual activity has a "positive function" for the twin engaging in it but rather a "negative function" for the twin that is not active; this is something that the data cannot answer.
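To make the interference point concrete, here is a toy simulation. It is purely illustrative: the data-generating process, the effect sizes, and the function name `mean_within_pair_gap` are all invented for this sketch and have nothing to do with the Add Health data or the authors' model. In each pair exactly one twin is the early one, and the outcome is a delinquency score made of a shared family component, an optional own effect of being early, an optional spillover onto the later twin, and noise.

```python
import random

def mean_within_pair_gap(n_pairs=5000, own_effect=0.0, spillover=0.0, seed=0):
    """Average (early twin - later twin) delinquency gap across simulated pairs."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_pairs):
        # Shared genes and family environment; cancels in the difference.
        family = rng.gauss(0, 1)
        # Early twin: own effect of early debut plus individual noise.
        early = family + own_effect + rng.gauss(0, 1)
        # Later twin: spillover from the co-twin's behavior plus noise.
        late = family + spillover + rng.gauss(0, 1)
        total_gap += early - late
    return total_gap / n_pairs

# Scenario A: early debut lowers the early twin's own delinquency.
gap_own = mean_within_pair_gap(own_effect=-1.0, spillover=0.0)
# Scenario B: no own effect, but the later twin acts out in response.
gap_spill = mean_within_pair_gap(own_effect=0.0, spillover=1.0)

print(f"own-effect scenario: mean gap = {gap_own:.2f}")
print(f"spillover scenario:  mean gap = {gap_spill:.2f}")
```

Both scenarios produce about the same negative within-pair gap, so a comparison of co-twins alone cannot tell a "positive function" for the early twin apart from a "negative function" for the later one.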
I think that this is a general problem when using twin studies to estimate the effects of behavioral treatments. Some treatments will have a greater effect on the untreated twin than others, and my guess is the more the treatment is in the realm of social science, the more we should worry about these issues. At the very least, we should be skeptical about how the estimates obtained from twin studies would generalize to the population at large given the inherent interference problems.
13 November 2007
The Applied Statistics Workshop returns tomorrow (11/14) with Chris Paciorek, Department of Biostatistics in the School of Public Health, presenting his work on 'Spatial scale and bias in regression models with spatial confounding'. Chris provided the following abstract for his talk:
When unmeasured confounders vary spatially, a common technique in regression modeling, including spatial epidemiology applications, is to try to account for the unmeasured confounding by modeling residual spatial correlation. The intuition is that modeling the spatial structure will remove large scale variation and allow one to estimate the effect of the covariate of interest based on variation in the outcome isolated at smaller scales. Previous work in the temporal setting indicates that when the variable of interest has an uncorrelated component then such an approach can minimize bias. Here I consider the situation that the variable of interest varies at multiple spatial scales but may not have a non-spatial component. I develop a framework for understanding bias using a simple generalized least squares model with data collected at point locations and fixed and known spatial scales. I show that bias is substantial even when the scales are known, unless the variable of interest has an unconfounded component that varies at a finer spatial scale than the confounder. Using simulation I consider the effect of estimating the scale of the residual spatial correlation on bias, showing that bias is similar when variance and scale parameters are estimated to when they are known. I discuss extensions to data aggregated into areal units and to the setting of measurement error in the covariate of interest.
As always, the workshop will convene at 12 noon, in room N-354, CGIS-Knafel. And a light lunch will be served.
Hope you all can make it!
8 November 2007
Are scatterplots confusing? Turns out the graphics people at the New York Times, who I think have been putting out some outstanding work in the past few years, think so.
Matthew Ericson, Deputy Graphics Editor at the NYT, gave a talk recently at the Infovis conference in which he described some of the techniques his staff uses to communicate information to readers. I wasn't there, but I looked through his slides (70 MB zip file), which provide both highlights from the NYT's recent graphics and some indication of the process by which they arrive at a final product. Particularly interesting is the set of slides from 35-62, in which he shows how they developed a graphic depicting partisan shifts between the 2004 and 2006 Congressional elections. Early in the sequence (at page 38), you see a draft of what seems like an adequate approach -- a scatterplot depicting 2004 vote margin vs 2006 vote margin. It turns out (I'm basing this on Fernanda Viegas' description of the talk on the infosthetics blog) that the NYT graphics staff has found that lay readers don't really understand scatterplots, in part because they are so used to seeing time on the x-axis. So Ericson and his staff went back to the drawing board and developed something different (shown on page 61 of the slides; you can also see a one-page PDF here or by clicking on the thumbnail below). Their new graphic orders the districts vertically by their 2006 vote share and shows the vote outcome on the horizontal axis, depicting the 2004-2006 shift by a horizontal arrow originating at the 2004 vote margin and ending at the 2006 margin. This approach conveys the information much less compactly (for one thing, all of the information in the y-axis is also in the x-axis) but communicates the partisan shift in a more intuitive way, while also giving a better sense of the distribution of partisanship across districts than did the scatterplot. Even though I'm used to seeing scatterplots, I think I get a lot more out of this figure, especially with the extra summary stats they are able to depict by continuing on the theme of horizontal arrows depicting 2004-2006 shifts.
6 November 2007
This semester I am taking a hands-on (gasp!) class on the "Design and Analysis of Sample Surveys" with Alan Zaslavsky. The design part of the course includes the basics of writing surveys, and the background reading includes a text by Fowler* which might be interesting to applied-minded readers. Here are some thoughts on the book; I'd be curious to hear about alternative views or materials.
Fowler offers a quick and informative read on how to ask about objective and subjective states, and how to pre-test and validate survey questions and answer categories. The book also discusses the design implications of different survey modes. Most items are particularly informative to novices in this area, and often they provoke a "d'oh, obviously" reaction. But Fowler does a good job of alerting the reader to problematic examples that might have slipped by. He also offers some advice on how to fix problems, and provides practical tips for implementing pre-tests, which he strongly advocates. The chapters end with a useful summary of the key points, which can serve as a reference to items in the chapter.
The book has a few shortcomings, though, notably its somewhat confusing organization within chapters and its wordiness. Since some issues cut across chapters, the index ought to list more than 50 keywords to be useful as a reference. And, having been published in 1995, the book provides no background on web-based or email surveys.
I found that the book offers basic insights and is a useful introduction. It certainly raises awareness of the issues in survey design that users should know about. Designers of surveys might find the treatment too basic and general. For a more detailed treatment, Krosnick and Fabrigar's forthcoming "Handbook of Questionnaire Design" looks promising (see here for a post on its presentation at IQSS in 2006).
The Applied Statistics Workshop will take a one-week hiatus this week (11/7). But be sure to join us next week (11/14) for Chris Paciorek, Department of Biostatistics, who will present 'Spatial scale and bias in regression models with spatial confounding'.
Hope that you all can make it next week!
1 November 2007
I often share the mixed feelings about media coverage of scientific papers that Amy discussed in her post yesterday on the statistics of race. Apparently we aren't the only ones; Mark Liberman at Language Log linked to yesterday's Dilbert cartoon:
Language Log is one of my favorite blogs, and many of the posts there are relevant for those of us reporting our own statistical results and trying to promote better coverage in the media. Some of my favorites:
I know that I've committed some of these sins myself; in fact, I think I need to go reinterpret some odds ratios...