27 January 2010
Joe Blitzstein, ever-popular professor of Statistics 110, is offering a seminar course this term. Statistics 340: Random Network Models will meet on Tuesdays (starting February 2nd) from 2pm to 4pm in Science Center 706. Joe invites all interested to come check it out.
22 January 2010
Unemployment remains at 10 percent, according to today's Bureau of Labor Statistics' news release. In light of ongoing economic troubles, Goldman Sachs is "trimming" its bonus pool to only $16.2 billion, for an average bonus of almost $500,000 per employee according to the New York Times. It is worth remembering that such extreme displays of economic inequality have not been permanent features of the U.S. economy. Rather, inequality has grown substantially over the last four decades. Watch it spread from the South across the country and intensify everywhere in the figure below.
Check out these introductory slides and R code for great tips on spatial statistics, including information on how to create a figure like the one shown above and much, much more!
21 January 2010
Political scientists David Brady and Doug Rivers, along with business and law professor Daniel Kessler wrote an op-ed for the WSJ arguing that the health care bill is hurting the Democrats. Their evidence is that states with lower support for the bill also have lower support for incumbent Democratic senatorial candidates:
Health reform is more popular in some of these states than in others. Where it's popular, Democratic candidates don't have too much of a problem, but where it's unpopular--and that includes most states--the Democratic Senate candidates are fighting an uphill battle. Support for health reform varies in these 11 states from a low of 33% in North Dakota to a high of 48% in Nevada. Democrats trail Republicans in six of the states; three are toss-ups; and in two, Democrats have a solid lead.

I hate to fill any kind of institutional stereotype, but the causal reasoning here leaves much to be desired. The argument of the essay is that BECAUSE of health care, Democrats are doing worse in the polls. On this question, obviously, we have no data: this is why speculation is running rampant. The counterfactual would be: what would have happened to Democratic senatorial candidates if there had been no (or a substantially smaller) health care bill? Pundits can hardly type fast enough to get answers to this question out right now. Certainly, though, a correlation of support for health care and support for Democrats will not provide the answer (since, you know, there is no variation on the treatment--all states are in the health care reform world).
Despite the general tone of the piece ("The culprit is the unpopularity of health reform..."), I believe the authors are making a different argument: namely, that voters are responding to their senator's vote on health care. Based on their evidence, however, I think this is a flawed argument as well.
Confounding is an obvious problem here. There are many factors that could influence opinions on health care and the Democrats (ideology, economic performance, etc). The authors clearly consider possible problems of confounding:
How do we know that it's the health-reform bill that's to blame for the low poll numbers for Democratic Senate candidates and not just that these are more conservative states?
First, we asked voters how their incumbent senator voted on the health-care bill that passed on Christmas Eve. About two-thirds answered correctly. Even now, long before Senate campaigns have intensified, voters know where the candidates stand on health care. And second, we asked voters about their preference for Democrat versus Republican candidates in a generic House race. As in the Senate, the higher the level of opposition to health reform, the greater the likelihood that the state's voters supported Republicans.
It might be the case that voters are punishing known health care supporters! But, again, I am not sure that these polls show this. The Senate vote was party-line. If someone knew their senator's party, then they could infer their vote without actually knowing it. They could simply know that Democrats are trying to reform health care and their senator is a Democrat. Under this scenario, the actual vote of the senator would make no difference since our hypothetical voter equates Democrats with health care reform.
Put it this way: do you think that House Democrats that voted against the bill are going to have easy reelection campaigns? That seems like the real test of this hypothesis.
A simple gut check would be to run the same analysis with the stimulus instead of health care. I imagine you would get similar results. The point is that the advice from this article for Democrats--withdraw support for health care reform--is not supported by the data.
UPDATE: Brendan Nyhan over at Pollster makes essentially the same argument.
My colleague, Brandon Stewart, pointed me to a neat webpage, manyeyes.alphaworks.ibm.com, an IBM-developed web site that allows you to upload data quickly and visualize it using a variety of techniques.
Many Eyes lets you use textual data, so I just tried it out using the majority and dissenting opinions from Citizens United v. FEC, today's Supreme Court decision striking down existing campaign finance law. (Note: Let's just say it's not a bad idea to use publicly available, non-copyrighted data.)
The resulting visualizations are just terrific, and they actually go far in illustrating the substantive differences between the conservative and liberal Justices on the campaign finance issue.
The first figure represents the majority opinion (written by Justice Kennedy, a moderate-conservative), with the larger words representing phrases used most frequently in the course of the opinion. Obviously, what we see is a strong consideration of "speech" interests -- no doubt discussed in the context of First Amendment issues.
By contrast, take a look at the dissenting/concurring opinion (written by Justice Stevens, a liberal). The most frequently used words here are "corporate," "corporation," "corruption," etc. The actual phrase "speech" is much less frequent, suggesting that the liberal Justices were more concerned with corporations influencing elections than free speech issues.
It's amazing how much information we can glean from these visualizations, even without having perused either opinion. If anybody has thoughts on this, I'd be keen to hear them.
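The counting behind a tag cloud like Many Eyes' is simple term frequency. Here is a minimal sketch; the two text snippets are invented placeholders standing in for the full opinions, and the stopword list is my own, not Many Eyes' actual preprocessing:

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list (Many Eyes uses its own).
STOPWORDS = {"the", "of", "to", "and", "a", "in", "that", "is", "it", "by"}

def term_frequencies(text, top_n=10):
    """Tokenize, drop stopwords, and count word frequencies --
    the computation underlying a tag-cloud visualization."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)

# Placeholder snippets standing in for the two opinions:
majority = "speech is protected speech and the first amendment protects speech"
dissent = "corporate spending and corporate corruption by the corporation"

print(term_frequencies(majority, 3))
print(term_frequencies(dissent, 3))
```

Scaling each word in the cloud proportionally to its count is then just a rendering step on top of these frequencies.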
19 January 2010
I'm into biking (mostly road-biking these days) so I was interested to read a post on the New York Times' "Freakonomics" blog about a study that uses variation in bike helmet laws across US states to show that helmet laws decrease bike riding among kids and teens. Since I think that most people should ride bikes most of the time AND I have been known to bug people to wear helmets, perhaps I've been working against myself.
A few things came to mind while reading the study. First, the study shows that helmet laws have an effect on bike safety for kids in the same age ranges. Unless I missed something, it seems like part of this effect could be due to fewer kids riding bikes (in addition to the obvious safety improvement that comes from actually wearing a helmet). I'd be curious how much the decrease in bike use is influencing the increase in safety, especially if kids are simply deciding to do other things like skateboarding that are perhaps equally dangerous but don't require helmets (a possibility mentioned by the authors). This may mean that the total effect of helmet laws on child safety is less than the effect estimated in the paper because some of the decreases in bike injuries are counter-balanced by increases in other types of injury that aren't part of the study.
Second, the authors use some fixed effects and diff-in-diff models, but I think this paper is calling out for the synthetic control method developed by Abadie, Diamond, and Hainmueller. The policy intervention is clean and there are a reasonable number of states that don't have laws, so building synthetic matches might be feasible. There might be some interference problems with states that pass helmet laws later, but those are details...
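To make the synthetic control idea concrete, here is a toy sketch with two donor states and invented pre-treatment ridership numbers (real applications use the authors' Synth software and many more donors and predictors). The defining restriction is that the donor weights are nonnegative and sum to one:

```python
# Hypothetical pre-treatment bike-ridership series (made-up numbers):
treated_pre = [10.0, 11.0, 12.0]          # state that passed a helmet law
donors_pre = [[8.0, 9.0, 10.0],           # donor state A
              [14.0, 15.0, 16.0]]         # donor state B

def synth_weights(treated, donors, steps=1000):
    """Grid-search the weight w on donor A (1 - w on donor B) that
    minimizes pre-treatment mean squared error. The simplex
    constraint (w >= 0, weights sum to 1) is the synthetic-control
    restriction."""
    best_w, best_mse = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        mse = sum((t - (w * a + (1 - w) * b)) ** 2
                  for t, a, b in zip(treated, donors[0], donors[1])) / len(treated)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w

w = synth_weights(treated_pre, donors_pre)
synthetic = [w * a + (1 - w) * b for a, b in zip(*donors_pre)]
```

The treatment effect estimate is then the post-law gap between the treated state's actual outcome and this synthetic counterfactual.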
I'll end this post with a shameless plug: bike more! (and wear a helmet)
18 January 2010
I read in this morning's New York Times about research being conducted by two sociologists, Neil Gross (British Columbia) and Ethan Fosse (Harvard), on why academics tend to be left of center. That professors are more liberal than non-academics is a pretty well-known fact; at the same time, we don't have a good idea as to why this is. Previous research on this point has largely relied on anecdotal or qualitative techniques, so Gross and Fosse's paper, which relies on survey data, looks promising. A copy of the working paper is here.
The paper uses data from the General Social Survey pooled over time (1974-2008, n = 325, once observations with missing outcomes were removed), where the dependent variable is a respondent's self-described ideological orientation on a seven-point scale.
The technique the authors use to test various hypotheses explaining the ideology gap is the Blinder-Oaxaca decomposition, which was developed by labor economists to estimate the role of individual predictors in observed divisions. For example, you could use the technique to try to gain traction on the factors that drive the wage gap between men and women. In this case, the authors used the technique to figure out the role of different variables -- religion, parents' education, tolerance, verbal skills, having lived in a foreign country, etc. -- in the ideological gap between professorial and non-professorial populations. (Note: In the interest of full disclosure, this technique is new to me.)
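Since the technique was new to me too, here is a minimal one-covariate sketch of the Blinder-Oaxaca logic with invented data (schooling predicting a 7-point liberalism score); the real decomposition runs over many covariates at once:

```python
def ols(x, y):
    """Simple one-predictor OLS: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

# Invented data: x = years of schooling, y = liberalism (7-point scale)
x_prof, y_prof = [18, 20, 19, 21], [5.0, 5.5, 5.2, 5.9]   # professors
x_pub,  y_pub  = [12, 14, 13, 16], [3.8, 4.2, 4.0, 4.6]   # general public

a1, b1 = ols(x_prof, y_prof)
a0, b0 = ols(x_pub, y_pub)
mx1 = sum(x_prof) / len(x_prof)
mx0 = sum(x_pub) / len(x_pub)
gap = sum(y_prof) / len(y_prof) - sum(y_pub) / len(y_pub)

# Blinder-Oaxaca split: the part of the gap "explained" by the groups'
# different mean schooling (valued at the public's coefficient), plus
# an "unexplained" part from differing intercepts and slopes.
explained = b0 * (mx1 - mx0)
unexplained = (a1 - a0) + (b1 - b0) * mx1
```

Because OLS fits pass through the group means, the explained and unexplained pieces sum exactly to the raw gap; Gross and Fosse's contribution is applying this accounting to the professor/non-professor ideology gap across many predictors.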
In terms of the ideology gap, it appears that having a graduate degree, being generally tolerant of people different than yourself, and lacking a strong religious affiliation are some of the factors most strongly explaining ideological differences between academics and non-academics. In fact, these variables explain roughly half of the observed gap. (Of course, this does not rule out that some confounder could explain all of these factors, as well as self-described ideology.)
Gross & Fosse go on to posit that a professorial career has developed over time a liberal reputation so that liberal people are more likely to be drawn to it. Their results do not seem to provide direct support for this, a fact they acknowledge; nonetheless, their research is interesting and is drawing some attention.
My take-away is that there still seems to be a lot of room for quantitative research on this navel-gazing question. If people have thoughts, I'd be keen to hear them.
15 January 2010
Complementing Matt's post about TV-watching patterns below, here's an interesting article from The Guardian about how three British computer scientists are using content analysis techniques to parse out what makes (or does not make) a hit TV script.
9 January 2010
The New York Times has put together an awesome data visualization on the geography of Netflix. For each zip code they have the top 50 rentals of 2009 and they use these ranks to draw heat maps for each movie. There are all kinds of interesting patterns that point to both how preferences cluster and how information spreads. My favorite two maps are the following, which I reference after the jump (darker colors indicate more rentals in that area):
Mad Men, Season 1 Disc 1:
Paul Blart, Mall Cop:
First, an Oscar nomination seems to put you at the top of everyone's list, regardless of geography. Thus, Slumdog Millionaire, Benjamin Button, Gran Torino, and Doubt all have high ranks. Second, box-office blockbusters do fairly poorly across the board, seemingly because most people saw those movies in the theaters (Wall-E, Dark Knight, etc).
Finally, the remaining movies show a great deal of geographic variation. There is a fairly pronounced difference between urban centers and the suburbs. Unsurprisingly, movies that have high Metacritic scores do very well in the urban centers, whereas they seem absent from the outlying areas. Conversely, movies that critics consider terrible (and are usually marketed toward teenagers) mostly ship to the suburbs.

You can see this in the stark difference between the critically acclaimed TV show Mad Men and the slapstick comedy Paul Blart: Mall Cop (full disclosure: Mad Men was on my queue last year, Paul Blart was not, and I live in Cambridge/Somerville).
The other obvious divide that arises is race. Tyler Perry's two movies were only on the top 50 lists for a handful of neighborhoods that are predominantly African-American. In Boston, for example, the movies cluster heavily in Dorchester and Mattapan.
Tyler Perry's The Family That Preys:
How people form preferences is one of my favorite subjects and I love visualizations like these. My instinct is that there is a lot of preference clustering happening, based largely on age, class and, to a lesser extent, race. But above and beyond this, I imagine the information networks vary by geography--urbanites may hear about movies from certain blogs, while folks in the suburbs (who probably have more children and teens) might rely more on national TV advertisements. The Oscars tend to cross geographic and social lines because they are a widely-visible, low-cost indicator of movie quality. All of this points to a key fact: how information gets into and flows through our social network(s) is an important aspect of how our preferences come to be.
Also, this is begging for someone to put together a list of "Democrat" movies and "Republican" movies based on party affiliation in each zip code.
Simon Jackman puts together a plot of how ideal point estimates for the 111th U.S. Senate change as he adds each roll call. Every Senator starts the term at 0 and then branches out. It illustrates an interesting feature of these IRT models:
The other thing is that there doesn't seem to be any obvious "vote 1" update for ideal points. That is, there is no simple mapping from ideal point estimates based on m roll calls to ideal point estimates based on m+1 roll calls. You have to start the fitting algorithm from scratch each time (and hence the appeal of exploiting multiple cores etc), although the results from the previous run give pretty good start values.
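A toy version of such a one-dimensional item-response model makes the refitting point concrete: each new roll call introduces new item parameters that are coupled to every legislator's ideal point, so the whole likelihood has to be re-maximized. This sketch uses invented roll-call data for six legislators in two blocs and plain gradient ascent rather than the MCMC or EM machinery used in practice:

```python
import math

# Toy roll-call matrix: rows = 6 legislators, columns = 5 votes.
# The first three legislators mostly oppose; the last three mostly support.
votes = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
]

n, m = len(votes), len(votes[0])
x = [0.0] * n   # ideal points (all start at 0, as in Jackman's plot)
a = [1.0] * m   # item discriminations
b = [0.0] * m   # item difficulties

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Joint gradient ascent on the logistic log-likelihood with
# P(vote_ij = 1) = sigmoid(a_j * x_i - b_j).
lr = 0.05
for _ in range(500):
    for i in range(n):
        for j in range(m):
            p = sigmoid(a[j] * x[i] - b[j])
            err = votes[i][j] - p
            x[i] += lr * err * a[j]
            a[j] += lr * err * x[i]
            b[j] -= lr * err

# Identification: center the ideal points (scale and direction are
# pinned down here only by the starting values).
mean_x = sum(x) / n
x = [xi - mean_x for xi in x]
```

Even in this toy, appending a sixth roll call would add a new (a_j, b_j) pair whose estimates feed back into every x_i, which is exactly why there is no cheap "one more vote" update.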
8 January 2010
When comparing how different groups fare on a particular measure (for example, the life expectancy of immigrants versus native born individuals or the wages of workers in 1950 versus 2000), we often focus on the difference in the averages of the two distributions. Sometimes we also examine disparity in distributional spreads, inquiring whether one group's outcomes are more variable than the other's. Of course, summarizing distributions with one or two parameters discards a lot of potentially useful information. Enter Relative Distribution Methods in the Social Sciences, a clever book by Mark Handcock and Martina Morris. In what follows, I explore the basic insight of the book and test out some techniques myself (with graphs!).
Handcock and Morris present a neat way to compare the whole distributions of a reference group and a comparison group by asking, "at what quantile on the reference group's distribution would someone from the comparison group fall?" If the distributions are the same, we expect this "relative data" to be uniformly distributed. Deviations from uniformity provide insights into group differences.
A bit of formality helps explain the utility of this framework. Let Y_0 and Y be random variables representing the measurement of interest in the reference population and comparison population, respectively, with CDFs F_0(y) and F(y) and PDFs f_0(y) and f(y). Handcock and Morris define a new random variable R (the "relative data") by R = F_0(Y). The CDF of R is F(F_0^{-1}(r)) [readers may recognize this as a probability-probability plot] and the PDF is f(F_0^{-1}(r)) / f_0(F_0^{-1}(r)) for 0 < r < 1. Notice that this PDF, the "relative density," is simply a ratio of densities, each evaluated at a given quantile of the reference distribution. Because of the transformation from the original variable scale using the quantile function F_0^{-1}(r), the relative density is a valid PDF, integrating to 1. Researchers can thus use the random variable R and its "relative distribution" for inference.
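Computing the relative data is straightforward with the empirical CDF. Here is a minimal sketch with invented test scores (the shifted comparison group makes the relative data pile up near 1, exactly the kind of tail behavior the method is built to reveal):

```python
import bisect

def relative_data(reference, comparison):
    """Evaluate the reference group's empirical CDF at each
    comparison-group value: r_i = F0(y_i). If the two distributions
    coincide, these values are approximately uniform on (0, 1)."""
    sorted_ref = sorted(reference)
    n = len(sorted_ref)
    return [bisect.bisect_right(sorted_ref, y) / n for y in comparison]

# Invented scores: the comparison group is shifted upward, so its
# relative data should concentrate in the upper part of (0, 1).
reference = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
comparison = [55, 60, 70, 75, 85, 90]

r = relative_data(reference, comparison)
mean_r = sum(r) / len(r)
```

A histogram of r against the flat uniform density is then the "relative distribution graph"; Handcock and Morris's reldist package for R does this (plus smoothing and confidence bands) properly.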
I have been interested in trying out this technique for a while, and after listening to a recent podcast on why there are relatively few female scientists and then stumbling onto some old discussion on Andrew Gelman's blog on a similar topic, I figured comparing male and female test scores on a standardized math exam might be a good test case. Studies have shown that on average male math test scores are higher than female scores and they are also more variable. Can relative distribution methods provide any extra insight? In this example, I use data from Project Talent. The sample represents the population of US high school students in 1960. The data are out of date; I use them for illustrative purposes only. As expected, overlaid densities show that male high school students' math scores are on average higher as well as more variable than female students' scores.
The relative distribution graph below combines the information in the two curves into a single line. In our sample, women are more likely than men to be in the lower tail (about 1.5 times as likely to achieve the lowest score), about equally likely to be in the middle, and less likely than men to be amongst the highest scorers. Because the relative data have a valid density function, we can also examine 95% confidence intervals. These intervals show that the differences we observe in our sample are only statistically significant in the upper tail, with female test takers about half as likely as their male counterparts to earn the highest scores. This finding is substantively useful, suggesting that average differences are driven by differences in very high math test-taking abilities.
We can decompose the overall relative distribution into differences in the locations and shapes of the male and female math score distributions. The figure below suggests that any differences between men's and women's math scores at the low end of the distribution are due to men's higher median score, while gender differences among high scorers are driven by the greater spread in men's scores. (If all distributional differences were due to shape differences, the second panel of the figure would show a horizontal line at 1 and the third panel would look just like the first.)
There are several neat extensions of relative distribution methods, including covariate adjustments and nonparametric summary measures of distributional divergence. I haven't seen these techniques applied frequently, but they seem useful to me. Let me know what you think.
5 January 2010
How do we learn about causal relationships when we can't run experiments? In my own work, the answer has been to look around for "natural experiments" in which something important varies for roughly random reasons: for example, the winners of close elections are selected almost at random, which allows you to draw conclusions about the effect of being elected on various outcomes (like the winner's wealth).
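The close-election design boils down to comparing bare winners with bare losers near the 50% cutoff. Here is a toy simulation of that comparison (all numbers invented, a fixed seed, and a naive difference-in-means within a bandwidth rather than the local polynomial fits used in serious applications):

```python
import random

random.seed(0)

# Simulate races: the focal candidate's vote share, and a later
# outcome (say, log wealth) that jumps by 1.0 for winners.
TRUE_EFFECT = 1.0
races = []
for _ in range(5000):
    share = random.uniform(0.3, 0.7)
    won = share > 0.5
    outcome = 2.0 * share + TRUE_EFFECT * won + random.gauss(0, 0.5)
    races.append((share, outcome))

# RD estimate: difference in mean outcomes for bare winners vs.
# bare losers within a narrow bandwidth around the 50% cutoff.
bandwidth = 0.02
winners = [y for s, y in races if 0.5 < s <= 0.5 + bandwidth]
losers = [y for s, y in races if 0.5 - bandwidth <= s <= 0.5]
rd_estimate = sum(winners) / len(winners) - sum(losers) / len(losers)
```

The identifying assumption is that nothing else jumps discontinuously at 50%, so the bare winners and bare losers are comparable except for holding office.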
I recently read a paper by David Jensen and coauthors from the UMass Knowledge Discovery Laboratory that proposes a systematic way of uncovering causal relationships from databases. Their approach (which they call AIQ -- "Automated Identification of Quasi-experiments") is not to mine the joint density of variables for independencies that can produce a causal graph (as discussed in Jamie Robins' talk last March), but rather to produce a list of feasible quasi-experiments based on a standard database schema that has been augmented with some causal information (e.g. A might cause B, C does not cause A or B) and some temporal information (i.e. ordering and frequency of events). In the paper, the authors provide an overview of the approach as applied to three commonly-used databases, including some candidate quasi-experiments that the algorithm suggests.
My impression after reading the paper was that AIQ's discovery potential is pretty limited (at least at this stage), because most users who could provide the inputs AIQ needs could very likely think up the quasi-experimental design themselves. Any valid quasi-experiment design that AIQ can discover at this point appears to come from the user specifying that the treatment and outcome have no common cause or confounding factors, which is a very unusual situation that is either quite obvious (e.g. because there is a lottery or other explicit randomization) or requires significant substantive knowledge. I wonder how commonly a researcher would a) have in mind a causal model that is sufficiently restrictive to produce plausible quasi-experimental designs through AIQ, and b) not have already thought of those designs.
The example of causal discovery the authors provide comes from a combined IMDB/Netflix movie database; they assert that winning an Oscar improves the reviews a movie receives on Netflix. In order for AIQ to suggest this quasi-experiment, the authors had to specify in advance that the Oscar-winning film is chosen from among nominees at random. One can of course criticize that assumption, but the point is that once you make that assumption it should be quite obvious that you have a quasi-experiment with which to study the effect of winning the Oscar on various outcomes; any film-specific, post-awards ceremony outcome should do. AIQ may provide a structured way to go through that exercise, but I'm not convinced there are many circumstances in which it would be useful to a researcher.