
Authors' Committee


Matt Blackwell (Gov)


Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship



31 October 2007

The statistics of race

Amy Perfors

There's an interesting article at Salon today about racial perception. As is normally the case for scientific articles reported in the mainstream media, I have mixed feelings about it.

1) First, a pet peeve: just because something can be localized in the brain using fMRI or similar techniques does not mean it's innate. This drives me craaazy. Everything that we conceptualize or do is represented in the brain somehow (unless you're a dualist, and that has its own major logical flaws). For instance, trained musicians devote more of their auditory processing regions to listening to piano music, and have a larger auditory cortex and larger areas devoted to motor control of the fingers used to play their instrument. [cite]. This is (naturally, reasonably) not interpreted as meaning that playing the violin is innate, but that the brain can "tune itself" as it learns. [These differences are linked to amount of musical training, and are larger the younger the training began, which all supports such an interpretation]. The point is, localization in the brain != innateness. Aarrgh.

2) The article talks about what agent-based modeling has shown us, which is interesting:

Using this technique, University of Michigan political scientist Robert Axelrod and his colleague Ross Hammond of the Brookings Institution in Washington, D.C., have studied how ethnocentric behavior may have evolved even in the absence of any initial bias or prejudice. To make the model as simple as possible, they made each agent one of four possible colors. None of the colors was given any positive or negative ranking with respect to the other colors; in the beginning, all colors were created equal. The agents were then provided with instructions (simple algorithms) as to possible ways to respond when encountering another agent. One algorithm specified whether or not the agent cooperated when meeting someone of its own color. The other algorithm specified whether or not the agent cooperated with agents of a different color.

The scientists defined an ethnocentric strategy as one in which an agent cooperated only with other agents of its own color, and not with agents of other colors. The other strategies were to cooperate with everyone, cooperate with no one and cooperate only with agents of a different color. Since only one of the four possible strategies is ethnocentric and all were equally likely, random interactions would result in a 25 percent rate of ethnocentric behavior. Yet their studies consistently demonstrated that greater than three-fourths of the agents eventually adopted an ethnocentric strategy. In short, although the agents weren't programmed to have any initial bias for or against any color, they gradually evolved an ethnocentric preference for one's own color at the expense of those of another color.

Axelrod and Hammond don't claim that their studies duplicate the real-world complexities of prejudice and discrimination. But it is hard to ignore that an initially meaningless trait morphed into a trigger for group bias. Contrary to how most of us see bigotry and prejudice as arising out of faulty education and early-childhood indoctrination, Axelrod's model doesn't begin with preconceived notions about the relative values of different colors, nor is it associated with any underlying negative emotional state such as envy, frustration or animosity. Detection of a difference, no matter how innocent, is enough to result in ethnocentric strategies.

As I understand it, the general reason these experiments work the way they do is that the other strategies do worse given the dynamics of the game (single-interaction Prisoner's Dilemma): (a) cooperating with everyone leaves one open to being "suckered" by more people; (b) cooperating with nobody leaves one open to being hurt disproportionately by never getting the benefits of cooperation; and (c) cooperating with different colors is less likely to lead to a stable state.

Why is this last observation -- the critical one -- true? Let's say we have a red, orange, and yellow agent sitting next to each other, and all of them decide to cooperate with a different color. This is good, and leads to an increased probability of all of them being able to reproduce, and the next generation has two red, two yellow, and two orange agents. Now the problem is apparent: each of the agents is now next to an agent (i.e., the other one of its own color) that it is not going to cooperate with, which will hurt its chances of being able to survive and reproduce. By contrast, subsequent generations of agents that favor their own color won't have this problem. And in fact, if you remove "local reproduction" -- if an agent's children aren't likely to end up next to it -- then you don't get the rise of ethnocentrism... but you don't get much cooperation, either. (Again, this is sensible: the key is for agents to be able to essentially adapt to local conditions in such a way that they can rely on the other agents close to them, and they can't do that if reproduction isn't local). I would imagine that if one's cooperation strategy didn't tend to resemble the cooperation strategy of one's parents, you wouldn't see ethnocentrism (or much cooperation) either.
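For anyone who wants to poke at these dynamics, here is a rough sketch of this kind of model in Python. It is my own toy reconstruction from the description above, not Axelrod and Hammond's code: the lattice size, the payoff numbers, the mutation and death rates, and the strategy labels are all illustrative assumptions.

```python
import random
from collections import Counter

# All parameter values below are illustrative assumptions, not taken from the paper.
SIZE = 20                    # SIZE x SIZE lattice with wrap-around edges
COLORS = 4                   # number of arbitrary group tags
BASE_PTR = 0.12              # baseline chance of reproducing in a period
COST, BENEFIT = 0.01, 0.03   # cost of giving help, gain from receiving it
MUTATION = 0.005             # per-trait mutation chance for offspring
DEATH = 0.10                 # chance that an agent dies each period


def random_agent(rng):
    # An agent is (color, helps same color?, helps other colors?).
    return (rng.randrange(COLORS), rng.random() < 0.5, rng.random() < 0.5)


def neighbors(x, y):
    return [((x + dx) % SIZE, (y + dy) % SIZE)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]


def step(grid, rng):
    # 1. Immigration: one random newcomer lands on an empty cell, if any.
    empty = [(x, y) for x in range(SIZE) for y in range(SIZE)
             if (x, y) not in grid]
    if empty:
        grid[rng.choice(empty)] = random_agent(rng)

    # 2. Interaction: each agent decides whether to help each neighbor.
    ptr = {pos: BASE_PTR for pos in grid}
    for pos, (color, same, other) in grid.items():
        for npos in neighbors(*pos):
            if npos in grid:
                helps = same if grid[npos][0] == color else other
                if helps:
                    ptr[pos] -= COST
                    ptr[npos] += BENEFIT

    # 3. Local reproduction: offspring go to an empty adjacent cell.
    order = list(grid)
    rng.shuffle(order)
    for pos in order:
        if rng.random() < ptr[pos]:
            free = [n for n in neighbors(*pos) if n not in grid]
            if free:
                color, same, other = grid[pos]
                if rng.random() < MUTATION:
                    color = rng.randrange(COLORS)
                if rng.random() < MUTATION:
                    same = not same
                if rng.random() < MUTATION:
                    other = not other
                grid[rng.choice(free)] = (color, same, other)

    # 4. Death: every agent dies with a fixed probability.
    for pos in list(grid):
        if rng.random() < DEATH:
            del grid[pos]


def strategy_counts(grid):
    labels = {(True, False): "own_color_only", (True, True): "everyone",
              (False, False): "nobody", (False, True): "other_colors_only"}
    return Counter(labels[(same, other)] for _, same, other in grid.values())


def run(periods=200, seed=0):
    rng = random.Random(seed)
    grid = {}  # maps (x, y) to an agent; the lattice starts empty
    for _ in range(periods):
        step(grid, rng)
    return strategy_counts(grid)
```

Calling `run(periods=2000)` and looking at the share of `own_color_only` agents is the obvious first experiment. Replacing the `free` list in step 3 with a random empty cell anywhere on the lattice is one way to switch off local reproduction and see what changes.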

3) One thing the article didn't talk about, but I think is very important, is how much racial perception may have to do with our strategies of categorization in general. There's a rich literature studying categorization, and one of the basic findings is of boundary sharpening and within-category blurring. (Rob Goldstone has been doing lots of interesting work in this area, for instance). Boundary sharpening refers to the tendency, once you've categorized X and Y as different things, to exaggerate their differences: if the categories containing X and Y are defined by differences in size, you would perceive the size difference between X and Y to be greater than it actually is. Within-category blurring refers to the opposite effect: the tendency to minimize the differences of objects within the same category -- so you might see two X's as being closer in size than they really are. This is a sensible strategy, since the more you do it, the better you'll be able to correctly categorize the boundary cases. However, it results in something that looks very much like stereotyping.

Research along these lines is just beginning, and it's too early to go from this observation to conclude that part of the reason for stereotyping is that it emerges from the way we categorize things, but I think it's a possibility. (There also might be an interaction with the cognitive capacity of the learning agent, or its preference for a "simpler" explanation -- the more the agent can't remember subtle distinctions, and the more the agent favors an underlying categorization with few groups or few subtleties between or within groups, the more these effects occur).

All of which doesn't mean, of course, that stereotyping or different in-group/out-group responses are justified or rational in today's situations and contexts. But figuring out why we think this way is a good way to start to understand how not to when we need to.

[*] Axelrod and Hammond's paper can be found here.

Posted by Amy Perfors at 2:32 PM

30 October 2007

Andrew C. Thomas on "Symmetry and Competition in State Legislative Election Systems".

This week, the Applied Statistics Workshop is happy to have Andrew C. Thomas, G-4 in the Department of Statistics, presenting his work on "Symmetry and Competition in State Legislative Election Systems". Andrew has provided the following abstract for his presentation:

Drawing of legislative districts has historically been conducted by the legislators themselves; recently, some states have appointed redistricting commissions, the members of which cannot run for seats in the legislature for a period afterwards. I demonstrate that current methods, in particular the Gelman-King model and the JudgeIt R package, can easily diagnose the state of an electoral map given previous electoral conditions. In particular, competition increases in states with commissions, but the impact on symmetry is as yet unclear. I conclude with a discussion on techniques to improve the resolution and measurement of electoral symmetry within states.

Please join us this Wednesday at 12 noon for the presentation and a light lunch. We hold the workshop in Room N-354 of CGIS-Knafel (1737 Cambridge St).

Posted by Justin Grimmer at 1:46 AM

Clay Public Lecture: "Technology-driven statistics"

The Clay Mathematics Institute and the Harvard Mathematics Department are sponsoring a lecture by Terry Speed from the Department of Statistics at Berkeley on "Technology-driven statistics," with a focus on the challenges posed to statistical theory and practice by the massive amounts of data that are generated by modern scientific instruments (microarrays, mass spectrometers, etc.). These issues have not yet been as salient in the social sciences, but they are clearly on the horizon. The talk is at 7PM tonight (Oct. 30) in Science Center B at Harvard. The abstract for the talk is after the jump:

Technology-driven Statistics

Terry Speed, UC Berkeley and WEHI in Melbourne, Australia

Tuesday, October 30, 2007, at 7:00 PM

Harvard University Science Center -- Hall B

Forty years ago, biologists collected data in their notebooks. If they needed help from a statistician in analyzing and interpreting it, they would pass over a piece of paper with numbers on it. The theory on which statistical analyses were built a couple of decades earlier seemed entirely adequate for the task. When computers became widely available, analyses became easier and a little different, with the term "computer intensive" entering the lexicon. Now, in contemporary biology and many other areas, new technologies generate data whose quantity and complexity stretch both our hardware and our theory. Genome sequencing, genechips, mass spectrometers and a host of other technologies are now pushing statistics very hard, especially its theory. Terry Speed will talk about this revolution in data availability, and the revolution we need in the way we theorize about it.

Terry Speed splits his time between the Department of Statistics at the University of California, Berkeley and the Walter & Eliza Hall Institute of Medical Research (WEHI) in Melbourne, Australia. Originally trained in mathematics and statistics, he has had a life-long interest in genetics. After teaching mathematics and statistics in universities in Australia and the United Kingdom, and a spell in Australia's Commonwealth Scientific and Industrial Research Organization, he went to Berkeley 20 years ago. Since that time, his research and teaching interests have concerned the application of statistics to genetics and molecular biology. Within that subfield, eventually to be named bioinformatics, his interests are broad, including biomolecular sequence analysis, the mapping of genes in experimental animals and humans, and functional genomics. He has been particularly involved in the low level analysis of microarray data. Ten years ago he took the WEHI job, and now spends half of his time there, half in Berkeley, and the remaining half in the air somewhere in between.

Posted by Mike Kellermann at 12:08 AM

29 October 2007

Visualizing Electoral Data

Andy Eggers and I are currently working on a project on UK elections. We have collected a new dataset that covers detailed information on races for the House of Commons between 1950 and 1970; seven general elections overall. We have spent some time thinking about new ways to visualize electoral data and Andy has blogged about this here and here. Today, I'd like to present a new set of plots that we came up with to summarize the closeness of constituency races over time. This is important for our project because we exploit close district races as a source of identification.

Conventional wisdom holds that in Britain, about one-quarter of all seats are 'marginal', i.e. decided by majorities of less than 10 percentage points. To visualize this fact Andy and I came up with the following plot. Constituencies are on the x axis and the elections are on the y axis. Colors indicate the closeness of the district race (i.e. vote majority / vote sum) categorized into different bins as indicated in the color key on top. Color scales are from Colorbrewer. We have ranked the constituencies from close to safe from left to right. Please take a look:


The same plot is available as a pdf here. The conventional wisdom seems to hold. About 30 percent of the races are close. Also some elections are closer than others.

A long format of the plot is available here. It allows one to identify individual districts, but requires some scrolling. We are considering developing an interactive version using JavaScript so that additional info pops up as one mouses over the plot. Notice that both plots exclude the 50 or so districts that changed names as a result of the 1951 redistricting wave.

Finally, Andy and I care about districts that swing between the two major parties. To visualize this we have produced similar plots where the color now indicates the vote share margins as seen by the Conservative party: ((Conservative vote - Labour vote)/vote sum). So negative values indicate a Labour victory and positive values a victory of the Conservative party. We only look at districts where Labour or the Conservative party took first and second place. Here it is:


The partisan swings from election to election are really clear. Finally, the long format is here. The latter plot makes it easy to identify the party strongholds during this time period. Comments and suggestions are highly welcome. We wonder whether anybody has done such plots before or whether we can legitimately coin them as Eggmueller plots (lol).
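For readers who want to reproduce the two quantities behind the colors, here is a minimal Python sketch of the definitions given above. The function names, the bin edges, and the vote counts are hypothetical and chosen only for illustration; they are not the ones used in the plots.

```python
from bisect import bisect_right

# Hypothetical bin edges for the color key (the plots may use different ones).
BIN_EDGES = [0.05, 0.10, 0.20, 0.30]
BIN_LABELS = ["<5%", "5-10%", "10-20%", "20-30%", ">=30%"]


def closeness(votes):
    """Vote majority / vote sum: winner minus runner-up, over all votes cast."""
    ranked = sorted(votes.values(), reverse=True)
    return (ranked[0] - ranked[1]) / sum(ranked)


def conservative_margin(votes):
    """(Conservative vote - Labour vote) / vote sum.

    Returns None unless Labour and the Conservatives took the top two places.
    Negative values mean a Labour win, positive values a Conservative win.
    """
    top_two = sorted(votes, key=votes.get, reverse=True)[:2]
    if set(top_two) != {"Conservative", "Labour"}:
        return None
    return (votes["Conservative"] - votes["Labour"]) / sum(votes.values())


def closeness_bin(margin):
    """Map a closeness value to the label of its color bin."""
    return BIN_LABELS[bisect_right(BIN_EDGES, margin)]


# Made-up race, for illustration only.
race = {"Conservative": 21000, "Labour": 19000, "Liberal": 10000}
print(closeness(race), closeness_bin(closeness(race)), conservative_margin(race))
```

With these edges a race counts as marginal in the conventional sense when it falls in one of the first two bins.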

Posted by Jens Hainmueller at 8:13 PM

26 October 2007

Income, partisanship, and voting

Andrew Gelman has an interesting post up about voting behavior in rich states and poor states, showing how voting patterns differ across the country when you condition on the income of the voters. There is not much of a relationship between per capita income and support for Democrats among poor voters, but there is a strong relationship among rich voters: rich voters in poor states are much more likely to support the Republicans than rich voters in rich states.

On a related note, Larry Bartels will be speaking at the Inequality seminar at Harvard on Monday, October 29 at noon in the Taubman Dining Room at the Kennedy School. His talk is entitled "Partisan Biases in Electoral Accountability," and draws on a forthcoming book. Much of his evidence focuses on differences in the reactions of lower, middle, and upper-income voters to economic performance. Gelman and Bartels are great examples of political scientists who are trying (with limited success, perhaps) to knock down some of the conventional wisdom about the "Red State, Blue State", "values voters" divide with careful data analysis, and are always well worth attention.

Posted by Mike Kellermann at 3:52 PM

25 October 2007

Visualizing UK Politicians

Since I saw Fernanda Viegas and Martin Wattenberg's presentation on Many Eyes a few weeks ago in our Applied Stats workshop, I've been itching to use their visualization tools on some of my own data. Tonight I made a treemap of the dataset of UK politicians that Jens Hainmueller and I have been developing. (The data consists of over 6000 candidates who ran for the House of Commons between 1950 and 1970.) I set up the visualization such that each box in the treemap is sized to indicate the number of campaigns for each combination of party and occupation (e.g. Conservative barristers) and the color reflects "proportion attending Oxbridge." But you can play with it via the menus at the bottom of the visualization and cut the data the way you want: you can make the size reflect "proportion female" and the color reflect "proportion elected," and you can make it show party by occupation instead of occupation by party. I've embedded the visualization here; you can make comments tied to a particular view of the data on the Many Eyes site. I'm eager to hear reactions, whether on the visualization or on the brand-new data.

Posted by Andy Eggers at 1:50 AM

22 October 2007

James Stock on ‘Forecasting in Dynamic Factor Models’

The Applied Statistics Workshop is proud to present James Stock, Chair of the Economics Department, as he presents, “Forecasting in Dynamic Factor Models Subject to Structural Instability”. James has provided the following abstract:

Dynamic factor models (DFMs) express the comovements of time series at leads and lags in terms of a small number of latent factors. In macroeconomic applications, the latent factors can be thought of as theoretical constructs (income) that are linked to specific measurements (GDP). The large body of work on DFMs in macroeconomics assumes a stable structure. This paper develops time-varying DFMs and uses implications of time-varying DFMs to shed light on some ongoing macro puzzles such as the Great Moderation and the breakdown of the backward-looking Phillips curve.

The workshop will meet at 12 noon in room N-354, CGIS-Knafel. A light lunch will be served.

Posted by Justin Grimmer at 8:45 PM

19 October 2007

Tim McCarver is a Bayesian with very strong priors....

The Red Sox beat the Indians last night in Game 5 of the ALCS, sending the series back to Fenway and enabling the majority of us at Harvard who are (at least fair-weather) Sox fans to, as Kevin Youkilis said last night, come down off the bridge for a few more days. Why do I bring this up? Well, after Boston's loss in Game 4, a commenter on this blog asked the following question:

In the disastrous inning of the Red Sox game tonight, the announcer (maybe Tim McCarver?) said “One would think that a lead-off walk would lead to more runs than a lead-off home-run, but it’s not true. We’ve researched it and this year a lead-off home-run has led to more multi-run innings than have lead-off walks.”

I must not be "one", b/c I think a lead-off home-run is much more likely to lead to multiple-run innings, b/c after the home-run, you have a run and need only 1 more to have multiple, and the actions after the first batter are mostly independent of the results of the first batter. So, I think he has it totally backwards. I was a fair stats student, so I need confirmation. He was backwards, right?

The short answer is that it was Tim McCarver, and as an empirical matter he was wrong to be surprised. I don't have access to full inning-by-inning statistics over a long period of time, but the most convincing analysis I found in a quick search (here) suggests that between 1974 and 2002, the probability of a multi-run inning conditional on a leadoff walk is .242 and the probability of a multirun inning after a leadoff home run is .276.

The blogosphere has had a lot of fun at McCarver's expense (not that it takes much to provoke such a reaction, granted): It's Math!, Zero > One, Tim McCarver Does Research, etc. His observation, though, is a good example of Bayesian updating at work: while I doubt that most baseball observers "would think that a lead-off walk would lead to more runs than a lead-off home-run," it is very clear that Tim McCarver thought that at some point. As evidence, in a 2006 game he made the following comment:

"There is nothing that opens up big innings any more than a leadoff walk. Leadoff home runs don't do it. Leadoff singles, maybe. But a leadoff walk. It changes the mindset of a pitcher. Since he walked the first hitter, now all of a sudden he wants to find the fatter part of the plate with the succeeding hitters. And that could make for a big inning."

In 2004, he said during the Yankees-Red Sox ALCS that "a walk is as good as a home run." And back in 2002, he made a similar comment during the playoffs; in fact, it was that comment that prompted the analysis that I linked to above! Clearly, he had a strong prior belief (from where, I don't know) that leadoff walks somehow get in the pitcher's head and produce more big innings. Now that he's been confronted by data, those beliefs are updating, but since his posterior has shifted so much from his prior it's not surprising that he thinks this is some great discovery. In a couple of years, he'll probably think that he always knew a leadoff home run was better.

As for the intuition, it looks like the commenter is also correct. Using the data cited above, the probability of scoring zero runs in an inning is approx. .723, while the probability of scoring no additional runs after a leadoff homer is approx. .724; the rest of the distribution is similar as well.
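To spell out the arithmetic behind the commenter's argument: after a leadoff homer, a multi-run inning needs only one more run, so its probability is one minus the chance of scoring no additional runs. A few lines of Python make the comparison explicit, using only the figures quoted in this post; the variable and function names are mine.

```python
# Figures quoted in the post (1974-2002 data from the linked analysis).
P_ZERO_RUNS_FRESH_INNING = 0.723    # P(no runs in an inning)
P_NO_MORE_RUNS_AFTER_HR = 0.724     # P(no additional runs | leadoff home run)
P_MULTI_RUN_AFTER_WALK = 0.242      # P(multi-run inning | leadoff walk)


def multi_run_after_leadoff_hr(p_no_more_runs):
    """One run is already in, so any further run makes the inning multi-run."""
    return 1.0 - p_no_more_runs


p_multi_after_hr = multi_run_after_leadoff_hr(P_NO_MORE_RUNS_AFTER_HR)
p_any_run_fresh = 1.0 - P_ZERO_RUNS_FRESH_INNING

print(f"P(multi-run | leadoff HR)   = {p_multi_after_hr:.3f}")
print(f"P(multi-run | leadoff walk) = {P_MULTI_RUN_AFTER_WALK:.3f}")
print(f"P(at least one run, fresh inning) = {p_any_run_fresh:.3f}")
```

The first number reproduces the .276 cited earlier, and it is nearly identical to the chance of scoring at least once in a fresh inning, which is what you would expect if what happens after the first batter is roughly independent of what the first batter did.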

Posted by Mike Kellermann at 1:02 PM

Martin-Quinn on SCOTUSblog

The Martin-Quinn estimates of judicial preferences, developed by Andrew Martin and our own Kevin Quinn, are an interesting example of top-notch methods work that has received fairly widespread attention outside of the methods community. On SCOTUSblog, there is an interview with Andrew; while it's aimed at legal practitioners rather than statisticians, it's good to see them getting some screen time.

Posted by Mike Kellermann at 12:55 PM

18 October 2007

R Quiz Anybody?

Perl has the Perl quiz, Python has the Python challenges, Ruby has the Ruby quiz, but what about our good old friend R?? Does such a thing exist anywhere? Would be a nice idea I think...

Posted by Jens Hainmueller at 8:52 PM

17 October 2007

How tall are you? No, really...

Continuing on the topic of self-reported health data, and how to correct for reporting (and other) biases, here is an interesting paper on height and weight in the US. Those two measures have received a lot of interest in the past years, not least as components of the body-mass index (BMI), which is used to estimate the prevalence of obesity. BMI itself is not a great measure (more on that another day) but at least it’s relatively easy to collect via telephone and in-person interviews. Of course some people make mistakes while reporting their own vital measures, and some might do so systematically: a height of 6 foot sounds like a good height to have even to me, and I tend to think in the metric system!

Anyway, the paper by Ezzati et al. examines the issue of systematic misreporting. They note that existing smaller-scale studies on this issue might in fact under-estimate the bias because of their design. People might limit their misreporting if they are measured before or after reporting their vitals, which is a challenge for validation studies. And participation might differ systematically between the interview modes of the validation studies and those of general health surveys (e.g. in-person versus telephone interviews), so that the studies are not directly comparable to population-level surveys.

The idea of the paper is to employ two nationally representative surveys to compare three different kinds of measurement for height and weight, by age group and gender. The first survey is the National Health and Nutrition Examination Survey (NHANES), which collects self-reported information through in-person interviews and measured values through a medical examination. The second is the Behavioral Risk Factor Surveillance System (BRFSS), an annual cross-sectional telephone survey that is representative at the state level and features widely in policy discussions.

The comparisons between the surveys might confirm your priors on misreporting. On average, women under-report their weight and men under 65 tend to over-report their height. The authors find that state-level obesity measures based on the BRFSS are too low – they re-calculate that a number of states in fact had obesity prevalences above 30% in 2000. Of course this is not a perfectly clean assessment, because the NHANES participants might have anticipated the clinical examination a few weeks after the in-person interview. But at the least this study is a good reminder that people do systematically misreport for some reason, and that analysts should treat self-reported BMI carefully.
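To see why small reporting errors matter for prevalence estimates, recall that BMI is weight in kilograms divided by height in meters squared, with 30 as the usual obesity cutoff. The Python sketch below uses made-up numbers for a single hypothetical respondent (not figures from the paper) to show how a little extra height and a little less weight can move someone across the cutoff.

```python
OBESITY_THRESHOLD = 30.0  # conventional BMI cutoff for obesity


def bmi(weight_kg, height_m):
    """Body-mass index: weight in kilograms over height in meters squared."""
    return weight_kg / height_m ** 2


def is_obese(weight_kg, height_m):
    return bmi(weight_kg, height_m) >= OBESITY_THRESHOLD


# Hypothetical respondent; the values are assumptions for illustration only.
measured = {"weight_kg": 90.0, "height_m": 1.72}
reported = {"weight_kg": 88.0, "height_m": 1.75}  # 2 kg lighter, 3 cm taller

print(f"measured BMI: {bmi(**measured):.1f}, obese: {is_obese(**measured)}")
print(f"reported BMI: {bmi(**reported):.1f}, obese: {is_obese(**reported)}")
```

Because height enters squared, over-reporting height and under-reporting weight push BMI in the same direction, so anyone whose true BMI sits just above 30 can drop out of the obese category in a telephone survey.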

Posted by Sebastian Bauhoff at 10:23 PM

16 October 2007

Sveriges Riksbanks pris i ekonomisk vetenskap till Alfred Nobels minne 2007

The winners of the 2007 Economics prize were announced yesterday in Stockholm; the award will go to Leonid Hurwicz, Eric Maskin, and Roger Myerson "for having laid the foundations of mechanism design theory." Not quite as well known as this year's Peace prize winner, but big names in the world of economic theory. Marginal Revolution has much more detail on the winners and their work (here, here, and elsewhere). I don't have much to add, other than a few comments on why I'm blogging this on a statistics blog.

I think it's fair to say that the Bank of Sweden Prize in Economic Sciences in honor of Alfred Nobel (yes, that's more or less the official name) is the most visible award in the social sciences. The prize has occasionally been awarded to econometricians (Engle and Granger in 2003, Heckman and McFadden in 2000, Haavelmo in 1989, and Klein in 1980), but it is striking how rare it is for econometrics (or, for that matter, empirical work in economics) to be recognized by the prize committee. This is not true of other fields. To get a sense of the discrepancy, compare economics with physics, a discipline not known for being particularly atheoretical. Each award carries with it a citation recognizing the work for which the prize was given. If we look at the ratio of citations with the word "theory" to those with the word "discovery", in economics the ratio is 19 to 1 (and the one "discovery" is the Coase Theorem), while in physics the ratio is more like 1 to 3.8. I think this reflects the productive interplay between theory and empirics in physics, and the lack of a similar dynamic in economics (and social science generally). It will be interesting to see when and if the current movement toward behavioral economics will be recognized by the selection committee.

Posted by Mike Kellermann at 1:15 PM

15 October 2007

Damon Centola on 'Diffusion in Social Networks'

The Applied Statistics Workshop is back for another exciting installment. This week we have Damon Centola, RWJ Scholar, Harvard University, presenting 'Diffusion in Social Networks: New Theory and Experiments'. Damon provided the following abstract for his talk:

The strength of weak ties is that they tend to be long – they connect socially distant locations. Research on “small worlds” shows that these long ties can dramatically reduce the “degrees of separation” of a social network, thereby allowing ideas and behaviors to rapidly diffuse. However, I show that the opposite can also be true. Increasing the frequency of long ties in a clustered social network can also inhibit the diffusion of collective behavior across a population. For health related behaviors that require strong social reinforcement, such as dieting, exercising, smoking, or even condom use, successful diffusion may depend primarily on the width of bridges between otherwise distant locations, not just their length. I present formal and computational results that demonstrate these findings, and then propose an experimental design for empirically testing the effects of social network topology on the diffusion of health behavior.

The workshop is held on Wednesday at 12 noon in room N-354, CGIS-Knafel (1737 Cambridge St). A light lunch will be served.

Posted by Justin Grimmer at 5:59 PM

12 October 2007

Visualization for data cleaning

Speaking of Fernanda Viegas and Martin Wattenberg's excellent presentation on visualization, I recently came across a data cleaning problem where visualization was a big help. Data cleaning is all about having powerful ways of finding mistakes quickly. Much of the time, clever scripting is the best way to detect errors, but in this case a simple data visualization turned out to be the best tool. Screenshot after the jump.

First, a little background on the project, which is a collaboration with Jens Hainmueller. The Times of London published election guides throughout the 20th century including voting results and candidate bios for every constituency in every election to the House of Commons. We scanned and OCR'd seven volumes of this series and wrote scripts to extract information about each constituency race, including the name, vote total, and short bio of each candidate. The challenge then was to determine which appearances belonged to the same individual. For example, when "P G Agnew" runs in 1950 and "Peter Agnew" runs in 1955, are they the same person? We trained a clustering algorithm to do this matching based on name similarity, year of birth, party, and gender, and wrote some scripts to catch likely errors. When we thought we had done as well as we could, we decided to produce a little visualization to admire our perfectly cleaned data. To our surprise, the visualization revealed a number of hard-to-catch remaining errors.

As can be seen in the screenshot below, we listed the candidates alphabetically by surname and depicted their election career graphically with a colored rectangle for each appearance in a race. We selected the colors to reflect the margin in the race, with deep green indicating an easy victory and deep red indicating a resounding defeat.
Depicting the candidates' campaign history in this way helped us see patterns that suggested that a single candidate had been incorrectly coded as separate candidates. Brian Batsford, shown at the top of the screen shot, was one such case: the Brian Batsford who ran in 1959, 1964, and 1970 was very likely to be the same person as the Brian Batsford who ran in 1966. Indeed, it turned out that they were the same person; our clustering algorithm had mistakenly separated him in two because the year of birth had been miscoded as 1928 in his 1966 appearance.

The key point here is that the pattern that allowed us to see this mistake is easier to see than it is to articulate and, perhaps more importantly, than it is to write in a script. (OK, I'll try: "Find pairs of candidates who have similar names and did not appear in the same elections, especially if they appeared in contiguous elections and had similar results.") I prefer the pretty colors.
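For the record, here is roughly what that script might look like in Python. This is only a sketch of the rule in the parenthesis, not the code we actually ran: the record ids, the similarity threshold, and the use of `difflib` for name matching are all assumptions made for illustration, and it ignores year of birth, party, and gender.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The seven general elections covered by the data.
ELECTIONS = [1950, 1951, 1955, 1959, 1964, 1966, 1970]


def name_similarity(a, b):
    """Crude string similarity between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suspicious_pairs(candidates, min_similarity=0.8):
    """Flag pairs of records that may be the same person.

    `candidates` maps a record id to (name, set of election years), and every
    year must appear in ELECTIONS. A pair is flagged when the names are similar
    and the two records never appear in the same election. Pairs that ran in
    back-to-back elections are listed first.
    """
    flagged = []
    for (id_a, (name_a, years_a)), (id_b, (name_b, years_b)) in combinations(
            candidates.items(), 2):
        if years_a & years_b:
            continue  # both ran in the same election, so they are distinct
        similarity = name_similarity(name_a, name_b)
        if similarity < min_similarity:
            continue
        contiguous = any(
            abs(ELECTIONS.index(ya) - ELECTIONS.index(yb)) == 1
            for ya in years_a for yb in years_b)
        flagged.append((id_a, id_b, round(similarity, 2), contiguous))
    flagged.sort(key=lambda pair: (not pair[3], -pair[2]))
    return flagged


# Hypothetical record ids; the Batsford years are the ones described above.
records = {
    "c1": ("Brian Batsford", {1959, 1964, 1970}),
    "c2": ("Brian Batsford", {1966}),
    "c3": ("John Smith", {1959}),
}
print(suspicious_pairs(records))
```

Even this version leaves out the "similar results" part of the rule and would need tuning for initials and nicknames, which is rather the point: the picture got us there faster.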

Posted by Andy Eggers at 12:35 PM

10 October 2007

Visualizing the evolution of open-edited text

Today's applied stats talk by Fernanda Viegas and Martin Wattenberg covered a wide array of interesting data visualization tools that they and their colleagues have been developing over at IBM Research. One of the early efforts that they described is an applet called History Flow, which allows users to visualize the evolution of a text document that was edited by a number of people, such as Wikipedia entries or computer source code. You can track which authors contributed over time, how long certain parts of the text have remained in place, and how text moves from one part of the document to another. To give you a flavor of what is possible, here is a visualization of the history of the Wikipedia page for Gary King (who is the only blog contributor who has one at the moment):


This shows how the page became longer over time and that it was primarily written by one author. The applet also allows you to connect textual passages from earlier versions to their authors. We noticed this one from Gary's entry:


"Ratherclumsy"'s contribution to the article only survived for 24 minutes, and was deleted by another user with best wishes for becoming "un-screwed". All kidding aside, this is a really interesting tool for text-based projects. Leaving aside the possibility for analysis, this would be useful for people working on coding projects. I can think of more than one R function that I've worked on where it would be nice to know who wrote a particular section of code....

Posted by Mike Kellermann at 5:52 PM

8 October 2007

Fernanda Viegas and Martin Wattenberg on Data Visualization

Dear Applied Statistics Community,

Please join us for this week's installment of the Applied Statistics workshop, where Fernanda Viegas and Martin Wattenberg will be presenting their talk entitled "From Wikipedia to Visualization and Back". The authors provided the following abstract for their talk:

This talk will be a tour of our recent visualization work, starting with a case study of how a new data visualization technique uncovered dramatic dynamics in Wikipedia. The technique sheds light on the mix of dedication, vandalism, and obsession that underlies the online encyclopedia. We discuss the reaction of the Wikipedia community to this visualization, and how it led to a recent ambitious project to make data visualization technology available to everyone. This project, Many Eyes, is a web site where people may upload their own data, create interactive visualizations, and carry on conversations. The goal is to foster a social style of data analysis in which visualizations serve not only as a discovery tool for individuals but also as a means to spur discussion and collaboration.

Martin and Fernanda have also provided the following set of links as background for the presentation:



And to a website based upon recent work in data visualization

Link to Many Eyes site:

As always, the workshop meets at 12 noon on Wednesday, in room N-354 CGIS-Knafel. A light lunch will be provided.

Posted by Justin Grimmer at 12:02 PM

5 October 2007

Jezebel, Cassandra, and the happiness gap

On a lighter note this Friday afternoon, there has been an interesting and largely good-natured debate on various blogs in response to a recent New York Times article on the happiness gap (or the change in the happiness gap, or reversal, or something) between men and women (He's happier, she's less so). Much of the discussion has been on the substantive significance of the results and how those results are likely to be interpreted by the (non-statistically minded) public. This post at Language Log summarizes the debate and provides links to previous entries on both sides. Most of these are quite serious, while a few are (ahem) less so. On the other hand, any time that Stata code appears on a pop culture website, it is worth noting...

Posted by Mike Kellermann at 4:07 PM

4 October 2007

Another way of thinking about probability?

Amy Perfors

On Tuesday I went to a talk by Terrence Fine from Cornell University. It was one of those talks that's worth going to, if nothing else because it makes you re-visit and re-question the sort of basic assumptions that are so easy to not even notice that you're making. In this case, that basic assumption was that the mathematics of probability theory, which views probability as a real number between 0 and 1, is equally applicable to any domain where we want to reason about statistics.

Is this a sensible assumption?

As I understand it, Fine made the point that in many applied fields, what you do is start from the phenomenon to be modeled and then use the mathematical/modeling framework that is appropriate to it. In other words, you go from the applied "meaning" to the framework: e.g., if you're modeling dynamical systems, then you decide to use differential equations. What's odd in applications of probability theory, he said, is that you basically go from the mathematical theory to the meaning: we interpret the same underlying math as having different potential meanings, depending on the application and the domain.

He discussed four different applications, which are typically interpreted in different ways: physically-determined probability (e.g., statistical mechanics or quantum mechanics); frequentist probability (i.e., more data driven); subjective probability (in which probability is interpreted as degree of belief); and epistemic/logical (in which probability is used to characterize inductive reasoning in a formal language). Though I broadly agree with these distinctions, I confess I'm not getting the exact subtleties he must be making: for instance, it seems to me the interpretation of probability in statistical mechanics is arguably very different from that in quantum mechanics, and they should therefore not be lumped together: in statistical mechanics, the statistics of flow arise from some underlying variables (i.e., the movements of individual particles), and in quantum mechanics, as I understand it, there aren't any "hidden variables" determining the probabilities at all.

But that technicality aside, the main point he made is that depending on the interpretation of probability and the application we are using it for, our standard mathematical framework -- in which we reason about probabilities using real numbers between 0 and 1 -- may be inappropriate because it is either more or less expressive than necessary. For instance, in the domain of (say) IQ, numerical probability is probably too expressive -- it is not sensible or meaningful to divide IQs by each other; all we really want is an ordering (and maybe even a partial ordering, if, as seems likely, the precision of an IQ test is low enough that small distinctions aren't meaningful[1]). So a mathematics of probability which views it in that way, Fine argues, would be more appropriate than the standard "numerical" view.

Another example would be in quantum mechanics, where we actually observe a violation of some axioms of probability. For instance, the distributivity of union and intersection fails: P(A or B) != P(A)+P(B)-P(A and B). This is an obvious place where one would want to use a different mathematical framework, but since (as far as I know) people in quantum mechanics actually do use such a framework, I'm not sure what his point was. Other than it's a good example of the overall moral, I guess?
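For contrast, in ordinary probability on a finite sample space the identity P(A or B) = P(A)+P(B)-P(A and B) holds for any pair of events. Here is a minimal check, assuming a fair six-sided die and two events picked purely for illustration (the helper `prob` and the events are not from the talk):

```python
from fractions import Fraction

# Sample space for one roll of a fair die (an illustrative assumption).
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Classical probability: favorable outcomes over total outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

lhs = prob(A | B)                      # P(A or B)
rhs = prob(A) + prob(B) - prob(A & B)  # P(A) + P(B) - P(A and B)

print(lhs, rhs, lhs == rhs)
```

Using exact fractions rather than floats keeps the comparison free of rounding error.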

Anyway, the talk was interesting and thought-provoking, and I think it's a good idea to keep this point in the back of one's mind. That said, although I can see why he's arguing that different underlying mathematics might be more appropriate in some cases, I'm not convinced yet that we can conclude that using a different underlying mathematics (in the case of IQ, say) would therefore lead to new insight or help us avoid misconceptions. One of the reasons numerical probability is used so widely -- in addition to whatever historical entrenchment there is -- is that it is an indispensable tool for doing inference, reasoning about distributions, etc. It seems like replacing it with a different sort of underlying math might result in losing some of these tools (or, at the very least, require us to spend decades inventing new ones).

Of course, other mathematical approaches might be worth it, but at this point I don't know how well-worked out they are, and -- speaking as someone interested in the applications -- I don't know if they'd be worth the work in order to see. (They might be; I just don't know... and, of course, a pure mathematician wouldn't care about this concern, which is all to the good). Fine gave a quick sketch of some of these alternative approaches, and I got the sense that he was working on developing them but they weren't that well developed yet -- but I could be totally wrong. If anyone knows any better, or knows of good references on this sort of thing, please let us know in comments. I couldn't find anything obvious on his web page.

[1] I really really do not want to get into a debate about whether and to what extent IQ in general is meaningful - that question is really tangential to the point of this post, and I use IQ as illustration only. (I use it rather than something perhaps less inflammatory because it's the example Fine used).

Posted by Amy Perfors at 12:40 PM

3 October 2007

Another cool visualization site

Zachary Johnson sent along a link to his new comparative politics data visualization website, the World Freedom Atlas. This is how he describes the site:

The World Freedom Atlas is a geovisualization tool for world statistics. It was designed for social scientists, journalists, NGO/IGO workers, and others who wish to have a better understanding of issues of freedom, democracy, human rights, and good governance. It covers the years 1990 to 2006.

When I took a look around, I was impressed. The site allows you to pick variables, compare variables from different years (which makes it easy to compare, say, polity scores in 1995 with the level of corruption 5 years later), produce interactive scatterplots and boxplots, etc. The data is taken from existing published sources, some of it good and some of it less so (I have a particular beef with the Vanhanen "Index of Democratization", which has always struck me as possibly the silliest attempt to measure a concept yet produced in the comparative politics literature). A couple of suggestions to incorporate in the next version: When you brush a point in the scatterplot, it only brings up the name of one country. Given the lumpiness of the data, this often conceals several other country names. Also, it would be nice to incorporate a function that allows you to print out nice image files of particular map/scatterplots/etc. I don't know how hard that would be to do. All in all, it's worth a look.

Posted by Mike Kellermann at 11:10 AM

2 October 2007

Applied Stats Workshop - Tom Cook

The Applied Statistics Workshop presents another installment this week with Thomas Cook, Department of Sociology, Northwestern University, presenting a talk entitled "When the causal estimates from randomized experiments and non-experiments coincide: Empirical findings from the within-study comparison literature." Here is an excerpt from the paper:

The present paper has several purposes. It seeks to up-date the literature since Glazerman et al. (2003) and Bloom et al. (2005) and to move it beyond its near exclusive focus on job training. We have examined the job training studies, and find nothing to challenge the past conclusions described above. However, the more recent studies allow us to broach three questions that are more finely differentiated than whether experiments and non-experiments produce comparable findings:

1. Do experiments and RDD studies produce comparable effect sizes? We have found three examples attempting this comparison.

2. Do comparable effect sizes result when the non-experiment depends on selecting one or more intact comparison groups that are deliberately matched on pretest measures of the posttest outcome, as recommended in Cook & Campbell (1979)? Thus, in a non-experiment with schools as the unit of assignment, intervention schools are carefully matched with intact non-intervention schools in the hope that the average treatment and comparison schools will not differ on pretest achievement, let us say, though they may differ on unobservables. We have found three studies with this focus.

3. Do experiments and non-experiments produce comparable effect sizes when the intervention and comparison units do differ at pretest and so statistical adjustments or individual matches are constructed to control for this demonstrated non-equivalence? This question has dominated the literature to date, and we found six studies outside of job training that asked this question.

We will meet at 12 noon in CGIS-Knafel N354 and the talk will begin at 12:15 pm. And of course a delicious, free lunch will be provided.

Posted by Justin Grimmer at 1:19 AM

1 October 2007

The Changing Evidence Base of Political Science Research

Kay Schlozman, Norman Nie, and I are preparing an edited volume in honor of Sidney Verba. The volume is entitled Political Science: What Should We Know? What Should They Know?. Instead of the usual 10 or so chapters representing something other than each contributor's best work, we invited 100 scholars to write about 1,000 words each -- basically one idea (similar to a blog entry) to address one or both of the questions in the title. I include a draft of mine below. Comments welcome.

The Changing Evidence Base of Political Science Research

I believe the evidence base of political science and the related social sciences is beginning an underappreciated but historic change. As a result, our knowledge of and practical solutions for problems of government and politics will begin to grow at an enormous rate --- if we are ready.

For the last half-century, we have learned about human populations primarily through sample surveys taken every few years, end-of-period government statistics, and in-depth studies of particular places, people, or events. These sources of information have served us well but, as is widely known, are limited: Survey research produces occasional snapshots of random selections of isolated individuals from unknown geographic locations, and the increases in cell phone use and growing levels of nonresponse are crumbling its scientific foundation. Aggregate government statistics are valuable, but in many countries are of dubious validity and are reported only with intentionally limited resolution or after obscuring valuable information. One-off in-depth studies are highly informative but for the most part do not scale, are not representative, and do not measure long-term change.

In the next half-century, these existing data collection mechanisms will surely continue to be used and improved --- such as with inexpensive web surveys, if the problems with their representativeness can be addressed --- but they will be supplemented by the profusion of massive data bases already becoming available in many areas. Some produce extensive or continuous time information on individual political behavior and its causes, such as based on text sources (via automated information extraction from blogs, emails, speeches, government reports, and other web sources), electoral activity (via ballot images, precinct-level results, and individual-level registration, primary participation, and campaign contribution data), commercial activity (through every credit card and real estate transaction and via product RFIDs), geographic location (by carrying cell phones or passing through toll booths with Fastlane or EZPass transponders), health information (through digital medical records, hospital admittances, and accelerometers and other devices being included in cell phones), and others. Parts of the biological sciences are now effectively becoming social sciences, as developments in genomics, proteomics, metabolomics, and brain imaging produce huge numbers of person-level variables. Satellite imagery is increasing in scope, resolution, and availability. The internet is spawning numerous ways for individuals to interact, such as through social networking sites, social bookmarking, comments on blogs, participating in product reviews, and entering virtual worlds, all of which are possibilities for observation and experimentation. 
(Ensuring privacy and protection of personal information during the analyses to be conducted with this information will require considerable effort, care, and new work in research ethics, but should not be markedly more difficult than the now routine medical research involving experiments on human subjects with drugs and surgical procedures of unknown safety and efficacy.)

The analogue-to-digital transformation of numerous devices people own makes them work better, faster, and less expensively, but also enables each one to produce data in domains not previously accessible via systematic analysis. This includes everything from real-time changes in the web of contacts among people in society (the bluetooth in your cell phone knows whether other people are nearby!) to records kept of individuals' web clicking, searches, and advertising clickthroughs. Partly as a result of new technology, governmental bureaucracies are improving their record keeping by moving from paper to electronic data bases, many of which are increasingly available to researchers. Some governmental policies are furthering these changes by requiring more data collection, such as the "No Child Left Behind Act" in education and via the proliferation of randomized policy experiments. All these changes are being supplemented by the replication movement in academia that encourages or requires social scientists to share data we have created with other researchers.

These data put numerous advances within our reach for the first time. Instead of trying to extract information from a few thousand activists' opinions about politics every two years, in the necessarily artificial conversation initiated by a survey interview, we can use new methods to mine the tens of millions of political opinions expressed daily in published blogs. Instead of studying the effects of context and interactions among people by asking respondents to recall their frequency and nature of social contacts, we now have the ability to obtain a continuous record of all phone calls, emails, text messages, and in-person contacts among a much larger group. In place of dubious or nonexistent governmental statistics to study economic development or population spread in Africa, we can use satellite pictures of human-generated light at night or networks of roads and other infrastructure measured from space during the day. The number, extent, and variety of questions we can address are considerable and increasing fast.

If we can tackle the substantial privacy issues, build more powerful and more widely applicable theories with observable implications in these new forms of data, help create informatics techniques to ensure that the data are accessible and preserved, and develop new statistical methods adapted to the new types of data, political science can make more dramatic progress than ever before. The challenge before us as a profession, before each of us as researchers, and before the broader community of social scientists, is to prepare for the collection and analysis of these new data sources, to unlock the secrets they hold, and to use this new information to better understand and ameliorate the major problems that affect society and the well-being of human populations.

original PDF version

Posted by Gary King at 8:05 AM