23 May 2009
How would you expect swine flu cases to be distributed by weekday? More specifically, would you expect more cases on weekdays or on weekends? My first reaction is that there will be more cases where there are more social gatherings.
Following this logic, the argument for more weekday cases is that susceptible people have more contact with infected people on weekdays, through school, work, etc. In addition, people are more likely to travel on weekends, which means more contact with infected individuals while traveling; but because it takes around two days for the virus to produce symptoms, those cases will not be identified until a couple of days later. Could the pattern also reflect the fact that fewer clinical services are available on weekends, and that people are less likely to visit clinics then?
Here is an old graph I made based on the swine flu updates (04/26/2009 - 05/21/2009) published on the WHO website. To be more accurate, I drew a new graph using the number of newly confirmed cases rather than the cumulative number of confirmed cases.
As the reporting times for newly confirmed cases vary (some at 18:00, others at 6:00, etc.), I kept only the records between 05/01 and 05/21 whose reporting time is 6:00 and redrew the graph. Weekdays are redefined accordingly: for example, Thursday 6:00 to Friday 6:00 counts as Thursday. Can you still see any salient patterns, such as a differential between weekdays and weekends? And why is Friday so spiky this time?
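The two processing steps described above, differencing cumulative totals into daily new cases and labeling each 6:00-to-6:00 window by the weekday on which it starts, can be sketched as follows. This is a minimal illustration in Python; the report times are real dates from the period discussed, but the case counts are made-up stand-ins for the actual WHO figures.

```python
from datetime import datetime

# Hypothetical cumulative confirmed-case totals keyed by WHO report
# time (all reports at 6:00), standing in for the actual update history.
reports = [
    (datetime(2009, 5, 14, 6), 6497),
    (datetime(2009, 5, 15, 6), 7520),
    (datetime(2009, 5, 16, 6), 8451),
]

def daily_new_cases(reports):
    """Difference successive cumulative totals and label each
    6:00-to-6:00 window by the weekday on which it starts."""
    out = []
    for (t0, c0), (t1, c1) in zip(reports, reports[1:]):
        # The window t0..t1 is attributed to t0's weekday, so
        # Thursday 6:00 - Friday 6:00 counts as Thursday.
        out.append((t0.strftime("%A"), c1 - c0))
    return out

print(daily_new_cases(reports))
```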
20 May 2009
A few weeks ago my friend Aaron Swartz wrote a blog post called Transparency is Bunk, arguing that government transparency websites don't do what they're supposed to do, and in fact have perversely negative effects: they bury the real story in oceans of pro forma data, encourage apathy by revealing "the mindnumbing universality of waste and corruption," and lull activists into a false sense of accomplishment when occasional successes occur. It's a particularly powerful piece because Aaron uses the platform to announce he's done working on his own government website (watchdog.net). The piece appears to have caused a stir in government transparency/hacktivist circles, where Aaron is pretty well known.
Looking back at it, I think Aaron's argument (or, more accurately, rant) against the transparency websites is not very strong: data overload, apathy, and complacency are indeed dangers these efforts face, but that shouldn't have come as a surprise.
I had two other responses particular to my perch in academia. First, there is some good academic research showing that transparency works, although the evidence on the effectiveness of grassroots watchdogging is less strong than the evidence on auditing from e.g. Ferraz and Finan on Brazilian municipalities (QJE 2008, working paper version) or Olken's field experiment in Indonesia (JPE 2008, working paper version).
Second, my own work and that of other academics benefits greatly from these websites. I have a project right now on the investments of members of Congress (joint with Jens Hainmueller) that is possible only because of websites like the ones Aaron criticizes. I think this paper is going to be useful in helping watchdogs understand how Congress invests and whether additional regulation is a good idea, and it would be a shame if the funders of these sites listened to Aaron and shut them down.
I do agree with Aaron that professional analysis may be better than grassroots citizen activism in achieving the goals of the transparency movement. Sticking with the example of the congressional stock trading data I'm using, I suspect that not much useful watchdogging came out of the web interface that OpenSecrets provides for the investments data. While it may be interesting to know that Nancy Pelosi owns stock in company X, it's hard to get any sense of patterns of ownership across members and how these investments relate to political relationships between members and companies. This is what our paper tries to do. It takes a ton of work, far more than an investigative journalist is going to put in. We do it because of the rewards of publishing interesting and original and careful research, and also because these transparency websites have made it much more manageable: OpenSecrets.org converted the scanned disclosure forms into a database and provided lobbying data, and GovTrack provided committee and bill info, as well as an API linking company addresses to congressional districts. Most of the excitement around these websites seems to center on grassroots citizen activism, but their value to academic research (and the value of academic research to government accountability) should not be overlooked.
15 May 2009
In late April, the FAA released the long-awaited bird strike data. It shows every recorded bird strike since 2000.
But as far as I can tell, the stories seem to have forgotten about the denominator: the number of flights, which has been increasing just as rapidly since 2000.
To test this, I went and pulled commercial flight totals from a public BTS data set from 2003 on. Then I limited the bird data to the same airports, months, years and carriers as appear in the BTS flight data. Then I divided bird strikes by flights, and presto: the strike rate has been flat since 2003.
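The mashup amounts to a join-and-divide: restrict the strike records to cells (airport, year, carrier) that also appear in the flight counts, then divide. A minimal sketch in Python, with toy numbers standing in for the FAA strike database and the BTS flight totals:

```python
from collections import Counter

# Toy stand-ins for the two data sets (the real ones are the FAA
# wildlife-strike database and the BTS commercial flight counts).
strikes = [  # one record per reported strike
    ("DEN", 2003, "Frontier"), ("DEN", 2003, "Frontier"),
    ("JFK", 2003, "JetBlue"),
]
flights = {  # (airport, year, carrier) -> number of flights flown
    ("DEN", 2003, "Frontier"): 2000,
    ("JFK", 2003, "JetBlue"): 5000,
}

def strike_rates(strikes, flights):
    """Strikes per 10,000 flights, keeping only cells that appear
    in both data sets so the denominator matches the numerator."""
    counts = Counter(s for s in strikes if s in flights)
    return {k: 10_000 * n / flights[k] for k, n in counts.items()}

print(strike_rates(strikes, flights))
```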
But the most interesting part of this mashup was breaking down the figures by carrier. Do some airlines strike birds more than others? The answer appears to be "Yes."
At the top of the pile are Frontier, United and JetBlue. Frontier Airlines has a staggering 9.4 strikes per 10,000 flights, compared to the industry average of 4.0. Now, a good statistician (or a Frontier executive) would wonder about confounders. Frontier's Denver hub is the most bird-prone major airport in the U.S., with 7.8 strikes per 10,000 flights. Here's the breakdown by airport for the top 34 airports in the U.S.
To try to account for all the possible confounders, I fit a Poisson regression modeling strike rate using the following covariates:
Since the data have 100,000 rows and hundreds of columns after expanding all the categorical covariates, I used bigglm in R to fit the model. The operator coefficients (actually, exp(coefficient)) are shown below. 1.0 corresponds to the "base" rate -- the strike rate you would expect given the airline's flight history of airports, years and months. The "winners" -- Hawaiian, United and Frontier -- all have values above 2, which means their strike rates are more than double what you would expect from an average airline flying their schedule.
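The exp(coefficient) reading can be illustrated with a toy two-airline calculation. This is a minimal sketch in Python with made-up numbers, not the actual fit (which used bigglm in R and adjusted for airport, year, and month): in a Poisson model where strikes ~ Poisson(flights * exp(b0 + b1 * airline)), the maximum-likelihood estimate of exp(b1) in the two-group case is simply the ratio of raw strike rates.

```python
import math

# Made-up counts for a reference airline and an airline of interest.
base  = {"strikes": 40, "flights": 100_000}
other = {"strikes": 90, "flights": 100_000}

# Raw strike rates (strikes per flight) for each group.
rate0 = base["strikes"] / base["flights"]
rate1 = other["strikes"] / other["flights"]

# The fitted airline coefficient; exp(b1) is the rate ratio, i.e.
# how many times the reference strike rate this airline experiences.
b1 = math.log(rate1 / rate0)
print(round(math.exp(b1), 2))  # 2.25: more than double the base rate
```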
So why do some airlines strike more birds than others, even after accounting for airport and month? Possibilities include differing planes, pilots or maintenance crews.
13 May 2009
The social sciences have long embraced the idea of text-as-data, but in recent years, increasing numbers of quantitative researchers are investigating how to have computers find answers to questions in texts. This task might appear easy at the outset (as it apparently did to early researchers in machine translation), but, as we know, natural languages are incredibly complicated. In most of the applications in social science, analysts end up making a "bag of words" assumption--the relevant part of a document is the actual words, not their order (this is not an unreasonable assumption, especially given the questions being asked).
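The bag-of-words assumption is easy to make concrete: reduce each document to a multiset of word counts, so two documents with the same words in different orders become indistinguishable. A minimal sketch in Python (the tokenizer here is a deliberately crude illustration, not any particular research pipeline):

```python
from collections import Counter
import re

def bag_of_words(doc):
    """Reduce a document to word counts, discarding word order --
    the 'bag of words' representation."""
    return Counter(re.findall(r"[a-z']+", doc.lower()))

# Word order carries the meaning here, but the bags are identical.
a = bag_of_words("the senator opposed the bill")
b = bag_of_words("the bill opposed the senator")
print(a == b)  # True: only the words and their counts remain
```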
When I see applications of natural language processing (NLP) in the social sciences, my thoughts typically turn quickly to its future. Computers are making strides toward being able to understand, in some sense, what they are reading. Two recent articles, however, give a good overview of the challenges that NLP faces. First, John Seabrook of the New Yorker had an article last summer, Hello, Hal, which states the problem clearly:
The first attempts at speech recognition were made in the nineteen-fifties and sixties, when the A.I. pioneers tried to simulate the way the human mind apprehends language. But where do you start? Even a simple concept like "yes" might be expressed in dozens of different ways--including "yes," "ya," "yup," "yeah," "yeayuh," "yeppers," "yessirree," "aye, aye," "mmmhmm," "uh-huh," "sure," "totally," "certainly," "indeed," "affirmative," "fine," "definitely," "you bet," "you betcha," "no problemo," and "okeydoke"--and what's the rule in that?
The article is mostly about speech recognition, but it definitely hits the main points about why human-generated language is so tricky. The second article, recently in the New York Times, is a short piece about Watson, the computer that IBM is creating to compete on Jeopardy! IBM is trying to push the field of question answering quite a bit forward with this challenge. The goal is to create a computer that you can ask a natural language question and get the correct answer. A quick story in the article indicates that they may have a bit to go:
In a demonstration match here at the I.B.M. laboratory against two researchers recently, Watson appeared to be both aggressive and competent, but also made the occasional puzzling blunder.
For example, given the statement, "Bordered by Syria and Israel, this small country is only 135 miles long and 35 miles wide," Watson beat its human competitors by quickly answering, "What is Lebanon?"
Moments later, however, the program stumbled when it decided it had high confidence that a "sheet" was a fruit.
This whole Watson enterprise makes me wonder if there are applications for this kind of technology within the social sciences. Would this only be useful as a research aid, or are there empirical discoveries to be made with this? I suppose it comes down to this: if a computer could answer your question, what would you ask?
10 May 2009
David Brooks wrote a column a few days ago about Will Dobbie and Roland Fryer's working paper on the Harlem Children's Zone charter schools, which the authors report dramatically improved students' performance, particularly in math. Looking at the paper, I think it's a nice example of constructing multiple comparisons to assess the effect of a program and to do some disentangling of mechanisms.
The program they study is enrollment in one of the Promise Academy elementary and middle schools in the Harlem Children's Zone, a set of schools that offer extended class days, provide incentives for teacher and student performance, and emphasize a "culture of achievement." The authors assess the schools' effect on student test scores by comparing the performance of students at the schools with that of other students. The bulk of the paper is concerned with how to define this comparison group of non-enrolled students, and the authors pursue two strategies:
The estimated effect is very large, particularly for math. Because the estimates are based on comparisons both within the HCZ and between HCZ and non-HCZ students, the authors can speculate somewhat about the relative importance of the schooling itself vs other aspects of the HCZ: they tentatively suggest that the community aspects are not driving the results, because students from outside the HCZ who attended the schools did just as well.
Overall I thought it was a nice example of careful comparisons in a non-experimental situation providing useful knowledge. I don't really know this literature, but it seems like a case where good work could have a big impact.