27 January 2008
FedSpending.org has been on my list of new data resources to check out for a while now. This is a project of budget watchdog OMB Watch that brings government data on contracts and awards together in a single searchable database at www.fedspending.org. After looking into it this week I can report that it looks like a very useful resource, interesting not only for the data it provides but as yet another example of the phenomenon of private groups repackaging government data for public use in the name of transparency and accountability.
First, on the data: you can use FedSpending.org to get pretty good detail on contracts handed out by the federal government, including the contracting agency, the company to which the contract was awarded, the size and purpose of the contract, the location of the company, and the location where the contract was carried out. You can subset it in interesting ways, searching by congressional district or by contractor. You can even specify contractor characteristics. For example, this search will return contracts awarded to minority-owned businesses in Massachusetts. Finally, you can output search results in various formats (including CSV), and there is even an API that allows you to generate your queries programmatically so that, with a little bit of code to parse the XML, you could systematically build an interesting dataset without spending time clicking and downloading by hand.
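To give a flavor of that last step, here is a minimal Python sketch of parsing an XML response into rows you could stack into a dataset. The endpoint URL, query parameters, and tag names below are my own illustrative assumptions, not the documented schema, so check the site's API documentation before relying on them:

```python
import xml.etree.ElementTree as ET

def parse_contracts(xml_text):
    """Parse a FedSpending-style XML response into a list of dicts.
    The <record> element and field tag names here are assumptions --
    adjust them to match the actual API output."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.iter("record"):
        rows.append({
            "agency": rec.findtext("agency", default=""),
            "contractor": rec.findtext("contractor", default=""),
            "dollars": float(rec.findtext("dollars", default="0")),
        })
    return rows

# A real query might look like this (parameters are illustrative only):
# import urllib.request
# xml_text = urllib.request.urlopen(
#     "http://www.fedspending.org/fpds/fpds.php?state=MA&datype=X").read()

# A hand-written response in the assumed format, for demonstration:
sample = """<results>
  <record><agency>DOD</agency><contractor>Acme Corp</contractor><dollars>125000</dollars></record>
  <record><agency>HHS</agency><contractor>Beta LLC</contractor><dollars>40000</dollars></record>
</results>"""

rows = parse_contracts(sample)
print(len(rows), rows[0]["contractor"])
```

Looping such a query over, say, all states or all congressional districts and concatenating the rows would build the dataset without any clicking by hand.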
I view this site as part of a phenomenon where data released by the government is being repackaged and publicly released in a useful format by private actors. Another intriguing example is GovTrack, a project built by a Princeton undergrad several years ago (he's now a linguistics grad student) that parses the congressional record and a bunch of other public information and provides email notifications so that you can track any legislator or issue. A more professional recent attempt to bring transparency to Capitol Hill is OpenCongress.
I'm interested in whether academics and other researchers find useful ways to exploit these new resources. I suspect the pretty presentation that these sites provide will be of limited use to researchers, who are used to grappling with the raw data, but other aspects of cleaning up the data for public consumption (for example, tagging elements of the congressional record based on which members are mentioned) seem like they could save us some steps. And the reach and effectiveness of this kind of transparency effort is itself a topic worthy of research in my view.
One of the interesting things about FedSpending is that last month OMB released its own website providing public access to data on federal contracts and awards, at USASpending.gov. On the surface, USASpending looks pretty different from FedSpending -- the government site's design is clean and patriotic, prominently featuring a US flag and the White House facade, while FedSpending's design stays true to the site's watchdog roots with a green color scheme and a washed-out dollar bill as a header. But look more closely and you find that USASpending is basically a clone of FedSpending, identical down to the examples in the documentation. It turns out that a piece of 2006 legislation (the bill co-sponsored by Obama that was held up by the notorious "secret hold") required OMB to build a website to publicly disclose data on federal contracts and awards. But at the time OMB Watch was already almost done with a site of its own that would do essentially everything specified in the legislation. So in a bizarre twist, OMB decided to hire OMB Watch (which, as the name suggests, usually makes its bread by criticizing OMB) to lend its technology to the government project. Based on the outcome, it looks like OMB is basically running a clone of the OMB Watch site. So much for legislating transparency.
Incidentally, can anyone use FedSpending (or USASpending) to find the contract in which OMB hired OMB Watch? I couldn't. The Washington Post article where I learned about the arrangement says there was an intermediary contractor, and the contract appears not to have been sourced from OMB, so I wasn't able to locate it either by the contractor or the contracting agency. Oh well.
The applied statistics workshop returns this Wednesday, January 30. We’ll have David Nickerson, Department of Political Science, University of Notre Dame, presenting “How (and how not) to Study Voter Registration Experimentally”.
The workshop will convene at 12 noon with a light lunch served. The presentation will begin at 12:15. We are located in CGIS Knafel (1737 Cambridge St), room N-354. We hope to see you there.
Any questions, comments, or concerns? Please send me an email—(firstname.lastname@example.org)
9 January 2008
New Hampshire voted last night, and managed to set off another frenzy of introspection among pollsters and pundits. On the Democratic side, public polls released after Iowa showed Obama leading Clinton by an average of about 10 points, but in the end Clinton of course edged out a narrow victory. The polls were much closer on the Republican side, but the "miss" on the Democratic side has already produced much concern about "New Hampshire's Polling Fiasco". Perhaps the witch-hunt that ensues whenever polls appear to be inaccurate in a major election should be viewed as a positive sign about the acceptance of survey research in the media and electorate; at the very least, these kinds of things keep a fair number of our colleagues gainfully employed. From my perspective, it would have been nice to have polls that were more consistent with the eventual outcome since we were planning to use them as examples in an undergraduate class; they will still be examples, but now the focus will be more on total survey error.
Why did the poll results diverge from the outcome? Several hypotheses are floating around. Jon Krosnick from Stanford has an opinion piece pointing to the ballot order effect; Hillary Clinton won the random draw to end up near the top of the ballot. There is certainly a lot of evidence that ballot-order effects matter, but my sense of the literature is that these effects tend to be smaller for better-known candidates, and it is hard to imagine a candidate better known than Hillary Clinton. Dan Ho and Kosuke Imai have written two articles on elections in California that take advantage of randomization to estimate ballot-order effects.
More comment has focused on the possibility that Obama suffered from the "Bradley effect", in which some white voters say that they will support a black candidate when responding to poll questions but end up voting for a white candidate at the ballot box. There is not much academic literature on this supposed effect; here is a Pew Research Center note from last year; ironically, it is titled "Can you trust what polls say about Obama's electoral prospects?"
Finally, many observers have pointed to political prediction markets as either a supplement or alternative to traditional polls for predicting election outcomes, on the idea that these can incorporate other sources of information and require participants to put their money where their mouth is. They didn't do so well, either, as John Tierney notes on his New York Times blog, although the market prices did begin to move during the day. There is an interesting research agenda regarding the relative merits of polls and markets (and how markets integrate the information from various polls); Bob Erikson and Justin Wolfers, who are leading contributors to this literature, have an interesting exchange on this question on Andrew Gelman's blog (posted a week before New Hampshire, but even more interesting today).
4 January 2008
James Fowler sent the following message to the Polmeth list, regarding a conference that we will apparently be hosting in June that may be of interest:
The study of networks has exploded over the last decade, both in the social and hard sciences. From sociology to biology, there has been a paradigm shift from a focus on the units of the system to the relationships among those units. Despite a tradition incorporating network ideas dating back at least 70 years, political science has been largely left out of this recent creative surge. This has begun to change, as witnessed, for example, by an exponential increase in network-related research presented at the major disciplinary conferences.
We therefore announce an open call for paper proposals for presentation at a conference on "Networks in Political Science" (NIPS), aimed at _all_ of the subdisciplines of political science. NIPS is supported by the National Science Foundation, and sponsored by the Program on Networked Governance at Harvard University.
The conference will take place June 13-14. Preceding the conference will be a series of workshops introducing existing substantive areas of research, statistical methods (and software packages) for dealing with the distinctive dependencies of network data, and network visualization. There will be a $50 conference fee. Limited funding will be available to defray the costs of attendance for doctoral students and recent (post 2005) PhDs. Funding may be available for graduate students not presenting papers, but preference will be given to students using network analysis in their dissertations. Women and minorities are especially encouraged to apply.
The deadline for submitting a paper proposal is March 1, 2008. Proposals should include a title and a one-paragraph abstract. Graduate students and recent Ph.D.'s applying for funding should also include their CV, a letter of support from their advisor, and a brief statement about their intended use of network analysis. Send them to email@example.com. The final program will be available at www.ksg.harvard.edu/netgov.
3 January 2008
The presidential caucuses in Iowa will be held tonight, giving us our first "official" measure of popular support for the candidates in each party. We've had lots of "unofficial" measures from polls taken over the past few months - over 50 polls taken since Labor Day are posted on pollster.com - but polling to predict the outcome of the caucuses (as opposed to polling designed to measure overall support for the candidates) presents a number of difficult problems.
The first of these problems, and the one that has received the most attention, is identifying likely caucus participants from the sample of respondents. The Iowa caucuses require a significant time commitment (on the order of two to three hours) in order to participate, and turnout has historically been much lower than in the New Hampshire primary, to say nothing of a general election. Identifying likely voters is a key challenge for any poll, but the low turnout levels make survey results unusually sensitive to the screening assumptions used. The recent Des Moines Register poll showing Obama at 32% ahead of Clinton at 25% prompted a great deal of discussion on this topic. These estimates were produced from a screen that implied that 60% of participants tonight would be first-time caucus-goers and only 54% would be registered as Democrats before the caucus. While these could be valid estimates (we will soon see), this would be a very different population of caucus participants than we have seen in the past. For more discussion of the screening issue, see this post at Mystery Pollster, which is one of the best sites for coverage of the various polling-related issues in the current campaign.
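To see why the screen matters so much, think of the topline as a turnout-weighted average over subgroups. A minimal sketch, with preference numbers that are purely illustrative (not taken from the Register poll or any other):

```python
def topline(share_first_time, pref_first_time, pref_veteran):
    """Overall support as a turnout-weighted mix of two subgroups.
    share_first_time: assumed fraction of caucus-goers who are first-timers.
    pref_*: the candidate's support within each subgroup."""
    return share_first_time * pref_first_time + (1 - share_first_time) * pref_veteran

# Suppose a candidate stands at 40% among first-time caucus-goers
# and 25% among past participants (invented numbers).  The topline
# moves several points as the screening assumption changes:
for share in (0.30, 0.45, 0.60):
    print(f"{share:.0%} first-timers -> topline {topline(share, 0.40, 0.25):.1%}")
```

With the same underlying subgroup preferences, moving the assumed first-timer share from 30% to 60% shifts the reported topline by about four and a half points, which is larger than the margin in many of these polls.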
On the Republican side, identifying likely caucus-goers is the main methodological problem. Iowa Republicans basically take a straw poll and report the results, so a random sample of participants (if they could be identified without error and did not change their minds) would produce an unbiased estimate of the outcome. Things are much more complicated on the Democratic side (their caucus guide is thirteen pages long); caucus participants first break into preference groups for each candidate. After this is accomplished, there is the opportunity for realignment subject to a "viability threshold" which varies from precinct to precinct but is always at least 15%. Pollsters may attempt to model this by asking respondents for their second choice and reallocating those respondents supporting candidates estimated to be below the threshold on a statewide basis. This, of course, assumes a uniform distribution of support across the state, which may or may not be a reasonable assumption (or may be more reasonable for some candidates than for others). If support for Kucinich, hypothetically, is concentrated in Ames and Iowa City, then he may be viable in the precincts where most of his support is found despite being well below the threshold statewide.
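The pollsters' statewide shortcut can be sketched in a few lines. This implements the reallocation described above under the same simplifying (and questionable) assumption of uniform statewide support; the candidates, shares, and second-choice splits are invented for illustration:

```python
def reallocate(first_choice, second_choice, threshold=0.15):
    """Statewide second-choice reallocation under a viability threshold.
    Supporters of any candidate below the threshold move to their second
    choice; supporters whose second choice is also non-viable are dropped.
    first_choice: {candidate: share of respondents}
    second_choice: {candidate: {second-choice candidate: share}}"""
    viable = {c for c, s in first_choice.items() if s >= threshold}
    out = {c: (s if c in viable else 0.0) for c, s in first_choice.items()}
    for c, s in first_choice.items():
        if c in viable:
            continue
        for alt, frac in second_choice.get(c, {}).items():
            if alt in viable:
                out[alt] += s * frac
    return out

# Invented example: candidate D is below 15% statewide, so D's
# supporters are split among the viable candidates.
first = {"A": 0.35, "B": 0.30, "C": 0.25, "D": 0.10}
second = {"D": {"A": 0.5, "B": 0.3, "C": 0.2}}
result = reallocate(first, second)
print(result)
```

The Kucinich point is exactly what this shortcut misses: applying one statewide threshold zeroes out a candidate who might clear 15% in particular precincts where his support is concentrated.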
Finally, the Democrats in Iowa do not act or report results on a "one-person-one-vote" basis. The precinct caucuses elect delegates pledged to each of the candidates, and the number of delegates is based on the historical support for Democratic candidates in that precinct, not the number of people who participate in the caucuses. The state party then takes these results and calculates the "state delegate equivalent" share. The raw vote totals are available to the state party (which would be the closest to the parameter that the pre-caucus polling is trying to estimate) but the party does not release those results to the media (an op-ed criticizing this practice appeared in the New York Times last month). To my knowledge, none of the groups polling in Iowa attempts to take this weighting into account. The degree to which this causes the reported results to diverge from the raw votes will depend largely on the degree to which turnout in the caucuses diverges from historical Democratic turnout; if the Register poll is correct, this divergence could be quite large (and Obama supporters might not be too happy).
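The weighting scheme is easy to mimic with toy numbers. A rough sketch (the party's actual apportionment and rounding rules are more involved, and the delegate counts and vote totals below are invented):

```python
def delegate_shares(precinct_votes, precinct_delegates):
    """Delegate-equivalent shares: each precinct's delegate count is fixed
    in advance (based on historical Democratic support there) and split by
    within-precinct vote share, so a turnout surge in one precinct adds
    no delegates to that precinct.  Rounding rules are ignored."""
    totals = {}
    for precinct, votes in precinct_votes.items():
        n = sum(votes.values())
        d = precinct_delegates[precinct]
        for cand, v in votes.items():
            totals[cand] = totals.get(cand, 0.0) + d * v / n
    grand = sum(totals.values())
    return {c: t / grand for c, t in totals.items()}

# Invented example: P1 has a huge turnout surge favoring A but few
# delegates; P2 has low turnout but many delegates from past support.
precinct_votes = {"P1": {"A": 800, "B": 200}, "P2": {"A": 100, "B": 150}}
precinct_delegates = {"P1": 4, "P2": 6}

shares = delegate_shares(precinct_votes, precinct_delegates)
raw_a = (800 + 100) / (1000 + 250)
print(f"A's raw vote share: {raw_a:.0%}, delegate share: {shares['A']:.0%}")
```

In this toy example A wins 72% of the raw votes but only 56% of the delegate equivalents, which is the sense in which a turnout-driven candidate's reported result can lag the raw count.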
To sum up, polling the Iowa caucuses in order to predict the outcome for the Democrats is a serious problem: the population is hard to define, preferences are likely to be unusually malleable since the party rules require some participants to change their votes, and the results that are reported are not the quantity that would be estimated by a simple random sample of participants. If the polls appear to have gotten it wrong, it will be hard to parse out which of these factors (in addition to the normal sources of bias in any survey) were the main contributors.