26 May 2011
Google Correlate is a new service from Google that allows you to analyze temporal or spatial correlations between search terms. There could be incredibly interesting data in here for various fields. The methodology from the whitepaper gives some insight into how Google does these correlations at scale:
In our Approximate Nearest Neighbor (ANN) system, we achieve a good balance of precision and speed by using a two-pass hash-based system. In the first pass, we compute an approximate distance from the target series to a hash of each series in our database. In the second pass, we compute the exact distance function on the top results returned from the first pass.
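To make the two-pass idea concrete, here is a toy sketch in plain Python. It is not Google's implementation: the "hash" here is just the mean of each block of points (my stand-in, since the excerpt doesn't say what the real hash is), the distance is squared Euclidean distance on centered, unit-length series (which ranks series the same way Pearson correlation does), and the names `sketch`, `two_pass_search`, `shortlist`, and `block` are all made up for illustration.

```python
import math

def normalize(series):
    """Center a series and scale it to unit length."""
    mean = sum(series) / len(series)
    centered = [x - mean for x in series]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0:
        return [0.0] * len(series)
    return [x / norm for x in centered]

def sketch(series, block=4):
    """Assumed stand-in for the 'hash': the mean of each consecutive block."""
    chunks = [series[i:i + block] for i in range(0, len(series), block)]
    return [sum(chunk) / len(chunk) for chunk in chunks]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_pass_search(target, database, shortlist=3, k=1, block=4):
    """Return the names of the k series closest to target."""
    target_n = normalize(target)
    target_s = sketch(target_n, block)
    normalized = {name: normalize(s) for name, s in database.items()}

    # Pass 1: cheap approximate distance on the short sketches.
    # (A real system would precompute the sketches once, not per query.)
    by_approx = sorted(
        normalized,
        key=lambda name: sq_dist(target_s, sketch(normalized[name], block)),
    )
    candidates = by_approx[:shortlist]

    # Pass 2: exact distance, but only on the shortlisted candidates.
    by_exact = sorted(candidates, key=lambda name: sq_dist(target_n, normalized[name]))
    return by_exact[:k]

# Toy data, invented for this example.
target = [1, 2, 3, 4, 5, 6, 7, 8]
database = {
    "rising": [2, 4, 6, 8, 10, 12, 14, 16],
    "falling": [8, 7, 6, 5, 4, 3, 2, 1],
    "zigzag": [1, 3, 1, 3, 1, 3, 1, 3],
    "step": [0, 0, 0, 0, 1, 1, 1, 1],
}
print(two_pass_search(target, database, shortlist=2, k=1))
```

The trade-off is in `shortlist`: a bigger shortlist means more exact comparisons (slower) but less chance that the first pass throws away the true nearest neighbor.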
I tried my hand with “Barack Obama”, where most of the action comes from the South, the Rust Belt, and the Eastern Seaboard:
Compare that to the map for the search phrase “Barack Hussein Obama”:
Here you see a much more distinct Appalachian pattern emerging. This map looks similar to others that highlight where John McCain in 2008 did better than George Bush in 2004. The kinds of search terms with high spatial correlation with each of the two are also fascinating. For “Barack Obama” you see mostly references to African Americans or African American culture:
For “Barack Hussein Obama”, there is quite the hodgepodge, with references to “obama koran” and “obama the antichrist”:
Obviously, we wouldn’t want to read too much into these comparisons due to the ecological inference problems, but there is a lot to explore here.
25 May 2011
I got an email a few days ago urging me to try www.zanran.com, a search engine "for finding data and statistics". I'm skeptical of some of these search engines (I can never find anything I want on www.rseek.org for example), but I was pleasantly surprised. I typed in "fatwa" and, lo and behold, uncovered a few papers that are of some relevance to the project I'm doing on Islamic fatwas.
My first reaction is that this is really a search engine for finding figures and tables, not necessarily "data" or "statistics". But it turns out that I really like finding new papers by searching their figures. It's much easier to tell if a paper is going to be relevant to my research by looking at figure 4 than by looking at the title.
Searching through papers by figures has had me thinking about the search engine I'd really like to see: one that searches for images. It'll work like this: you take a picture, upload the picture to the search engine and then it uses image similarity algorithms to generate search results ranked by the similarity of the image. This would alleviate the somewhat rare but annoying situation where you have a picture of something but don't know what it is. There's really no way to google that...
"Who was that random guy in our group photo?"
"Hold on, I'm googling 'guy with brown hair'..."
19 May 2011
In response to a comment by Chris Blattman, the Givewell blog has a nice post with "customer feedback" for the social sciences. Number one on the wish-list is pre-registration of studies to fight publication bias -- something along the lines of the NIH registry for clinical trials.
I couldn't agree more. I especially like that Givewell's recommendations go beyond the usual call for RCT registration to suggest that we should also be registering observational studies. If we're dreaming about discipline-wide reforms to enhance the credibility of political science, it would be nice if we had reforms that weren't only applicable to the research that is already most credible.
The most thought-provoking reform idea thrown out by the Givewell blog is this:
As food for thought, imagine a journal that accepted only studies for which results were not yet known. Arguably this journal would be more credible as a source of "well-designed studies addressing worthwhile questions, regardless of their results" as opposed to "studies whose results make the journal editors happy." [Thought experiment round two: how would this journal differ from the APSR?]
I've been trying to think of ways to personally implement the principle of preregistration (short of organizing a registry or starting the above journal). The most obvious thing I can think of is to keep a detailed lab notebook (see discussion by Lupia here). Ideally, it would be public so that I couldn't go back and fudge it -- "Oh, I expected all along that the coefficient would be negative." Or maybe I'd keep it private during the research but somehow make deletions impossible.
Actually, even if I never made this public, taking better notes as I do research could have serious benefits. For one thing, it would be incredibly helpful for mitigating the inevitable bit-rot from letting a project sit for a while. And it's nice to be able to remember how a project actually unfolded. As Fox sagely observes, "It is best...not to fool yourself, regardless of what you think about fooling others" (p. 511, in reference to standard errors).
Perhaps there's actually a market for this kind of thing. Would reviewers look more favorably on papers submitted with a time-stamped preregistration? I guess not, or else at least a few people would be doing it already.
Still, I'm tempted to give public lab notes a whirl myself. Suggestions welcome!
2 May 2011
Every so often, I try to take some time to read something that I should have read ages ago. Tonight's gem was the 2010 draft of "The Industrial Organization of Rebellion: The Logic of Forced Labor and Child Soldiering" by Bernd Beber and Chris Blattman (link). The paper gets a lot of traction out of a formal model and then matches these predictions up to reality using some unique and hard-earned data.
An off-hand comment in the paper caught my attention: Beber and Blattman's assertion that "a single case helps to refine our theory and validate some basic assumptions, but cannot test it" (p. 19). Ordinarily I'd say "sure", except that they are referring to their own nuanced analysis of a large number of child soldiers painstakingly tracked down in Uganda. Moreover, they actually have a credible identification strategy -- abduction into the Lord's Resistance Army was essentially random after conditioning on age and location. I was pretty convinced by the analysis; more so than by the regressions on the novel but dubious rebel dataset they introduce at the end of the paper. Maybe the Uganda child-soldier analysis wasn't a "test" but it sure moved my posterior beliefs about their hypotheses.
So, I wonder what they meant. I guess I can just email them and ask, but that would be no fun. Instead, I'll speculate wildly.
One possibility is that the Uganda data was collected and analyzed prior to the development of the model. Or perhaps they are noting that there isn't any variation in rebel groups, so we obviously can't estimate the effect of some of their important moving parts (motivating the turn to other data). On the other hand, their theory does generate a number of observable implications that they do successfully test (or "validate") with the child-soldiering data. And like I said, it was this evidence that convinced me. I think even they realize this: they spend 8 pages on Uganda and 4 on the cross-rebel group comparisons.
I think the usefulness of single cases is worth pondering, especially since it's often much easier to get the kind of unique data that allows causal identification for a single case!