15 March 2014
Judging from the number of tweets with "big data hubris" (I'll admit the irony of my using this metric), our paper on Google Flu Trends has gotten a bit of a buzz. There is one small point I would like to elaborate on. A few people have suggested that perhaps GFT is right and the CDC data are wrong. In our analysis/discussion we are not assuming that the CDC data are "right" (indeed, in a trivial sense they must be wrong, and the statistical question is, generally, how wrong they are). However, GFT is built on top of CDC data; technically, it's not a predictive model of flu prevalence, it's a predictive model of future CDC reports about the present. If the CDC data have warts, then GFT, if it is working well, will fit those warts. If the CDC data underestimate flu prevalence in certain regions, say, then GFT will as well.
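A toy simulation makes the point concrete. This is not GFT's actual model; it's a hypothetical sketch in which the "true" prevalence, the biased reports, and the search proxy are all invented for illustration. A model fit to reports that systematically under-count will match those reports well while reproducing their under-count relative to the truth:

```python
# Hypothetical simulation (not GFT's actual methodology): a proxy model
# trained to predict a biased surveillance signal inherits that bias.
import random

random.seed(0)

# True flu prevalence (unobserved in practice) over 200 weeks.
truth = [5 + 3 * random.random() for _ in range(200)]

# "CDC-style" reports that systematically under-count by 20% (a wart).
reports = [0.8 * t for t in truth]

# A search-volume proxy that actually tracks the truth closely.
proxy = [t + random.gauss(0, 0.1) for t in truth]

# Fit proxy -> reports by least squares (slope through the origin).
slope = sum(p * r for p, r in zip(proxy, reports)) / sum(p * p for p in proxy)
predictions = [slope * p for p in proxy]

# The fitted model matches the reports it was trained on...
err_vs_reports = sum(abs(y - r) for y, r in zip(predictions, reports)) / len(reports)
# ...but reproduces their under-count relative to the truth.
err_vs_truth = sum(abs(y - t) for y, t in zip(predictions, truth)) / len(truth)

print(err_vs_reports, err_vs_truth)
```

The fitted slope lands near 0.8, so the model faithfully learns the reports, warts and all: its error against the reports is small, while its error against the (simulated) truth stays large.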
As we note in the paper, the interpretation would be different for a methodology that directly aimed to measure flu prevalence. In that case, if there were a deviation, one would have to make an assessment as to which method was more likely to be accurate.
Of course, there are a few minor caveats to this. For example, if for some exogenous reason the CDC data were to drop dramatically in quality at a certain point in time (say, if funding for data collection were slashed), then there could be an argument that an approach such as GFT would, for a period at least, be more accurate than the CDC data (since GFT would have been fit to the previously higher-quality CDC data). But I don't think we have any reason to believe that this is currently the case.
Posted by David Lazer at March 15, 2014 8:11 AM