22 February 2007

Cheating for Honest People

Let me follow up on yesterday’s post by Jim Greiner.

Jim’s problem: He’s touring the country touting tools for increased honesty in applied statistical research, only to be asked, effectively, for recommendations about using these tools to cheat more effectively. Yay academic job market.

Jim’s example goes like this: An analyst is asked to model the effect of a treatment, T, on the outcome, Y, while controlling for a bunch of confounders, X. To minimize the potential for data dredging, we give the analyst only the treatment and the observed potential confounders with which to model the treatment assignment process, and we withhold the outcome data. Only after the analyst announces success in balancing the data (by including X and functions of X, f(X), deleting off-support observations, etc.) do we hand over the outcome data, plug the outcome into the equation, run it once, and be done.
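To make the protocol concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything from Jim's post: the class name `SealedOutcomeStudy`, its methods, the use of weights as the "design," and the six-row toy data set are all assumptions. The idea is only that the analyst can check balance on X as often as she likes, but the outcome stays sealed until the design is frozen, and the outcome analysis runs exactly once.

```python
import numpy as np

class SealedOutcomeStudy:
    """Hypothetical helper: holds the outcome back until a design is frozen."""

    def __init__(self, X, t, y):
        self.X = np.asarray(X, dtype=float)   # confounders: visible to the analyst
        self.t = np.asarray(t, dtype=int)     # treatment: visible to the analyst
        self._y = np.asarray(y, dtype=float)  # outcome: withheld (by convention only)
        self._weights = None
        self._spent = False

    def balance(self, weights):
        """Weighted standardized mean difference for each column of X."""
        w = np.asarray(weights, dtype=float)
        tr, co = self.t == 1, self.t == 0
        m1 = np.average(self.X[tr], axis=0, weights=w[tr])
        m0 = np.average(self.X[co], axis=0, weights=w[co])
        return (m1 - m0) / self.X.std(axis=0)

    def freeze_design(self, weights):
        """Commit to the design weights; allowed only once."""
        if self._weights is not None:
            raise RuntimeError("design already frozen")
        self._weights = np.asarray(weights, dtype=float).copy()

    def estimate_once(self):
        """Weighted difference in mean outcomes; can be run a single time."""
        if self._weights is None:
            raise RuntimeError("freeze the design before asking for outcomes")
        if self._spent:
            raise RuntimeError("the outcome analysis has already been run")
        self._spent = True
        w, tr, co = self._weights, self.t == 1, self.t == 0
        return (np.average(self._y[tr], weights=w[tr])
                - np.average(self._y[co], weights=w[co]))

# Toy data (made up): one binary confounder, outcome built as y = 1 + 2x + 2t.
X = np.array([[0.0], [1.0], [1.0], [0.0], [1.0], [0.0]])
t = np.array([1, 1, 1, 0, 0, 0])
y = np.array([3.0, 5.0, 5.0, 1.0, 3.0, 1.0])

study = SealedOutcomeStudy(X, t, y)
raw = study.balance(np.ones(6))                   # imbalance with no adjustment
weights = np.array([2.0, 1.0, 1.0, 1.0, 2.0, 1.0])
balanced = study.balance(weights)                 # imbalance under the chosen weights
study.freeze_design(weights)
estimate = study.estimate_once()                  # a second call raises RuntimeError
```

Python's leading underscore hides nothing, of course. In a real version of this design the outcome file would have to sit with someone other than the analyst.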

So how can we help Jim help his audience cheat? Let’s make two assumptions (which I’d be willing to defend with my life). First, although the analyst is not given the actual outcome data, the analyst does know what the outcome is (wages, say). Second, the analyst is permitted to drop elements of X from the analysis, based on his or her analytic judgment.

Now let’s cheat. First, select the covariate, C, from the pool of potential confounders, X, that you believe correlates most strongly with the outcome, Y. Second, treat C as the outcome and build a model through data dredging to maximize (or minimize, if this is your objective) the “effect” of T on C. Specifically, find the subset of functions of X, S(f(X)), that maximizes the effect of T on C while maintaining balance in S(f(X)). Third, upon receiving the outcome data, just plug them into the model but “forget” to mention that you didn’t include C in the treatment assignment model. If C really correlates strongly and positively with Y, then this procedure should lead to an upwardly biased estimate of the effect of T on Y.
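Here is a small simulation sketch of the three steps, again in Python. The data-generating process is invented for illustration (T depends only on C, Y depends only on C, so the true effect of T on Y is zero), and I use inverse probability weighting from a logistic propensity model as the balancing method, which is my choice and not something the setup above prescribes. The helper names (`fit_logit`, `ipw_diff`, `design`) and the 0.1 balance threshold are likewise assumptions.

```python
import itertools
import numpy as np

def fit_logit(Z, t, iters=25):
    """Propensity scores from a logistic regression fit by Newton's method."""
    D = np.column_stack([np.ones(len(t)), Z])
    beta = np.zeros(D.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-D @ beta))
        hess = (D * (p * (1.0 - p))[:, None]).T @ D
        beta += np.linalg.solve(hess, D.T @ (t - p))
    return 1.0 / (1.0 + np.exp(-D @ beta))

def ipw_diff(v, t, w):
    """Weighted treated-minus-control difference in the mean of v."""
    return (np.average(v[t == 1], weights=w[t == 1])
            - np.average(v[t == 0], weights=w[t == 0]))

def design(X, t, cols):
    """IPW weights from a propensity model on `cols`, plus the worst
    absolute standardized imbalance among those same columns."""
    p = fit_logit(X[:, cols], t)
    w = np.where(t == 1, 1.0 / p, 1.0 / (1.0 - p))
    worst = max(abs(ipw_diff(X[:, j], t, w)) / X[:, j].std() for j in cols)
    return w, worst

# Invented data: column 0 is C; the true effect of T on Y is zero.
rng = np.random.default_rng(0)
n = 20000
c = rng.normal(size=n)
x1 = 0.5 * c + rng.normal(size=n)
X = np.column_stack([c, x1, rng.normal(size=(n, 2))])
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-c)))
y = 3.0 * c + rng.normal(size=n)

# Steps 1 and 2: treat C as the outcome and dredge over subsets of the other
# covariates, keeping only specifications that look balanced on what they include.
best = None
for k in (1, 2, 3):
    for cols in itertools.combinations((1, 2, 3), k):
        w, worst = design(X, t, list(cols))
        if worst < 0.1:
            pseudo = ipw_diff(c, t, w)
            if best is None or pseudo > best[0]:
                best = (pseudo, cols, w)
pseudo_effect, chosen_cols, w_cheat = best

# Step 3: plug in the real outcome under the dredged design, and compare
# with the design that includes every confounder, C among them.
cheat_estimate = ipw_diff(y, t, w_cheat)
w_honest, _ = design(X, t, [0, 1, 2, 3])
honest_estimate = ipw_diff(y, t, w_honest)
```

Under these made-up assumptions, `cheat_estimate` should come out well above zero while `honest_estimate` should sit near the true value of zero, even though the dredged specification passes its own balance check.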

I fear that this would work well in practice (though one could construct a counterexample). Seems to me, however, that it would be more technically demanding to cheat in this way than to cheat in, say, standard regression analysis.

Posted by Felix Elwert at 6:42 PM