22 February 2007
Let me follow up on yesterday’s post by Jim Greiner.
Jim’s problem: He’s touring the country touting tools for increased honesty in applied statistical research, only to be asked, effectively, for recommendations about using these tools to cheat more effectively. Yay academic job market.
Jim’s example goes like this: An analyst is asked to model the effect of a treatment, T, on the outcome, Y, while controlling for a bunch of confounders, X. To minimize the potential for data dredging we give the analyst only the treatment and the observed potential confounders to model the treatment assignment process, but we withhold the outcome data. Only after the analyst announces success in balancing the data (by including X, functions of X,f(X), deleting off-support observations etc), would we communicate the outcome data, plug the outcome in the equation, run it once, and be done.
So how can we help Jim help his audience cheat? Let’s make two assumptions (which I’d be willing to defend with my life). First, although the analyst is not given the actual outcome data, the analyst does know what the outcome is (wages, say). Second, the analyst is permitted to drop elements of X from the analysis, based on his or her analytic judgment.
Now let’s cheat. First, select the covariate, C, from the pool of potential confounders, X, believed to correlate most strongly with the outcome, Y. Second, treat C as the outcome and build a model through data dredging to maximize (or minimize, if this is your objective) the “effect” of T on C. Specifically, find the subset of functions of X, S(f(X)), that maximizes the effect of T on C while maintaining balance in S(f(X)). Third, upon receiving the outcome data, just plug them into the model but “forget” to mention that you didn’t include C in the treatment assignment model. If C really correlates strongly with Y then this procedure should lead to an upwardly biased estimate of T on Y.
I fear that this would work well in practice (though one could construct a counterexample). Seems to me, however, that it would be more technically demanding to cheat in this way than to cheat in, say, standard regression analysis.