23 October 2009
During a recent conversation with some colleagues regarding data sources, an interesting point was made that left me pondering. One member of our group stated that he would not trust a particular source of data to provide useful estimates of population means, but he would trust it to estimate regression coefficients. This puzzled me, because a regression coefficient is a (perhaps slightly fancy) version of a mean. Why, then, would a data source that cannot be trusted for a simple average be useful for a coefficient?
I think the answer lies in the assumed source of randomness. When we make inferences from our sample data to a wider universe of cases, there are two sources of randomness involved: probabilities introduced through the sampling design and probabilities introduced through an assumed stochastic model underlying our observed data. In the first case, we are interested in the existing finite population and our outcome of interest Y is regarded as fixed; randomness is introduced through the sample inclusion probabilities. In the second case, we are interested in a broader "superpopulation" which we posit is generated through some random process, and thus our outcome Y is regarded as a random variable. In much of social science, researchers are interested in this second source of randomness. Hypotheses center on parameters of the probability distribution for Y, such as regression coefficients.
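To make the distinction concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the linear model, the slope `beta = 2.0`, the noise level, the population size, and simple random sampling without replacement as the design. The first loop holds Y fixed and varies only which units are sampled (design-based randomness); the second takes a full census each time but redraws Y from the model (model-based randomness).

```python
import random
import statistics

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx = statistics.fmean(x)
    my = statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
N, n, reps = 2000, 100, 300   # population size, sample size, replications (all hypothetical)
beta = 2.0                    # assumed superpopulation slope

# Covariate values are held fixed throughout.
x = [rng.uniform(0, 10) for _ in range(N)]

def draw_population():
    """Generate one finite population of Y values from the assumed model."""
    return [1.0 + beta * xi + rng.gauss(0, 3) for xi in x]

# Design-based view: Y is fixed; only the sample changes between replications.
y_fixed = draw_population()
census_slope = ols_slope(x, y_fixed)
design_slopes = []
for _ in range(reps):
    idx = rng.sample(range(N), n)  # simple random sample without replacement
    design_slopes.append(ols_slope([x[i] for i in idx], [y_fixed[i] for i in idx]))

# Model-based view: every unit is observed (a census); Y itself is redrawn.
model_slopes = [ols_slope(x, draw_population()) for _ in range(reps)]

design_center = statistics.fmean(design_slopes)
model_center = statistics.fmean(model_slopes)

print("census slope for the fixed population:", round(census_slope, 3))
print("mean of sample slopes (design-based): ", round(design_center, 3))
print("mean of census slopes (model-based):  ", round(model_center, 3))
```

In this setup the sample slopes scatter around the census slope of the one fixed population, while the census slopes scatter around the model parameter `beta`. The two views answer different questions about different targets.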
Identifying the sources of randomness underlying our data is important, because they have implications for our analysis. Särndal, Swensson, and Wretman show that the variance of a parameter estimate from an ordinary regression model fitted to sample data can be decomposed into two components, one due to the sampling design and one due to the model. In the case of a census, the extra variance introduced by the design is zero, and thus the total variance of the estimated parameter is the variance of the "BLUE" estimator. With anything short of a census, the design component is not zero, so accounting for the sampling design in the analysis should improve inference.
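One way to write a decomposition of this kind is through the law of total variance. The notation here is my own shorthand and not necessarily the authors': ξ denotes the model, p the sampling design, B the coefficient that would be obtained by fitting the regression to the whole finite population, and B-hat its estimate from the sample.

```latex
\operatorname{Var}_{\xi p}\bigl(\hat{B}\bigr)
  = \underbrace{E_{\xi}\!\left[\operatorname{Var}_{p}\bigl(\hat{B} \mid Y\bigr)\right]}_{\text{design component}}
  + \underbrace{\operatorname{Var}_{\xi}\!\left[E_{p}\bigl(\hat{B} \mid Y\bigr)\right]}_{\text{model component}}
```

If the sample estimate is (approximately) design-unbiased for B, the second term is approximately the model variance of B itself. In a census, the sample is the population, so the inner design variance in the first term is zero and only the model component remains.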