
23 October 2009

Sources of Randomness

During a recent conversation with some colleagues about data sources, one member of our group made a point that left me pondering: he said he would not trust a particular source of data to provide useful estimates of population means, but he would trust it to estimate regression coefficients. This puzzled me, because a regression coefficient is a (perhaps slightly fancy) version of a mean. Why, then, would a data source that cannot be trusted for a simple average be useful for a coefficient?
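
One quick way to see the "fancy mean" point (standard OLS algebra, not something from the conversation itself): with an intercept-only regression the least-squares estimate is exactly the sample mean, and even the simple-regression slope is just a weighted average of the outcomes.

```latex
% Intercept only: minimizing \sum_i (y_i - \beta)^2 over \beta gives
%   \hat{\beta} = \bar{y},
% i.e., the sample mean.
%
% Simple-regression slope: a weighted average of the y_i, with weights
% determined entirely by the x's.
\hat{\beta}_1
  = \frac{\sum_i (x_i - \bar{x})\, y_i}{\sum_i (x_i - \bar{x})^2}
  = \sum_i w_i\, y_i,
\qquad
w_i = \frac{x_i - \bar{x}}{\sum_j (x_j - \bar{x})^2}.
```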

I think the answer lies in the assumed source of randomness. When we make inferences from our sample data to a wider universe of cases, there are two sources of randomness involved: probabilities introduced through the sampling design and probabilities introduced through an assumed stochastic model underlying our observed data. In the first case, we are interested in the existing finite population, and our outcome of interest Y is regarded as fixed; randomness enters only through the sample inclusion probabilities. In the second case, we are interested in a broader "superpopulation" that we posit is generated through some random process, so Y is regarded as a random variable. Much of social science research concerns this second source of randomness: hypotheses center on parameters of the probability distribution for Y, such as regression coefficients.
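
A minimal simulation sketch of the distinction (my own illustration, with made-up parameters): in the design-based view the finite population is fixed once and for all, and repeated estimates vary only because different units end up in the sample; in the model-based view the outcomes themselves are redrawn from the superpopulation model on each replication.

```python
import numpy as np

rng = np.random.default_rng(0)

# Design-based view: the finite population is fixed; the only
# randomness is which n units are included in the sample.
population = rng.normal(loc=10, scale=2, size=10_000)  # drawn once, then fixed
design_means = [rng.choice(population, size=100, replace=False).mean()
                for _ in range(1_000)]

# Model-based view: Y itself is a random variable; each replication
# is a fresh draw from the assumed superpopulation model.
model_means = [rng.normal(loc=10, scale=2, size=100).mean()
               for _ in range(1_000)]

# Both sets of estimates vary, but for different reasons: sample
# inclusion in the first case, the stochastic model in the second.
print("design-based SD of the mean:", np.std(design_means))
print("model-based  SD of the mean:", np.std(model_means))
```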

Identifying the sources of randomness underlying our data is important because they have implications for our analysis. Särndal, Swensson, and Wretman show that the variance of a parameter estimate from an ordinary regression model fit to sample data can be decomposed into two components, one attributable to the sampling design and one to the model. In the case of a census, the extra variance introduced by the design is zero, so the total variance of the estimated parameter is simply the variance of the "BLUE" estimator. Otherwise, accounting for the sampling design in the analysis should improve inference.
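
One way to write such a decomposition (my notation, with ξ indexing the model and p the design; Särndal, Swensson, and Wretman's treatment is more general) is as a law-of-total-variance split, conditioning on the realized finite population:

```latex
\operatorname{Var}_{\xi p}\!\left(\hat{\beta}\right)
  = \underbrace{\operatorname{E}_{\xi}\!\left[\operatorname{Var}_{p}\!\left(\hat{\beta}\mid \mathbf{y}\right)\right]}_{\text{design contribution}}
  + \underbrace{\operatorname{Var}_{\xi}\!\left[\operatorname{E}_{p}\!\left(\hat{\beta}\mid \mathbf{y}\right)\right]}_{\text{model contribution}}
```

Under a census the first term vanishes: conditional on the population there is no sampling variability, and the estimator equals its full-population version, leaving only the model variance of the BLUE.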

Posted by Deirdre Bloome at 5:20 PM