11 March 2008
While the Democratic nomination contest drags on (and on and on...; Tom Hanks declared himself bored with the race last week), attention is turning to hypothetical general election matchups between Hillary Clinton or Barack Obama and John McCain. Mystery Pollster has a post up reporting state-by-state hypothetical matchup numbers from SurveyUSA surveys of 600 registered voters in each state. There is some debate about the quality of the data (SurveyUSA conducts its surveys by Interactive Voice Response, there is no likely voter screen, etc.). But we have what we have.
At this point, the results are primarily of interest to the extent that they speak to the "electability" question on the Democratic side: who is more likely to beat McCain? MP goes through the results state by state, classifying each state as Strong McCain, Lean McCain, Toss-up, etc. From this you can add up the number of electoral votes in each category, which provides some information but isn't exactly what we're interested in.
This problem is a natural one for the application of some simple, naive Bayesian ideas. If we throw on some flat priors, make all sorts of unreasonably strong independence assumptions, and assume that the results were derived from simple random sampling, we can quickly get posterior distributions for the support for each candidate in each state and can calculate estimates of the probability of victory in each state. From there, it is easy to calculate the posterior distribution of the number of electoral votes for each candidate and find the posterior probability that Obama beats McCain, that Clinton beats McCain, or that Obama would receive more electoral votes than Clinton.
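To make the setup concrete, here is one way the model could be written down. The notation is mine and purely illustrative: $y_s$ holds the counts of respondents in state $s$ choosing the Democrat, McCain, or neither; $n_s$ is the sample size; $v_s$ is the state's electoral votes. I am also assuming winner-take-all allocation in every state and taking 270 (a majority of the 538 total) as the winning threshold.

```latex
\begin{align*}
\theta_s = (\theta_{sD}, \theta_{sR}, \theta_{sU}) &\sim \mathrm{Dirichlet}(1, 1, 1)
  && \text{(flat prior)} \\
y_s = (y_{sD}, y_{sR}, y_{sU}) \mid \theta_s &\sim \mathrm{Multinomial}(n_s, \theta_s)
  && \text{(simple random sampling)} \\
\theta_s \mid y_s &\sim \mathrm{Dirichlet}(y_{sD} + 1,\; y_{sR} + 1,\; y_{sU} + 1)
  && \text{(independently across states)} \\
E_D &= \sum_{s} v_s \, \mathbf{1}\{\theta_{sD} > \theta_{sR}\}
  && \text{(Democrat's electoral votes)} \\
\Pr(\text{Democrat wins} \mid y) &= \Pr(E_D \ge 270 \mid y)
\end{align*}
```

Because $E_D$ is just a function of the $\theta_s$, its posterior distribution comes for free once we can draw from each state's Dirichlet posterior.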
While I was sitting around at lunch yesterday, I ran a very quick analysis using the reported SurveyUSA marginals. Essentially, I took samples from 50 independent Dirichlet posteriors for both hypothetical matchups, assuming a flat prior and multinomial sampling density (to allow for undecideds); to avoid dealing with the posterior predictive distributions, I'm just going to assume that all registered voters will vote so I can just compare posterior proportions. When you run this, you obtain estimates (conditional on the data and, most importantly, the model) that the probability of an Obama victory over McCain is about 88% and the probability of a Clinton victory is about 72%. There is a roughly 70% posterior probability that Obama would win more electoral votes than Clinton.
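For anyone who wants to try something similar, here is a minimal sketch of that kind of simulation in Python with NumPy. This is not the code I ran, and the numbers in it are made-up toy data for three imaginary states, not the SurveyUSA marginals. The function name `simulate_dem_electoral_votes` and the winner-take-all rule are assumptions for illustration; with real data you would convert each state's reported percentages into approximate counts out of 600 and use the actual electoral vote totals.

```python
import numpy as np

def simulate_dem_electoral_votes(counts, electoral_votes, n_draws=10_000, seed=0):
    """Posterior draws of the Democrat's electoral vote total.

    counts: (n_states, 3) respondent counts per state, ordered as
            [Democrat, McCain, undecided/other].
    electoral_votes: (n_states,) electoral votes per state
            (winner-take-all is assumed for every state).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    ev = np.asarray(electoral_votes, dtype=np.int64)
    dem_ev = np.zeros(n_draws, dtype=np.int64)
    for state_counts, state_ev in zip(counts, ev):
        # Flat Dirichlet(1, 1, 1) prior + multinomial likelihood
        # gives a Dirichlet(counts + 1) posterior for this state.
        draws = rng.dirichlet(state_counts + 1.0, size=n_draws)
        # The Democrat carries the state in a draw when their
        # posterior share exceeds McCain's.
        dem_ev += state_ev * (draws[:, 0] > draws[:, 1])
    return dem_ev

# Hypothetical toy data: three imaginary states, 600 respondents each.
toy_ev = np.array([10, 20, 5])
toy_obama = np.array([[300, 250, 50], [270, 280, 50], [290, 270, 40]])
toy_clinton = np.array([[280, 270, 50], [260, 290, 50], [300, 260, 40]])

# The two matchups are treated as independent, as in the naive model.
obama_ev = simulate_dem_electoral_votes(toy_obama, toy_ev, seed=1)
clinton_ev = simulate_dem_electoral_votes(toy_clinton, toy_ev, seed=2)

needed = toy_ev.sum() / 2  # majority of the toy total; 270 of 538 in reality
print("P(Obama beats McCain):  ", (obama_ev > needed).mean())
print("P(Clinton beats McCain):", (clinton_ev > needed).mean())
print("P(Obama EV > Clinton EV):", (obama_ev > clinton_ev).mean())
```

Each printed number is just the share of posterior draws in which the event occurs, which is why functions of the parameters are so cheap to summarize here.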
As I mentioned, this is an extremely naive Bayesian approach. There are a lot of ways that one could make the model better: adding additional sources of uncertainty, allowing for correlations between the states, using historical information to inform priors, and imposing a hierarchical structure to shrink outlying estimates toward the grand mean. One place to start would be by modeling the pairs of responses to the two hypothetical matchup questions. Any of these things, however, is going to be much easier to do in a Bayesian framework, since calculating posterior distributions of functions of the model parameters is extremely easy.