22 April 2008
Update: Check out how my predictions fared! Two comparisons are given, one showing both maps in the same image and one as an animated GIF (kudos to the animation package in R).
Overall, my predictions did pretty well. Their correlation with the true vote shares was .89, which gives an R^2 of .79, just below the in-sample R^2. My biggest miss was Centre County, where I predicted that Clinton would edge out Obama. Instead, Obama won pretty convincingly, with over 60% of the vote. I also overestimated Obama’s support in some of the counties surrounding Philadelphia. I'm not sure what I can do to improve the model next time. If you have any ideas, leave a comment.
Original entry: This isn't my normal blogging day, but I wanted to show my final Pennsylvania prediction map. Later on I will update my post to include the true map in the same color scheme, so we can compare. I have updated the prediction model after everyone's suggestions last time.
The big problems last time were:
There were other comments, too, but not all of them could be addressed effectively (What else can I do besides predict on the county level? That's where we have data!) Well, I'm happy to say that for the latest model I pulled in lots more covariates from the census:
With all these, the model fits like a dream come true: R^2 = 0.82, with a residual standard error of 0.04 (i.e., about ±8 percentage points of Obama's true share at two standard errors). Here are the estimated coefficients (after pruning some variables based on the BIC):
The coefficients are pretty much as you would expect: counties with more Blacks, more young people and higher incomes vote for Obama. Poorer counties and counties where Kerry did well tend to go for Clinton. The only somewhat surprising part is the negative coefficient on male population. You would think counties with more females would go for Clinton. There's probably some confounder, because there were several counties in Ohio with 55% male populations that went for Clinton.
Anyway, I will update this post tomorrow comparing my predictions to the realized results.
a) 100 students take a class, and 50 pass.
b) Given that next time, 50 students pass the (identical) class, how many students, on average, were enrolled?
The "fallacy" is in assuming that the expected number of original enrollees is 100, when it must necessarily be greater than 100 due to the uncertainty in the estimation of passing the class. The article points out that it's ignorance of the prior distribution of passing students that's at fault for the "fallacy" - I argue that it's the prior distribution of one student passing a test that's the cause of the paradox.
Break the problem in two:
a) 100 students take a class, and 50 pass.
Assume for the moment that a student passes or fails the class independently of their peers (which is a reasonable assumption for the initial problem, dealing with the failure rate of vehicles). Let's assume the standard noninformative prior case, that "half a student" passes and "half a student" fails (the Jeffreys prior, Beta(0.5,0.5)), and that students are basically identical. Then the posterior distribution of the probability of passing the test is a Beta(50.5,50.5) distribution.
b) Given 50 students passed, on average how many enrolled?
The number of students enrolled in the class for each one who passed is then 1/p, but the mean of 1/p (in this case, 100/49.5 ≈ 2.02) is necessarily greater than 1/(the mean of p), which is 2. So the expected class size must be greater than 100 under these assumptions: roughly 50 × 2.02 ≈ 101 students enrolled.
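For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes the Beta(50.5,50.5) posterior described above and uses the standard closed form E[1/p] = (a+b-1)/(a-1) for p ~ Beta(a,b) with a > 1. The function name, the seed and the number of draws are my own illustrative choices, and the simulation is only a sanity check on the formula.

```python
import random


def mean_inverse_p(a, b):
    """Closed-form E[1/p] for p ~ Beta(a, b); finite only when a > 1."""
    if a <= 1:
        raise ValueError("E[1/p] is infinite unless a > 1")
    return (a + b - 1) / (a - 1)


# Jeffreys prior Beta(0.5, 0.5) updated with 50 passes and 50 failures.
a, b = 0.5 + 50, 0.5 + 50
passed = 50

exact = mean_inverse_p(a, b)   # mean of 1/p
naive = 1 / (a / (a + b))      # 1 / (mean of p)

# Monte Carlo check of the closed form (seed and sample size are arbitrary).
rng = random.Random(0)
draws = [1 / rng.betavariate(a, b) for _ in range(200_000)]
simulated = sum(draws) / len(draws)

print(f"1 / E[p]            = {naive:.4f}")
print(f"E[1/p] (exact)      = {exact:.4f}")
print(f"E[1/p] (simulated)  = {simulated:.4f}")
print(f"expected class size = {passed * exact:.1f}")
```

The gap between the two quantities shrinks as the posterior tightens, which is why 100 observed students only move the answer from 100 to about 101.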
The original authors, however, make a profound overestimation of the average of starting students, choosing a "posterior" distribution that yields a class size of 150. To get an expectation this big with this prior information, we would need a posterior of Beta(2.0,2.0), which gives a mean of 1/p equal to 3; that is, 1.5 students passing and 1.5 failing! Putting this in perspective, the most likely way I can see this happening is that students pooled their talents and produced 3 distinct final papers: one good, one bad, and one just good enough to get the professor to flip a coin.
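The Beta(2.0,2.0) figure can be recovered by inverting the same formula. The sketch below assumes a symmetric Beta(a,a) posterior and the relation class size = passed × E[1/p] with E[1/p] = (2a-1)/(a-1); the function names are hypothetical and this is only a reconstruction of my own back-of-the-envelope step, not the original authors' model.

```python
def expected_class_size(passed, a, b):
    """passed * E[1/p] for p ~ Beta(a, b); requires a > 1."""
    if a <= 1:
        raise ValueError("E[1/p] is infinite unless a > 1")
    return passed * (a + b - 1) / (a - 1)


def symmetric_shape_for(passed, target_size):
    """Solve passed * (2a - 1) / (a - 1) = target_size for a."""
    ratio = target_size / passed
    if ratio <= 2:
        # For a symmetric Beta(a, a), E[1/p] always exceeds 2.
        raise ValueError("target must exceed twice the number passed")
    return (ratio - 1) / (ratio - 2)


a = symmetric_shape_for(passed=50, target_size=150)
pseudo_counts = a - 0.5  # observations on top of the Jeffreys prior Beta(0.5, 0.5)

print(f"posterior: Beta({a}, {a})")
print(f"implied data: {pseudo_counts} passing, {pseudo_counts} failing")
print(f"check: expected class size = {expected_class_size(50, a, a)}")
```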
It does, however, seem to explain why Harvard classrooms always seem to overflow chaotically at the beginning of each term.
P.S. The original authors call this the "backwards reasoning fallacy", even though Google says the name is better applied to startling schoolchildren deterministically rather than failing them stochastically. Resolving the namespace collision here, does this problem go by another name, or shall we go via Stigler and call it Gelman's paradox?
Update: We recently received this comment from the work's original author, as the comment system failed to post it. I've attached it verbatim. -AT, 8-12-08
I am the author of the original article and a colleague of mine alerted me to your posting on Andy Gellman's blog. You said (about my article):
"An interesting problem with an awful delivery."
You also said:
"I'd normally agree that someone's selling something with this, but the fact that the page was cosponsored by a university makes me wonder about their grossly exaggerated result."
For a start it would not have been too difficult for you to have found out who I was since my name is very clearly stated at the bottom of the article, and the web site provides full information about me. So it would have been nice for you to raise the concerns you have about the article with me directly rather than through the use of insulting comments on a third party web site.
As to the substance of your criticisms, you seem to have misunderstood the particular problem and context and have produced a different model, that does not address the very real example that we had to deal with. You say that
"The original authors ... make a profound overestimation of the average of starting students, choosing a "posterior" distribution that yields a class size of 150."
This is not what I did at all. I made it clear that the crucial assumption was the prior average class size. To illustrate the problem I chose an example in which the prior average was deliberately high, 180. The fact that this gives a posterior average class size of about 153 when the 50 passes is observed is exactly the point I wanted to emphasize. Your comment about us making a "profound overestimation" is quite simply nonsense. Part of the fallacy was to assume that the class size of 100 in the specific example was in any way representative of the average class size.
I suggest you read the article again and pay particular attention to the (real) vehicle example at the end. The model that I produced EXACTLY represented the real data.
You should also be aware that the aim of my probability puzzles/fallacies web page is to raise awareness of probability (and in particular Bayesian reasoning) to as broad an audience as possible. While I am pleased if other professional statisticians read it, it is not they who are the target. This means having to use a language and presentation style that does not fit with the traditional academic approach.
In fact, one thing I have discovered over the years is that too many academic statisticians tend to speak only to other like-minded academic statisticians. The result is that in practice (i.e. in the real world) potentially powerful arguments have been 'lost' or simply ignored due to the failure to present them in a way in which lay people can understand. I have seen this problem extensively first hand in work as an expert witness. For example, in a recent medical negligence case the core dispute was solved by a very straightforward Bayesian argument. However, this had been presented to the defence lawyers and expert physicians in the traditional formulaic way. Neither the lawyers nor the physicians could understand the argument, and the QC was adamant that he could not present it in court. We were brought in to check the validity of the Bayesian results and to provide a user-friendly explanation that would enable the lawyers and doctors to understand it sufficiently well to present it in court. The statisticians simply did not realise that what is simple to them may be incomprehensible to others, and that there are much better (visual) ways to present these arguments. We used a decision tree and all the parties understood it immediately because it was couched in term of real number of patients rather than abstract probabilities. Had we not been involved the (valid) Bayesian argument would simply have never been used.
Professor of Computer Science
Head of RADAR (Risk Assessment and Decision Analysis Research)
Computer Science Department
Queen Mary (University of London)
London E1 4NS.
Tel: 020 7882 7860
32-33 Hatton Garden
London EC1N 8DL
Tel: +44 (0) 20 7404 9722
Fax: +44 (0) 20 7404 9723