31 October 2005
A few years ago, I taught the following lesson in my daughter's kindergarten class and my graduate methods class in the same week. It worked pretty well in both. Anyone who has a kid in kindergarten, some good graduate students, or both, might want to try this. It was especially fun for the instructor.
To start, I hold up some nails and ask "does everyone like to eat nails?" The kindergarten kids scream, "Nooooooo." The graduate students say "No," trying to look cool. I say I'm going to convince them otherwise.
I hand out a little magnet to everyone. I ask the class to figure out what it sticks to and what it doesn't stick to. After a few minutes running around the classroom, the kindergartners figure out that magnets stick to stuff with iron in it, and anything without iron in it doesn't stick. The graduate students sit there looking cool.
From behind the table, I pull out a box of Total Cereal (teaching is just like doing magic tricks, except that you get paid more as a magician). I show them the list of ingredients; "iron, 100 percent" is on the list. I ask by a show of hands whether this is the same iron as in the nails. 3 of 23 kindergarten kids say "yes"; 5 of 44 Harvard graduate students say "yes" (almost the same percentage in both classes!).
I show the students that the box is sealed (and I have nothing up my sleeves). Then I open the box, spill some cereal on a cutting board, and smash it into tiny pieces with a rolling pin. I take the pile of cereal around the room and let the kids put their magnets next to it and see whether the cereal sticks. To everyone's amazement, it sticks!
Then I ask, are we now convinced that the iron in the nails is the same iron as in the cereal? All the kids in kindergarten and all the graduate students say "yes."
I respond by saying "but how do you know the cereal stuck to the magnet because it had iron in it? Maybe it was just sticky, like gum or tape." Now that I finally have their attention (not a minor matter with kindergartners), I get to explain to them what a control group is. And from behind the table, I pull out a box of Rice Krispies (which are made of nothing). We examine the side of the box to verify the lack of (much) iron, and then I smash up the Rice Krispies, and let them see if their magnet sticks. It doesn't stick!
Everyone gets to take home a cool fact (they love to eat the stuff in nails), I get to convey the point of the lesson in a way they won't forget (the central role of control groups in causal inference), and everyone gets a free magnet.
30 October 2005
This week, the Applied Statistics Workshop will present a talk by Guido Imbens of the University of California at Berkeley Department of Economics. Professor Imbens is currently a visiting professor in the Harvard Economics Department and is one of the faculty sponsors of the Applied Statistics Workshop, so we are delighted that he will be speaking to the group. He received his Ph.D. from Brown University and has served on the faculties of Harvard and UCLA before moving to Berkeley. He has published widely, with a particular focus on questions relating to causal inference.
Professor Imbens will present a talk entitled "Moving the Goalposts: Addressing Limited Overlap in Estimation of Average Treatment Effects by Changing the Estimand." If you have been following the discussion on achieving balance taking place on the blog, this talk should be of great interest. It considers situations in which balance is difficult to achieve in practice, and suggests that estimating treatment effects for statistically defined subsamples may produce better results. The presentation will be at noon on Wednesday, November 2 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The abstract of the paper follows after the jump:
Estimation of average treatment effects under unconfoundedness or selection on observables is often hampered by lack of overlap in the covariate distributions. This lack of overlap can lead to imprecise estimates and can make commonly used estimators sensitive to the choice of specification. In such cases researchers have often used informal methods for trimming the sample or focused on subpopulations of interest. In this paper we develop formal methods for addressing such lack of overlap in which we sacrifice some external validity in exchange for improved internal validity. We characterize optimal subsamples where the average treatment effect can be estimated most precisely, as well as optimally weighted average treatment effects. We show that the problem of lack of overlap has important connections to the presence of treatment effect heterogeneity: under the assumption of constant conditional average treatment effects the treatment effect can be estimated much more precisely. The efficient estimator for the treatment effect under the assumption of a constant conditional average treatment effect is shown to be identical to the efficient estimator for the optimally weighted average treatment effect. We also develop tests for the null hypotheses of a constant and a zero conditional average treatment effect. The latter is in practice more powerful than the commonly used test for a zero average treatment effect.
28 October 2005
The recent posts on achieving good balance within matching have stimulated a certain amount of interest. To this debate I offer more questions and, alas, no answers, which are what I'd really like to know. (For what it's worth, I am not doing research in this area. All of my questions are genuine, not rhetorical.)
As I understand it, the genetic algorithm that Diamond and Sekhon favor searches for matches that minimize p-values from hypothesis tests. The subject of the hypothesis tests are the covariates, taken one at a time, and the two-way interactions, also taken one at a time.
Is the objective in matching treated and control units to find sets of observations with the same JOINT distribution of the covariates, which is what one would have in a randomized experiment?
If so, do we expect achieving balance in all univariate (i.e. marginal) and two-way distributions to accomplish this goal, given that the marginal distributions of any multidimensional random vector do not determine the joint? On the other hand, if two sets of random vectors have the same joint distribution, would we expect hypothesis tests applied to individual (univariate) covariates or their interactions to achieve p-values of .15 or greater?
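To make the marginal-versus-joint point concrete, here is a tiny made-up example (my own Python sketch, not from any of the papers under discussion): two groups of binary covariates whose marginal distributions agree exactly while their joint distributions share no support at all.

```python
# A minimal sketch with invented data: equal marginals, disjoint joints.
group_a = [(0, 0), (1, 1)] * 50   # x1 and x2 perfectly correlated
group_b = [(0, 1), (1, 0)] * 50   # x1 and x2 perfectly anti-correlated

def marginal_share(sample, coord):
    """Fraction of observations whose coordinate `coord` equals 1."""
    return sum(obs[coord] for obs in sample) / len(sample)

# The marginal distributions of x1 and x2 agree exactly across groups ...
assert marginal_share(group_a, 0) == marginal_share(group_b, 0) == 0.5
assert marginal_share(group_a, 1) == marginal_share(group_b, 1) == 0.5

# ... yet the joints could hardly differ more: group_a puts all its mass
# on {(0,0), (1,1)}, group_b on {(0,1), (1,0)}.
joint_a = {pair for pair in group_a}
joint_b = {pair for pair in group_b}
print(joint_a & joint_b)  # set() -- no overlap in support at all
```

Univariate checks on x1 and x2 separately would declare these two groups perfectly balanced.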
Does the dimension of the vector (i.e. the number of covariates) play a role here, in that if we had 20 covariates, we would expect a comparison of individual covariates marginally to produce a few p-values of below .15? Perhaps more broadly, what theory tells us that the genetic algorithm search is actually attempting to do the right thing - and what is it?
A propensity score method has answers to some of these questions, though it raises others. On the plus side, the theorems say that observations with the same propensity score have the same joint (not merely marginal) distribution of the covariates. Thus, if the goal is to replicate a randomized experiment's much-valued ability to produce observations with the same joint covariate distribution, conditioning on the true propensity score will do that. That's the theory that tells us what propensity score matching is attempting to do is the right thing. The problem is, of course, that in any case that matters, we don't know the true propensity scores, and estimation of them raises profound questions about model fit and adequacy. One can check disparities in marginal distributions, but for the reasons stated above, such checks are not really enough. A question for advocates of propensity scores is the following: if propensity score matching is designed to reduce dependence on the substantive model that relates outcomes to covariates, does it do so only by inducing dependence on proper specification of the propensity score model?
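As a toy illustration of the balancing property (a simulation of my own, with an artificially known propensity score rather than an estimated one), conditioning on the true score makes the covariate distribution the same among treated and control units within a stratum:

```python
import random

random.seed(0)

# Toy simulation: X takes values 0/1/2 and the *true* propensity score
# is known by design; X=0 and X=1 share the same score.
true_propensity = {0: 0.3, 1: 0.3, 2: 0.7}

data = []
for _ in range(20000):
    x = random.choice([0, 1, 2])
    t = 1 if random.random() < true_propensity[x] else 0
    data.append((x, t))

# Within the stratum e(X) = 0.3 (i.e., X in {0, 1}), the distribution of
# X should look the same among treated and control units:
stratum = [(x, t) for x, t in data if true_propensity[x] == 0.3]
treated = [x for x, t in stratum if t == 1]
control = [x for x, t in stratum if t == 0]
share_x0_treated = treated.count(0) / len(treated)
share_x0_control = control.count(0) / len(control)
print(round(share_x0_treated, 2), round(share_x0_control, 2))  # both near 0.5
```

Of course, the whole difficulty described above is that in real data the dictionary `true_propensity` is exactly what we do not have and must estimate.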
For those who would eschew hypothesis tests in assessing balance (see yesterday's post), how does one assess balance? True, one can always reduce the power of any test to reject a null by discarding observations (I have heard that K-S in particular has low power), but any comparison of distributions rests on some set of criteria. Looking at t-scores is a hypothesis test (how else would one decide when the set of scores is too big or too small?). Are hypothesis tests the worst method of assessing balance, except for all of the others?
I have only one suggestion on this subject: whatever method one uses to create matched sets of treated and control groups, after all ordinary checking of marginal distributions is complete, throw something completely wild at the results. For both groups, calculate a fifth moment of covariate one, interact it with a third moment of covariate two and a second moment of covariate three. Do a test and see what happens. If the two groups have the same joint distribution of their covariates . . . .
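In the spirit of that suggestion, here is a simulated sketch (made-up data, and with milder exponents than the fifth/third/second moments above, to keep the statistic well behaved in a short run): two groups with identical standard-normal marginals but different joint distributions, caught by a moment interaction that ordinary marginal checks would miss.

```python
import math
import random

random.seed(1)

def draw_group(n, rho):
    # Three covariates; rho controls the dependence between x1 and x2,
    # while every marginal stays standard normal regardless of rho.
    out = []
    for _ in range(n):
        z = random.gauss(0, 1)
        x1 = z
        x2 = rho * z + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
        x3 = random.gauss(0, 1)
        out.append((x1, x2, x3))
    return out

treated = draw_group(2000, rho=0.8)   # all marginals standard normal ...
control = draw_group(2000, rho=0.0)   # ... but a different joint

def wild_stat(group):
    # The "something completely wild": a moment interaction of covariates.
    return [(x1 ** 3) * x2 * (x3 ** 2) for x1, x2, x3 in group]

def welch_t(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(wild_stat(treated), wild_stat(control))
print(round(t, 1))  # a large |t| exposes the difference in the joints
```

Any univariate check on x1, x2, or x3 alone would pass here; only the interaction of moments reveals that the joints differ.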
27 October 2005
Jens' last two blog posts constitute an excellent statement of where the literature on matching is, but I think almost all of the literature has this point wrong. Hypothesis tests for checking balance in matching are in fact (1) unhelpful at best and (2) usually harmful.
Suppose you had a control group and a treatment group that were identical (exactly matched) except for one person, or except for a bunch of people differing in one very minor way. Suppose hypothesis tests indicate no difference between the groups, so you would report that balance was great and no further adjustment was needed. (We might think of this as a real experiment where the outcome variable hasn't yet been collected because it is expensive to do so.) If you were given the chance of dropping the one or few people that caused the two groups to differ and replacing them with others that matched exactly, would you do so? Since the dimension on which the inexact match or matches occurred might be the one that has a huge effect on your outcome variable, the bias from not switching could be huge. So you'd undoubtedly make the switch, despite the fact that the hypothesis test indicated there was no problem. Hence (1) the tests are unhelpful: passing the test does not necessarily protect one from bias any more than failing it does.
Now suppose you have data that don't match very well by any hypothesis test, and you randomly (rather than systematically, to improve matching) drop observations, in a bad application of matching. What will happen? Your t-tests, KS tests, and any other hypothesis tests will lose power and so will indicate that balance is getting better and better. Yet bias is not changing at all, and efficiency is dropping fast. The tests are telling you to discard data! Hence (2) hypothesis tests to evaluate balance are harmful, quite seriously so.
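The power mechanics are easy to simulate. A made-up demonstration (my own sketch, with a Welch t statistic computed by hand): randomly discarding observations leaves the imbalance unchanged in expectation but bleeds power out of the test, so the t statistic drifts toward "no problem."

```python
import math
import random

random.seed(2)

# Two groups with a genuine, fixed imbalance in means of 0.3.
treated = [random.gauss(0.3, 1) for _ in range(2000)]
control = [random.gauss(0.0, 1) for _ in range(2000)]

def welch_t(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

for n in (2000, 200, 20):
    sub_t = random.sample(treated, n)   # drop observations at random
    sub_c = random.sample(control, n)
    diff = sum(sub_t) / n - sum(sub_c) / n
    print(n, round(diff, 2), round(welch_t(sub_t, sub_c), 2))
# The observed mean difference stays around 0.3 in expectation at every n,
# but the t statistic collapses as observations are thrown away.
```

Nothing about the imbalance has improved; only the test's ability to see it has been destroyed.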
The fact is that there is no superpopulation to which we need to infer features of the explanatory variables; all analysis models we regularly use after matching are conditional on X. Balance should be assessed on the observed data, and not be the subject of inference or hypothesis tests.
This message rehearses an argument in a to-be-revised version of our matching paper by Ho, Imai, King, and Stuart that we hope to be finished with and post in a couple of weeks.
26 October 2005
Continuing from yesterday's post, another popular way to test balance is to examine standardized differences (SDIFF) between groups (Rosenbaum and Rubin 1985). The SDIFF captures the difference in means in the matched samples, scaled by the square root of the average variance in the unmatched groups. This test has been criticized for its lack of formal criteria for judging the size of the standardized bias. Moreover, it may be open to manipulation, as one can add observations to the control group in order to decrease the variance in the denominator (Smith and Todd 2005).
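For concreteness, here is a sketch of the standardized difference (conventions for the scaling vary across papers; this follows the "average of the unmatched group variances" version described above, and all numbers are hypothetical):

```python
import math

def sdiff(mean_t, mean_c, var_t_unmatched, var_c_unmatched):
    # Difference in matched means, scaled by the square root of the
    # average variance in the unmatched groups, in percent of an SD.
    pooled_sd = math.sqrt((var_t_unmatched + var_c_unmatched) / 2)
    return 100 * (mean_t - mean_c) / pooled_sd

# e.g., age in years in the matched samples, variances from the raw data:
print(round(sdiff(41.2, 40.1, 9.0 ** 2, 8.5 ** 2), 1))  # about 12.6
```

The manipulation worry is visible in the formula itself: inflate the variances in the denominator and the reported SDIFF shrinks, with the actual mean difference untouched.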
Staying in the realm of univariate balance tests, some claim that difference in means tests are insufficient and that Kolmogorov-Smirnov (KS) tests are needed to non-parametrically test for the equality of distributions (Diamond and Sekhon 2005). These KS tests need to be bootstrapped, by the way, to yield correct coverage in the presence of point masses in the distributions of the covariates (Abadie 2002). Again, these tests would substantially increase the balance hurdle. Are they necessary for reliable causal inference?
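A sketch of the bootstrapped two-sample KS idea (my own simplified version, with simulated stand-in data including a point mass at zero): approximate the null distribution of the KS statistic by resampling from the pooled data, which remains valid where the classical asymptotic p-value breaks down.

```python
import random

random.seed(3)

def ks_stat(a, b):
    # Maximum absolute gap between the two empirical CDFs.
    grid = sorted(set(a) | set(b))
    ecdf = lambda sample, x: sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in grid)

treated = [random.gauss(0, 1) for _ in range(50)] + [0.0] * 10  # point mass
control = [random.gauss(0, 1) for _ in range(50)] + [0.0] * 10

observed = ks_stat(treated, control)
pooled = treated + control
boot_stats = []
for _ in range(200):
    # Resample under the null that both groups share one distribution.
    resample = [random.choice(pooled) for _ in range(len(pooled))]
    boot_stats.append(ks_stat(resample[:len(treated)], resample[len(treated):]))

p_value = sum(b >= observed for b in boot_stats) / len(boot_stats)
print(round(observed, 3), round(p_value, 3))
```

This is only meant to show the mechanics; production uses would want many more bootstrap draws and a faster ECDF computation.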
Apart from univariate tests there are also some multivariate balance tests floating around in the literature such as the Hotelling T^2 test of the joint null of equal means of all covariates, multivariate (bootstrapped) Kolmogorov-Smirnov (KS) and Chi-Square null deviance tests based on the estimated assignment probabilities, as well as various regression-based tests for joint insignificance, etc. Which of these tests is preferable in what situation? What is the relationship between uni- and multivariate balance?
Last but not least, there is the thorny question of significance levels. Is a p-value of 0.10, let's say against the null of equality of means, high enough for satisfactory balance? Is .05 permissible? There is evidence that conventional significance standards are too lenient to obtain reliable causal inference in the canonical LaLonde data set (Diamond and Sekhon 2005).
These are too many questions to which I do not know the answers. The current lack of a scholarly standard for covariate balance strikes me as troubling, because balance affects the quality of the causal inferences we draw. I think it is important to bring the balance issue to the forefront of the matching debate. That is why Jas Sekhon and I are currently working on a paper on this topic. Suppose you are reviewing a matching article. What does it take to convince you that the authors "achieved balance"? Please feel cordially invited to join the debate.
25 October 2005
I thought you might be interested in a newly updated dataset of almost 10 million individually coded international events (1990-2004). Each event is summarized in the data as "Actor A does something to Actor B," with Actors A and B coded for about 450 countries (and other actors) and "does something to" coded in an ontology of about 200 types of actions. The data are coded by a computer "reading" millions of Reuters news reports. Will Lowe and I wrote an article* that evaluated the software system (produced by VRA) that performs this task and found that, for the number of events it was possible to convince humans (trained Harvard undergraduates) to code by hand, the machine did as well as the humans. However, in part because there is only so much pizza you can feed undergraduates, the machine clearly dominates for larger numbers of events. We previously released a dataset with 3.5 million events; this one is bigger, more accurate (since the software has been improved), and covers a longer time period.
Most international relations data are limited to analyses aggregated to the year or month. Yet, as we say in the article, when the Palestinians launch a mortar attack into Israel, the Israeli army does not wait until the end of the calendar year to react. We think there is much to be learned about international relations from data like these. For the data, documentation, and our article, see this site.
*Gary King and Will Lowe. 2003. "An Automated Information Extraction Tool For International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design" International Organization, 57, 3 (July, 2003): Pp. 617-642.
There is a growing consensus in the causal inference literature that when it comes to bias adjustment under selection on observables, matching methods dominate ordinary regression (especially when discrepancies between groups are large). But how do we judge the quality of a matching? My professors tell me: "We want good balance." Sounds great, I thought at first. Reading more matching articles, however, I soon became somewhat startled by the scholarly disagreement about what actually constitutes "good" balance in observational studies. Despite the fact that matching methods are now widely used all across the social sciences, we still lack shared standards for covariate balance: Which tests should be used in what type of data? What are their statistical properties and how do they compare to each other? And how much balance is good enough?
From reading this literature (sincere apologies if I have missed something relevant), it seems to me that most people agree that paired t-tests for differences in means are obligatory. T-tests are useful because matching by construction produces matched pairs. But should we test by comparing whole groups (treated vs. matched-untreated) or within propensity score ("PS") subclasses? A problem with the latter may be that the choice of intervals can be arbitrary, which is critical as interval width affects the power of the test (Smith and Todd 2005).
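For concreteness, here is what the obligatory paired t-test looks like on a handful of invented matched pairs (treated value, matched-control value), say years of schooling:

```python
import math

# Hypothetical matched pairs produced by some matching procedure.
pairs = [(12, 12), (16, 14), (10, 11), (12, 13), (14, 14),
         (18, 16), (12, 10), (11, 12), (13, 13), (15, 14)]

diffs = [t - c for t, c in pairs]
n = len(diffs)
mean_d = sum(diffs) / n
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
t_stat = mean_d / (sd_d / math.sqrt(n))   # paired t statistic
print(round(t_stat, 2))  # compare to a t distribution with n-1 df
```

The within-pair differencing is what distinguishes this from the whole-group comparison mentioned above; the subclassification variant would run the same computation within each propensity score interval.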
Moreover, which covariates should we t-test balance on? At least all that are included in the matching (right?), but how about other moments, the full set of interactions and higher-order terms, etc? The latter seems helpful to minimize bias but is done once in a blue moon (at least in the papers that I encountered). Most authors avoid these additional tests since they exacerbate common support problems and substantially raise the hurdle for obtaining balance.
Finally, should we t-test balance on the propensity score itself and/or on the covariates orthogonalized to the propensity score? How do we deal with the estimation uncertainty in these variables? And what does it mean, as sometimes happens in practice, to have remaining imbalance on the propensity score while all covariates are balanced?
Stand by for part II of this post tomorrow.
24 October 2005
This week, the Applied Statistics Workshop will be presenting a talk by Gopi Goswami of the Harvard Statistics Department entitled "Evolutionary Monte Carlo Methods for Clustering." Gopi Goswami received his Ph.D. from the Department of Statistics at Harvard in June 2005. Before coming to Harvard, he was an undergraduate and master's student at the Indian Statistical Institute in Calcutta. His dissertation, "On Population-Based MCMC Methods," develops new techniques for more efficiently sampling from a target density. He is currently a post-doctoral scholar in the Harvard Statistics Department. The presentation will be at noon on Wednesday, October 26 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided. The paper he will present on Wednesday explores these methods in the context of clustering problems:
We consider the problem of clustering a group of observations according to some objective function (e.g., K-means clustering, variable selection) or according to a posterior density (e.g., the posterior from a Dirichlet Process prior) of cluster indicators. We cast both kinds of problems in the framework of sampling cluster indicators. So far, Gibbs sampling, the "split-merge" Metropolis-Hastings algorithm, and various modifications of these have been the basic tools for sampling in this context. We propose a new population-based MCMC approach, in the same vein as parallel tempering. We introduce three new "crossover moves" (based on swapping and reshuffling sub-cluster intersections) which make the algorithm very efficient with respect to the Integrated Autocorrelation Time (IAT) of various relevant statistics and also with respect to the ability to escape from local modes. We call this new algorithm the Population-Based Clustering (PBC) algorithm. We apply the PBC algorithm to motif clustering, Beta mixture of Bernoulli clustering, and a Bayesian Information Criterion (BIC) based variable selection problem. We also discuss clustering of mixtures of Normals and compare the performance of the PBC algorithm as a stochastic optimizer with K-means clustering.
One of my most embarrassing experiences occurred surrounding the use of instrumental variables in my ASR article with Sanjeev Khagram on inequality and corruption (2005). The article developed from my qualifying paper on causes of corruption (2003), in which I examined several hypotheses on the causal effects of inequality, democracy, economic development, and trade openness. Since all four of these explanatory variables may be affected by corruption, I tried to find appropriate instruments. Initially, I tried five: latitude, number of frost days, a malaria prevalence index, ethno-linguistic fractionalization, and constructed openness. They had strong predictive power for the endogenous variables in the first-stage regression, and the p-values for the over-identification test in the second-stage regressions were generally large enough that I could not reject the null hypothesis of no correlation between the instruments and the error term of the regression. I worked with Professor Khagram to turn my qualifying paper into a publishable article, and we submitted our manuscript to the ASR. The first review we received from the editor was encouraging. The editor advised us to "revise and resubmit" in a three-page letter that showed his interest in our paper. But the editor, as well as an anonymous reviewer, asked us to provide an argument explaining how our instruments were correlated with the endogenous variables but not directly correlated with corruption. I initially considered responding to this critique by citing Rodrik et al.'s draft paper entitled "Institutions Rule: The Primacy of Institutions over Geography and Integration in Economic Development" (later published in the Journal of Economic Growth, 2004), which argued, "An instrument is something that simply has some desirable statistical properties. It need not be a large part of the causal story."
However, I was criticized for my use of instruments when I presented at a Work-in-Progress Seminar at the Kennedy School of Government and at the Comparative Political Economy Conference at Yale University in spring 2004. In the Work-in-Progress Seminar, some professors at the Kennedy School noted that the overidentification test can pass even if all the instruments are wrong in the same direction. At the Yale conference, Professor Daron Acemoglu of MIT was a discussant for my paper, and he used the term "IV etiquette" to emphasize the importance of giving a plausible story for the first stage. He pointed out that without a clear story for the first stage, it is impossible to tell whether the instrument is uncorrelated with unobserved determinants of the dependent variable. It was a truly embarrassing moment to be criticized for a lack of etiquette in front of so many scholars.
So I had to find more convincing instruments. In this regard, I have to thank my friend Andrew Leigh, who was then a doctoral student in public policy and is currently a Research Fellow at the Australian National University. He found that "mature cohort size" can be used as an instrument for inequality in his dissertation paper entitled "Does Equality Lead to Fraternity?", based on Higgins and Williamson's (1999) theory of the cohort-size effect on income inequality. I also came to realize how helpful conference presentations and discussions can be in improving the quality of research.
21 October 2005
Let's salute the New York Times for its near-perfect polling documentation. In a recent edition of the Sunday Magazine, the Times includes a two-page spread on a phone survey on New York City politics. Though the survey touches on some life-and-death issues ("Would you ever date a Republican?"), it's really more for laughs than higher learning. Regardless, the Times goes to great lengths to describe its methodology:
"Methodology: This telephone poll of a random sample of 1,011 adults in New York City was conducted for the New York Times Magazine by Blum & Weprin Associates Inc. between Aug. 29 and Sept. 1. The sample was based on a random-digit-dialing design that draws numbers from all existing telephone exchanges in the five boroughs of New York, giving all numbers, listed and unlisted, a proportionate chance of being included. Respondents were selected randomly within the household and offered the option of being interviewed in Spanish. The overall sample results were weighted demographically and geographically to population data. The estimated average sample tolerance for data from the survey is plus or minus 3 percent at the 95 percent confidence level. Sampling error for subgroups is higher. Sampling is only one source of error. Other sources of error may include question wording, question order and interviewer effects."
That's 146 words on survey sampling, likely lost on many readers. We may quibble about the omission of the nonresponse rate (although they mention that results were weighted to represent known geographic and demographic distributions). We may find the phrase "sample tolerance" for "confidence interval" a tad confusing. We may protest that they forgot a comma before the "and" in the closing enumeration. But that's about it.
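The "plus or minus 3 percent" is easy to check: for a simple random sample it is the usual worst-case 95% margin of error for a proportion, 1.96 times the standard error at p = 0.5.

```python
import math

# Worst-case 95% margin of error for a proportion estimated from a
# simple random sample of n = 1,011 (design effects from weighting
# would widen this somewhat).
n = 1011
moe = 1.96 * math.sqrt(0.25 / n)   # 0.25 = 0.5 * 0.5, the worst case
print(round(100 * moe, 1))  # about 3.1 percentage points
```

So the Times' figure is exactly what the sample size implies, rounded down a touch.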
I would cry tears of joy if the major papers in my native Germany started taking survey sampling nearly as seriously as the Times. Instead, we get anecdote-laden head-scratching over recent failures to predict national election results with anything approaching accuracy. Seriously, I know Europeans aren't currently inclined to follow American examples. But how about an exception for attention to basic statistical ethics?
20 October 2005
I continue with my review of The Probability of God, by Stephen D. Unwin, which I began here.
The first clue I had that this book would contain anything but rigorous mathematical analysis was that I found it in Harvard's Divinity library. As expected, the book is mainly philosophical in nature, but that doesn't mean it oversteps its mathematical scope. Indeed, it gives the reader a good introduction to Bayesian inference while being very clear about its limits.
The premise is simple: start with a proposition – in this case, that a monotheistic God exists; select a series of evidential questions relevant to the investigation; and assess the evidence under each of the two mutually exclusive hypotheses.
The considerations he takes into account are as follows:
Prior distribution: Is there any reason to believe God exists, other than anti-anthropic arguments? Unwin believes there is no value in the "watchmaker" hypothesis – that the wonder and beauty we see around us is so complex that it could only have been designed by a being of a higher order than our own – and so chooses the simplest of priors: a 50-50 chance. (Unwin later demonstrates that this prior fails any reasonable sensitivity analysis – stay tuned.)
In its rawest form, Bayesian inference takes the following form:
P(prop true | evid) = P(evid | prop true) P(prop true) / [P(evid | prop true) P(prop true) + P(evid | prop false) P(prop false)]
Notice that if we divide top and bottom by P(evid | prop false), the quantity P(evid | prop true) / P(evid | prop false) appears in both. Statisticians call this a Bayes factor – the likelihood of one model relative to another – while Unwin, seeking to appeal to a wider audience, calls it a Divine Indicator. I'll continue with the former.
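In odds form the update is just: posterior odds equal the Bayes factor times the prior odds. A two-line sketch of my own makes the bookkeeping explicit; with a 50-50 prior and a Bayes factor of 10, it reproduces the 91% that appears below.

```python
# Posterior probability from a prior probability and a Bayes factor,
# via the odds form of Bayes' rule.
def update(prior, bayes_factor):
    prior_odds = prior / (1 - prior)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

print(round(update(0.5, 10), 2))  # 0.91
```

Each subsequent piece of evidence simply multiplies the running odds by its own Bayes factor.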
He then considers six "quantities" that relate to God's existence, and how they fare in a world with God or without. In particular, he examines each Bayes factor, treats each piece of evidence as independent of the others, and performs the Bayes calculation one step at a time, using each posterior probability as the next prior. Any skeptic who suspects that the framing of his inquiries is skewed by his own personal biases should remember that this is just an exercise.
In addition, to simplify the math, Unwin uses a scale of 1 to 3 to evaluate each piece of evidence, indicating no, weak, or strong support (this is my interpretation rather than a hard ranking system the author himself uses). To put this into the equation, he uses a five-level scale, setting the Bayes factor to 0.1, 0.5, 1, 2, or 10 depending on how the evidence compares under the two hypotheses.
1) The recognition that "goodness" exists. Under God, he argues, good and evil are built into the system. Without God, goodness can only be described as a pragmatic measure, so goodness wouldn't be taken in that context. Unwin starts off with a blast and gives himself a 10. P(God exists) is now 91%.
2) The recognition that "moral evil" exists. Unwin says that moral evil is inevitable in a godless universe, but that God wouldn't tolerate the degree we have right now. Strong meets weak; the Bayes factor at this step is 0.5, leaving an 83% chance. (I find this step a little unsettling, as it immediately turns God into a humanlike figure, attaching too much specificity in my mind.)
3) The recognition that "natural evil" exists. In the wake of Hurricane Katrina, a great number of survivors in Louisiana are asking themselves what kind of God would allow such a tragedy to happen. Unwin carries the same spirit across and claims that such a perspective makes little sense under God's domain. No evidence versus strong gives a Bayes factor of 0.1 and a 33% chance of God's existence.
4) The incidence of "intra-natural" miracles (such as whether praying for the Red Sox to win makes it so). Studies are carried out routinely on whether organized prayer can aid in the healing process. Never mind that these studies are highly unscientific – there isn't an equal group praying against another patient with roughly the same path to recovery, and a control group is nearly impossible to manufacture. Unwin doesn't mind the inconclusiveness of these experiments; instead he relies on personal perspective and finds that prayer has some place in a world with God but little in one without. A Bayes factor of 2 brings the probability of God back to 50%.
5) The incidence of "extra-natural" miracles (those that can't be explained by science). These sorts of miracles were observed before God, so Unwin says many other systems are good enough to explain their existence (though certainly not their cause). Equal evidence means a Bayes factor of 1, and the probability of God holds at 50%.
6) Religious experiences. I find this category to be the weakest of Unwin's areas of evidence, since it immediately suggests a stacked deck. Unwin does hold back and merely suggests that what we perceive to be religious experiences – perceived moments of oneness with a higher power – are more likely to be justified if there is such a higher power. Unwin gives a Bayes factor of 2, bringing us to the conclusion that, in his perspective, the probability of God's existence is 67%.
Now many of you (including my co-authors) are bewildered as to why I'd consider this book, and this analysis, relevant to the practice of statistics. To begin with – or rather, end with – Unwin admits that this test is extremely sensitive to the choice of prior beliefs. Under his assessment of the evidence, his prior belief in God's existence (50%) yields a probability of God's existence of 67%; prior beliefs of 10% or 75%, given the same evidence, swing the result to 18% or 86%, respectively.
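That sensitivity analysis is easy to reproduce with a short script of my own (not Unwin's): chain his six Bayes factors and vary the prior.

```python
# Chain the six Bayes factors from the list above and vary the prior.
def update(prior, bayes_factor):
    odds = bayes_factor * prior / (1 - prior)
    return odds / (1 + odds)

factors = [10, 0.5, 0.1, 2, 1, 2]   # the six steps, in order

results = {}
for prior in (0.50, 0.10, 0.75):
    p = prior
    for bf in factors:
        p = update(p, bf)   # each posterior becomes the next prior
    results[prior] = p
    print(f"prior {prior:.0%} -> posterior {p:.0%}")
```

Since odds just multiply, the six steps collapse to a single cumulative Bayes factor of 10 × 0.5 × 0.1 × 2 × 1 × 2 = 2, which is why a coin-flip prior lands at two-to-one odds, i.e., 67%.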
As in many strong works of philosophy, the important lesson is not in the answer, but in asking the questions that lead there. These calculations lead only to the halfway point of the text, as Unwin now segues from his method of observation into a discussion of the nature of faith, and what components of probability and faith lead to what we understand as belief.
19 October 2005
Professor Kousser's 1984 article on objectivity in expert testimony, which I first introduced to the blog here, raises fundamental questions about the role of expert witnesses in litigation. Among those questions is the following: when presenting conclusions to a court, how much are expert witnesses entitled to rely on the adversarial process that is the foundation of lawsuits? Some experts appear to believe that their job is to present the best statistical, engineering, chemical, or whatever, case for their sides. Of course, they would not perjure themselves. Still, such witnesses do not attempt to provide a balanced look at the factual information to be evaluated; rather, they focus on demonstrating how the relevant data can be interpreted in favor of the parties retaining them. After all, the opposing side has its own lawyers and, ordinarily, its own experts who (surely) are doing the same thing.
To make matters more concrete, I provide the following simplified example. My colleagues and I retained a quantitative expert in a redistricting case to measure the partisan bias of several proposed redistricting plans. We used a measure of bias that assigned a score to each plan; a score of zero meant no bias, while a score of two meant roughly that the plan would give one party two "extra" seats. The (litigation) difficulty we ran into was that the scores did not appear to distinguish the plan we favored from the one the other side proposed. The bias in our plan was, say, .03, while that of the other side was something like .15. Thus, the difference in bias between the two plans was approximately one tenth of one seat. But our expert, at our prompting, presented the results differently: he emphasized that the other side's plan was five times more biased than our own.
Before dismissing this story, and the view of the expert as an extension of trial counsel, with a snort and a shake of the head about the lack of ethics in modern society, consider how the structure of the litigation process favors such choices. At trial, an expert (just like any other witness) is not allowed to relate his or her views directly to the court. Rather, the expert speaks to the judge or jury only in response to questions from lawyers under the duty to advocate their respective clients' cases, that is, the duty NOT to be neutral. Before trial, an expert who has consulted with a party to litigation may not be retained by the opposing party. And trial counsel, not the witness, decides whether the expert speaks to the court at all.
There are good reasons for all of these rules. The rule requiring testimony to come in response to questions from an attorney prevents witnesses from testifying about subjects deemed inadmissible (opposing counsel can object between question and answer). With respect to the prohibition on consulting with one side and then working for the other, experts who have consulted for Side A learn about Side A's case in a way that Side B might pay handsomely to discover. But if, as many in the legal profession appear to believe, expert witnesses really are whores, could it be otherwise as litigation is presently structured?
18 October 2005
After eight years of learning something about architecture (from Harry Cobb and his team) and extensive programmatic planning, the Institute for Quantitative Social Science this semester moves into the new Center for Government and International Studies buildings. Our official address is the Third Floor of 1737 Cambridge Street (the design is vaguely reminiscent of the bridge of the Starship Enterprise), although we also occupy some of the other floors and some of the building across the street. It is not really finished yet, but it is a terrific facility, with floor to ceiling windows in most offices, a wonderful seminar room for our Applied Statistics Workshop, and many other useful features. Perhaps even more remarkably, everyone seems to love it (Congratulations Harry!).
One thing I learned during this long process is that the field of architecture commands the best science, engineering, and art, but very little modern social scientific analysis. Yet social science, quantitative social science in particular, could greatly help architecture achieve its goals, I think. Ultimately the goal of this particular $100M-plus building, and of most buildings built by universities, is not only to create beautiful surroundings but also to increase the amount of knowledge created, disseminated, and preserved (my summary of the purpose of modern research universities). So do not limit yourself to asking how a building makes you feel, what architectural critics might think, how it fits in with the style of other buildings on campus, or whether your office is to your liking. Ask instead, or in addition, whether the building increases the units of knowledge created, disseminated, and preserved more than some other building or some other potential use for the money. This strikes me as the central question to be answered by those who decide which buildings to build, and yet the systematic scientific basis for this decision is almost nonexistent.
As such, some systematic data collection could have a considerable impact on this field. Do corridors or suites make the faculty and students produce and learn more? Does vertical circulation work as well as horizontal? Should we put faculty in close proximity to others working on the same projects or should we maximize interdisciplinary adjacencies? Which types of floor plans increase interaction? Which types of interaction produce the most knowledge created, disseminated, and preserved? Do we want to build buildings that encourage doors to be kept open, so as to make the faculty seem approachable, or should we try to keep doors closed so that they can get work done? In this field as in most others, a great deal can be learned by directly measuring the relevant outcome variable; in architecture, quite remarkably, this has only rarely been attempted.
Of course such evaluation is done all the time via qualitative judgment, but in almost every field of science where a sufficient fraction of the information can be quantified, statistical analysis beats human judgment. There is no reason to think that the same kind of statistical science wouldn't create enormous advances here too.
I have heard of a couple of isolated academic works on this subject, but we're talking about some of the most important and expensive decisions universities make (and among the biggest decisions businesses, and many other institutions, make too). There should be an entire subfield devoted to the subject. All it would take is some data collection and analysis. Outcome measures could include, for example, faculty citation rates, publications, awards, grants, and departmental rankings, along with student recruitment, retention, graduation, and placement rates. The key treatment variables would include various information on the types of buildings and architectural design. Random assignment seems infeasible, but relatively exogenous features might include departmental moves or city and town building restrictions. Universities that allow faculty the choice of buildings could also provide useful revealed preference measures. I would think that a few enterprising scholars on this path could have an enormous impact both in creating a new academic subfield and in improving a vitally important set of university (and societal) decisions.
In the interim, we'll enjoy the new buildings and hope they have a positive impact.
17 October 2005
This week, the Applied Statistics Workshop will be presenting a talk by Charles Kemp and Josh Tenenbaum of the Department of Brain and Cognitive Sciences at MIT. Their presentation, "Bayesian Models of Human Learning and Reasoning," will look at the way individuals make generalizations based on limited experience. The presenters argue that these generalizations should be thought of as Bayesian inferences over structured probabilistic models.
Charles Kemp is a Ph.D. student in the Department of Brain and Cognitive Sciences at MIT. He hails from Australia, and he received his undergraduate degree from the University of Melbourne. His research focuses on formal models of semantic knowledge and the use of those models for inductive inference. Josh Tenenbaum is the Paul E. Newman Career Development Professor in the Department of Brain and Cognitive Sciences. After receiving his Ph.D. from MIT in 1999, he joined the faculty at Stanford University, returning to MIT in 2002. He has published widely on the topic of human learning and inference, drawing on modeling methodologies including Bayesian statistics, graph theory, and the geometry of manifolds.
The presentation will be at noon on Wednesday, October 19 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.
One of the goals of this blog is to promote dialog between people working in different social science disciplines. As part of that, we have been posting reports from the Political Methodology conference in Tallahassee. Of course, even though we may all speak the same statistical language, we often speak it with distinct accents; similar concepts and methods often go by different names in different fields. For example, it turns out that estimating the ideal points of political actors is similar in many ways to the problem of estimating the difficulty of questions on standardized tests, a commonality that has only been exploited in the last few years.
First things first, however; what exactly is an ideal point? People have long thought about politics in spatial terms: "left" and "right" have been used to describe political preferences since at least the French Revolution, when royalists sat on the right and radicals on the left in the Legislative Assembly. Ideal point models attempt to estimate the position of each legislator on the left-right or other dimensions using the votes that they cast on legislation. Basically, the models assume that a legislator will vote in favor of a motion if it moves policy outcomes closer to their most preferred policy. The resulting estimates from these models provide a descriptive summary of the distribution of preferences within a legislature. They are also important parameters in many formal models of legislative behavior.
Much of the recent work in the area of ideal point estimation has drawn on earlier research by education scholars. Item response theory studies the relationship between the ability (and other characteristics) of test subjects and the answers they give to particular test questions. The general idea is that every test question has an associated ability cutpoint; those with ability above the cutpoint will answer correctly on average. In a typical testing situation, the test designers will attempt to include questions with an array of cutpoints in order to estimate the ability of the test takers.
The analogy between ability estimation and ideal point estimation is close; votes in the legislature correspond to questions on the test. One difference is that, in the item response context, the researcher will typically know the correct answer and can therefore associate those responses with higher estimated ability. In the ideal point context, it is not always clear whether a proposal moves policy left or right. Several recent articles have addressed this and other problems in translating item response models to the political context, including work by Harvard's own Kevin Quinn with Andrew Martin (Martin and Quinn 2002), Clinton, Jackman, and Rivers (2004), and Bafumi, Gelman, Park, and Kaplan (2005). Dan Hopkins described some recent work on ideal point estimation in an earlier post.
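The shared machinery can be sketched as a two-parameter logistic model, with legislators' ideal points playing the role that test-takers' abilities play in item response theory. Everything below (variable names, parameter values, the simulated data) is illustrative, not any particular published specification:

```python
import numpy as np

rng = np.random.default_rng(0)
n_leg, n_votes = 50, 200

theta = rng.normal(0, 1, n_leg)    # ideal points (cf. abilities in testing)
beta = rng.normal(0, 1, n_votes)   # discrimination: how sharply a vote divides left from right
alpha = rng.normal(0, 1, n_votes)  # difficulty: the cutpoint of each vote

# Two-parameter model: P(legislator i votes yea on item j) = logistic(beta_j*theta_i - alpha_j)
p_yea = 1 / (1 + np.exp(-(np.outer(theta, beta) - alpha)))
votes = rng.binomial(1, p_yea)

# Sanity check: on rightward-discriminating items, legislators further right vote yea more often
right_items = beta > 0
print(np.corrcoef(theta, votes[:, right_items].mean(axis=1))[0, 1])
```

The sign ambiguity discussed above shows up here directly: flipping the signs of theta and beta leaves the vote probabilities unchanged, which is why, absent known "correct answers," the left-right orientation must be pinned down some other way.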
14 October 2005
In large-N quantitative research, instrumental variables are often used to address the problem of endogeneity. In small-N qualitative research such as comparative historical case studies, researchers examine the historical sequence and the intervening causal process between the independent variable(s) and the dependent variable in order to establish causal direction and illuminate causal mechanisms (Rueschemeyer and Stephens 1997). However, careful examination of sequence and intervening process through process-tracing may not solve the problem of endogeneity. If Y affected X initially and X, in turn, influenced Y later, looking at the sequence and intervening causal process in the latter period without examining the earlier one will produce a misleading conclusion.
In my comparative historical case study of corruption in South Korea, relative to Taiwan and the Philippines, I attempted to test my hypothesis that income inequality increases corruption and to identify causal mechanisms. It was easy to show the correlations between inequality and corruption. Both inequality and corruption have been the highest in the Philippines and the lowest in Taiwan, with Korea in between. I found that the success of land reform in Korea and Taiwan produced much lower levels of inequality in assets and income than was true of the Philippines, where land reform failed. I provided plausible evidence that the different levels of inequality due to the success and failure of land reform accounted for different levels of corruption, and identified some causal mechanisms. Also, between Korea and Taiwan, I found that Korea's chaebol (large conglomerate)-centered industrialization and Taiwan's avoidance of economic concentration led to a divergence of inequality over time, which contributed to a divergence of corruption levels.
However, the process-tracing for the period after the success or failure of land reform and for the period after the adoption of different industrial policies was not sufficient to establish causal direction, because different levels of corruption might have influenced the success and failure of land reform as well as the industrial policy. Hence, I had to show that the success and failure of land reform was affected very little by corruption, but was largely determined by external factors such as the threat of communism and differences in US policy toward these countries. Also, I had to provide evidence that the initial adoption of different industrial policies by Park Chung-hee in Korea and by the KMT leadership in Taiwan was not affected by the different levels of corruption. Essentially, land reform and industrial policy played the role that instrumental variables play in statistical studies. They were exogenous events that produced different levels of inequality, and thereby different levels of corruption, but had not themselves been influenced by corruption. Thus, the idea of instrumental variables can be useful in qualitative research as well.
13 October 2005
There are tougher tasks than appeasing the human subject review board. A few weeks ago, I met Aldo Benini at the American Sociological Association annual meeting in Philadelphia. Benini has worked for various humanitarian organizations over the past decades and specializes in what strikes me as the most dangerous subfield of social science statistics: he collects, analyzes, and models data on the direct and indirect casualties of war.
I had come across Benini before when I saw a presentation on his work with the Global Landmine Survey, which involved building quantitative models to assist the ongoing mine cleanup in Vietnam. Recently, Benini has been working on estimating the number of civilian victims during the first nine months of Operation Enduring Freedom in Afghanistan following 9/11/01. There, field staff visited all 600 communities directly affected by fighting (both airstrikes and ground combat). This survey improves on previous estimates in the news, not least by being a virtual census of the affected communities, employing trained interviewers, and using standardized questionnaires. It's hard for me to imagine more dangerous conditions of data collection (but, wait, Benini currently works on a similar project in Iraq).
The resulting study establishes a number of important findings. It's also methodologically interesting. All told, 5,576 residents were killed violently between 9/11/01 and June 2002. Another 5,194 were injured. These numbers are considerably higher than previous estimates. I'm not going to rehash their entire analysis* here. But with respect to the methodological focus of this blog, I'd like to highlight the authors' conclusion that there's evidence that modern war apparently facilitates considerable underreporting of civilian losses.
*Including an interesting zero-inflated Poisson model for the concurrent and historical factors affecting the distribution of civilian victims in Afghanistan.
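A zero-inflated Poisson model mixes a point mass at zero (here, plausibly, communities with structurally no victims) with an ordinary Poisson count. A generic sketch of the probability mass function, not the authors' actual specification:

```python
import math

def zip_pmf(k, lam, pi):
    """P(count = k) under a zero-inflated Poisson:
    with probability pi the count is a structural zero;
    otherwise it is drawn from Poisson(lam)."""
    poisson_k = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        return pi + (1 - pi) * poisson_k
    return (1 - pi) * poisson_k

# Zero inflation raises P(0) above the plain Poisson value exp(-lam)
print(zip_pmf(0, lam=2.0, pi=0.3), math.exp(-2.0))
```

The appeal for count data like these is that excess zeros no longer have to be explained by the same process that generates the positive counts.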
12 October 2005
Since the early seventies, political scientists have been interested in the causal effects of incumbency, i.e. the electoral gain to being the incumbent in a district, relative to not being the incumbent. Unfortunately, these two potential outcomes are never observed simultaneously. Even worse, the inferential problem is compounded by selection on unobservables. Estimates are vulnerable to hidden bias because there probably is a lot of unobserved stuff that's correlated with both incumbency and electoral success (such as candidate quality, etc.) that you cannot condition on. To identify the incumbency advantage, estimates had to rely on rather strong assumptions. In a recent paper entitled "Randomized Experiments from Non-random Selection in U.S. House Elections", economist David Lee took an innovative whack at this issue. He employs a regression discontinuity design (RDD) that tackles the hidden bias problem based on a fairly weak assumption.
Somewhat ironically, this technique is rather old. The earliest published example dates back to Thistlethwaite and Campbell (1960). They examine the effect of scholarships on career outcomes by comparing students just above and below a threshold for test scores that determines whether students were granted the award. The underlying idea is that in the close neighborhood of the threshold, assignment to treatment is as good as random. Accordingly, unlucky students that just missed the threshold are virtually identical to lucky ones who scored just above the cutoff value. This provides a suitable counterfactual for causal inference. Take a look at the explanatory graph for the situation of a positive causal effect and the situation of no effect.
See the parallel to the incumbency problem? Basically, the RDD works in settings in which assignment to treatment changes discontinuously as a function of one or more underlying variables. Lee argues that this is exactly what happens in the case of (party) incumbency. In a two party system, you become the incumbent if you exceed the (sharp) threshold of 50 percent of vote share. Now assume that parties usually do not exert perfect control over their observed vote share (observed vote share = true vote share + error term with a continuous density). The closer the race, the more likely that random factors determine who ends up winning (just imagine the weather had been different on election day).
Incumbents that barely won the previous election are thus virtually identical to non-incumbents that barely lost. Lee shows that as long as the covariate that determines assignment to treatment includes a random component with a continuous density, treatment status close to the threshold is (in the limit) statistically randomized. The plausibility of this identification assumption is a function of the degree to which parties are able to sort around the threshold. And the cool thing is that you can even test whether this identifying assumption holds (at least for the observed confounders) by using common covariate balance tests.
There is no free lunch, of course. One potential limitation of the RDD is that it identifies the incumbency effect only for close elections. However, one could argue that when looking at the incumbency advantage, marginal districts are precisely the subpopulation of interest. It is only in close elections that the incumbency advantage is likely to make any difference. Another potential limitation is that the RDD identifies the effect of "party" incumbency, which is not directly comparable to earlier estimates of incumbency advantage that focused on "legislator" incumbency advantage. Party incumbency subsumes legislator incumbency, but also contains a separate party effect, and there is no chance to disentangle the two. So surely, the RDD is no panacea. Yet, it can be used to draw causal inferences from observational data based on weaker assumptions than previously employed in this literature.
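The logic of the design is easy to see in a simulation. Below, a stipulated "incumbency bump" of 5 points is recovered by comparing districts just above and just below the 50% cutoff; all numbers (noise scale, bandwidth, effect size) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Latent party strength plus election-day noise determines observed vote share;
# crossing the sharp 50% threshold makes the party the incumbent.
strength = rng.uniform(0.3, 0.7, n)
share_t = strength + rng.normal(0, 0.03, n)
incumbent = share_t > 0.5

true_effect = 0.05                        # stipulated incumbency bump
share_t1 = strength + true_effect * incumbent + rng.normal(0, 0.03, n)

# RDD estimate: compare mean next-election shares just above vs. just below 50%
h = 0.005                                 # bandwidth around the cutoff
near = np.abs(share_t - 0.5) < h
est = share_t1[near & incumbent].mean() - share_t1[near & ~incumbent].mean()
print(f"RDD estimate near the cutoff: {est:.3f} (true effect {true_effect})")
```

A naive comparison of all winners to all losers would be badly confounded by latent strength; restricting to the neighborhood of the threshold, where the election-day noise effectively randomizes who wins, removes almost all of that bias (at the cost of a much smaller effective sample, the "close elections only" limitation noted above).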
The Lee paper has led to a surge in the use of the RDD in political science. Incumbency effects have been re-estimated not only for US House elections, but also for other countries as diverse as India, Great Britain, and Germany. It has also been used to study split-party delegations in the Senate. There may be other political settings in which the RDD framework can be fruitfully applied. Think about it - before economists do :-)
11 October 2005
In my last blog entry (here), I wrote that associations like space can mess up the assumptions underlying standard estimation techniques. This entry is about the first problem I mentioned, spatial lag: when neighboring observations affect one another. Such dependencies can lead to inconsistent and biased estimates in an OLS model. And even if you don't care about "space" in a geographic sense, you might be interested in related topics like technology diffusion among farmers, network effects, countries that share the same membership in international organizations (an idea picked up in Beck, Gleditsch and Beardsley; see below) etc. The point is that spatial lag is pervasive in many contexts and though it might be called different names, the basic problem remains the same.
Spatial lag models are similar to lagged-dependent-variable autoregressive models in time series analysis, but the spatial autoregressive coefficient cannot be estimated as easily. To estimate it, a spatial weights matrix is needed, and it is often not clear what that matrix should look like, i.e., what the actual spatial relation is.
So how much can it matter? James LeSage (in an excellent guide to spatial econometrics and his MATLAB functions, also below) provides an example of OLS and spatial lag estimations of the determinants of house values. The idea is that -- apart from the influence of the independent variables like county population density or unemployment rates -- areas with high house values might be adjacent to other high value areas, and therefore there is a spatial trend in the outcome variable. The example shows that an interesting variable like population density can become statistically insignificant when spatial dependence is taken into account, and that coefficients of other variables can change in magnitude. In addition, taking spatial lag into account also improves the model fit.
So one should really take space into account if it matters. How would you know if it does? There are a number of tests to check for spatial lag, but for the most part, just starting to think about it helps.
For more information on spatial lag, take a look at these sources:
-- James LeSage's Econometrics Toolbox (www.spatial-econometrics.com), which has an excellent workbook discussing spatial econometrics and examples for the MATLAB functions provided on the same site; and
-- Beck, Gleditsch and Beardsley (draft of April 14, 2005) "Space is more than Geography: Using Spatial Econometrics in the Study of Political Economy" (http://www.nyu.edu/gsas/dept/politics/faculty/beck/becketal.pdf).
10 October 2005
This week's Applied Statistics Workshop presentation will be given by Rima Izem of the Harvard Statistics Department. After receiving her Ph.D. in Statistics from the University of North Carolina at Chapel Hill, Professor Izem joined the Harvard Statistics Department in 2004. Her research interests include statistical methodology in functional data analysis, spatial statistics, and non-parametric statistics. She has presented her work at conferences across North America and Europe. Her 2004 paper, "Analyzing nonlinear variation of thermal performance curves," won the Best Student Paper award from the Graphics and Computing Section of the American Statistical Association.
Professor Izem will present a talk on "Boundary Analysis of Unemployment Rates in Germany." The presentation will be at noon on Wednesday, October 12 in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.
7 October 2005
Over twenty years ago, J. Morgan Kousser wrote an article with the provocative title, "Are Expert Witnesses Whores? Reflections on Objectivity in Scholarship and Expert Witnessing" (6 The Public Historian 5 (1984)). In answering the rhetorical question largely in the negative, Professor Kousser recounted his own experience as an expert in litigation under the Voting Rights Act, an experience which, according to him, "afforded me the opportunity to tell the truth and do good at the same time."
As a historian of southern politics specializing in the post-Reconstruction and Progressive eras, Professor Kousser had concluded that at-large voting systems had a racially discriminatory impact upon disfavored minority groups, and that such systems were adopted for exactly that purpose. Having written on the subject, he was "'discovered'" by a civil rights attorney, retained, and stood ready to provide "window-dressing" in Section 2 cases challenging at-large systems when the Supreme Court decided Mobile v. Bolden, 446 U.S. 55 (1980). Without delving into legal technicalities, and oversimplifying somewhat, Mobile compelled Section 2 plaintiffs to produce evidence regarding the motives of those who adopted the voting schemes under challenge. In doing so, Mobile "made historians . . . necessary participants in voting rights cases" (at least until Congress removed the intent requirement by amending Section 2 in 1982), and so Professor Kousser ended up testifying in several pieces of litigation regarding the motives of those who adopted at-large voting systems and the effectiveness of such systems in achieving their framers' desires. After examining various meanings of bias and objectivity, and the threats to the latter in both expert witnessing and researching, Professor Kousser concludes his article with the statement, "Testifying and scholaring are about equally objective pursuits."
As a former litigator of employment discrimination and voting rights cases, I believe that Professor Kousser's vision of an expert witness is one few lawyers would recognize. As a budding statistician interested in the application of social science to the litigation setting, I assert (admittedly with slightly less certainty) that Professor Kousser's narrative would be unfamiliar to most expert witnesses as well. Few attorneys discover expert witnesses who have spent years studying a question critical to a case they are litigating, fewer still an expert who has reached the "right" answer. It is rare that scholars, having reached conclusions after years of study and research for academic purposes, suddenly discover that the law has evolved in a way that makes those conclusions relevant to pending (and, in Professor Kousser's case, high-profile) litigation.
I'll be using Professor Kousser's article as a springboard for a discussion on the relationship among courts, litigators, and expert witnesses in several blog posts. As is true of all members of the Content Committee of this blog, I remain eager for responses and comments.
(It should go without saying that I do not intend in any way to question Professor Kousser's honesty or integrity, either in the testimony he gave or in his 1984 article. In case it does not go without saying . . .).
6 October 2005
One of the key applications of cognitive science to the other social sciences can lie in testing some of the assumptions made about human psychology in other fields. A classic example of this is in economics: as I understand it, for a long time economists envisioned people as rational actors who act to increase their utility (usually measured by money) as much as they can. The classic results of Kahneman & Tversky, which earned the Nobel Prize, were among the first to show that, contrary to this assumption, in many spheres people act "irrationally." I am putting the word "irrational" in quotes because it's not that we act completely randomly or without motivation, simply that we do not always simply exist to maximize our utility: we use cognitive heuristics to calculate the value of things, we value money not as an absolute but with respect to many other factors (such as how much we already have, how things are phrased and sold to us, etc), and our attitudes towards money and maximizing are influenced by culture and the social situation. This means that models of human economic or group behavior are often only as good as the assumptions made about the people in them.
One researcher who studies these problems is Dan Ariely at MIT. In a recent line of research, he looks at what he calls two separate markets, the monetary and the social. The idea is that if people perceive themselves to be in a monetary market (one involving money), they are highly sensitive to the quantity of compensation, and will do less work if they receive less compensation. If, on the other hand, they perceive themselves to be in a social market (one in which no money is exchanged), they will not be concerned with the quantity of "social" compensation, such as the value of any gifts received.
I really liked this article, in part because (unusual for academic articles) it is kind of funny in places. For instance, their methodology consisted of having the participants do a really boring task and measuring how well their effort correlated to how much they were paid, in either a monetary or social market. The task is really grim: repeatedly dragging a computerized ball to a specific location on the screen. As the authors dryly state, "pretesting and post-experiment debriefing showed that our implementation continues in the grandest tradition of tasks that participants view as being utterly uninteresting and without any redeeming value." (I do not envy that debriefer!)
Funny parts aside, the point this research makes is really interesting: people approach the same task differently depending on what they think it is. When they are not compensated or compensated with a gift (a "social" exchange) they will expend a high amount of effort regardless of the value of the gift. When compensated with money or a gift whose monetary value they are told of, effort is proportional to the value of the compensation. Methodologically, this makes an important point -- if we want to model all sorts of aspects of the market or even social behavior, it's good to understand how our behavior changes as a function of how we conceptualize what is going on. From the cognitive science side, the question is why our behavior changes in this way, and in what instances this is so.
And the message for all of us? If we have a task we need help on, the authors suggest "asking friends and offering them dinner. Just do not tell them how much the dinner costs."
5 October 2005
Most disciplines define themselves through their field of inquiry; historians study events of the past and the evolving stories of those events, psychologists study the working of the mind, and political scientists study the interaction of governments and people. Economists take a different approach, though, identifying themselves not through subject matter but instead through methodology.
What are these tenets of methodology? While the precise delineation of one's field is always a tricky matter, I believe most economists would agree on three basic principles: Preferences, Optimization, and Equilibrium. In essence, economics operates under the assumption that people know what they want and then do their best (given limited means) to get it. Given these foundations, mathematics helps to formalize our intuition, since choosing the best alternative can be rewritten as the maximization of a function, often named "utility." In many cases, of course, people will fail miserably to achieve these goals. The problem might be a lack of information, or unforeseen costs, or any number of other obstacles; but, in economics, it cannot be that people simply do not want something that is better for them.
To many, this definition of economics will seem extraordinarily narrow, disallowing the study of a great many human phenomena. No doubt, in many cases, this observation is correct. But I believe it is exactly this methodological focus that has laid the foundations for the great success of economics in the past 70 years. As a foundation, the framework is straightforward and intuitive; why would someone not want something that, by definition, they prefer? Furthermore, the mathematical expression of economic ideas (a direct result of the assumption of optimization) has helped to lay bare the assumptions lurking behind arguments with great speed. And while I freely admit that economics cannot capture all relevant aspects of human behavior, it would seem a fool's errand to find a research design that could.
(A brief aside: Mathematics, in economics, is no more than a language for expressing ideas. It is extremely helpful in many situations, as is much jargon, for discussions among experts within the field. But, far too often, economists allow this language to become a barrier between them and the world. I suggest you hold all economists you meet to this standard: If they cannot explain the intuition behind an economic idea, using only standard English words, in five minutes, it is their fault and not yours!)
4 October 2005
This week's Applied Statistics Workshop presentation will be given by Andrew Thomas of the Harvard Statistics Department. Drew's talk, entitled "A Comparison of Strategies in Ice Hockey," considers the choices facing players and coaches during the course of a game. Should they prioritize possession of the puck, or its location on the rink? The paper presents a model that divides the play of the game into different states in order to estimate the probability of scoring or allowing goals conditional on a given starting state.
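The basic flavor of a state-conditional model like this can be sketched in a few lines. The state labels and the toy event log below are purely illustrative inventions of mine, not from Drew's paper; the point is only to show how "probability of scoring conditional on a starting state" reduces to a grouped frequency estimate in the simplest case:

```python
from collections import defaultdict

# Hypothetical event log: (state, scored) pairs, where `state` labels a
# game situation (e.g. which zone the puck is in and who has possession)
# and `scored` is 1 if the team scored before the state changed.
events = [
    ("off_zone_possession", 1), ("off_zone_possession", 0),
    ("off_zone_possession", 0), ("neutral_zone", 0),
    ("neutral_zone", 0), ("def_zone_possession", 0),
    ("off_zone_possession", 1), ("neutral_zone", 1),
]

def scoring_prob_by_state(events):
    """Estimate P(goal | starting state) as an empirical frequency."""
    goals = defaultdict(int)
    visits = defaultdict(int)
    for state, scored in events:
        visits[state] += 1
        goals[state] += scored
    return {s: goals[s] / visits[s] for s in visits}

probs = scoring_prob_by_state(events)
print(probs)
```

A real analysis would of course smooth these estimates and model transitions between states, but the conditional-probability skeleton is the same.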
Drew is currently a second-year Ph.D. candidate in the Statistics Department, having graduated from MIT in 2004 with a degree in Physics. He has presented his work at the Joint Statistical Meetings and the New England Statistical Symposium. He was born and raised in Toronto, Ontario, which may have something to do with his interest in hockey. And, most importantly, he is a fellow blogger on the Social Science Statistics blog. The presentation will be at noon on Wednesday, October 5 (coincidentally enough, opening night for the NHL) in Room N354, CGIS North, 1737 Cambridge St. Lunch will be provided.
Stephen D. Unwin made headlines (at least in the Odds and Ends section) two years ago with the publication of his book "The Probability of God". His idea was to determine, using a numerical method, whether conditions on earth would be enough to predict whether the Judeo-Christian construction of God does indeed exist.
Thankfully, the book is classified as humor. The actual problem being solved is somewhat irrelevant to the greater community, since matters of faith are conducted in the absence of fact. But it does represent the fringe of our discipline, and it shows how numbers are perceived in the real world.
In this "real world," there are too many examples of numbers distorted for the sake of an agenda. For example, that 4 out of 5 dentists choose a particular toothpaste to endorse tells us nothing about the sample size (or about a possible line of dentists they tossed beforehand). Sports statistics are mangled and mishandled all the time without a mention of sample size concerns or actual relevance. (The misuse of numbers in society is a favorite theme of mine; keep looking for it in my entries.)
At least Dr. Unwin has not only a clearly stated agenda behind his work, but also a clearly stated method and an acknowledgement of subjectivity. Unwin's calculation puts the probability of God's existence at 67%; Richard Dawkins, the famed atheist, used the same method and obtained a result of 2%, about 2% higher than Dawkins would otherwise be willing to admit.
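The mechanics behind a calculation of this kind amount to sequential Bayesian updating of a prior with subjective evidence factors. The sketch below is a generic illustration of that machinery, not Unwin's actual numbers; the prior and the likelihood ratios are made up, which is, in a sense, exactly the point about subjectivity:

```python
def update_probability(prior, likelihood_ratios):
    """Sequentially update a prior probability via Bayes' rule on the odds scale.

    Each likelihood ratio is P(evidence | hypothesis) / P(evidence | not hypothesis);
    values above 1 favor the hypothesis, values below 1 count against it.
    """
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Illustrative only: a 50/50 prior and three invented evidence factors.
posterior = update_probability(0.5, [2.0, 0.5, 4.0])
print(posterior)
```

Two analysts running this same procedure with different subjective factors will, naturally, land on very different posteriors, which is how 67% and 2% can emerge from "the same method."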
Most of this information came from a radio interview with the good Dr. Unwin. Stay tuned for the book review and a look at his technique.
3 October 2005
Has anybody figured out how to estimate multilevel hazard models with time-varying covariates in log-time metric (i.e., an accelerated failure time model)?
Together with two colleagues from the Medical School, I'm working on the effect of contextual variables on mortality. We're using a large longitudinal dataset of around half a million married couples and nine years of follow-up. Our key independent variable is time-varying. In recent years, much work has been done on multilevel hazard models, for example, that done by Harvey Goldstein and colleagues. But the standard recommendation for estimating such models in the presence of time-varying covariates is to approximate the Cox proportional hazards model using a conditional (i.e., fixed-effects) logistic regression, which makes hefty demands on memory. Given the size of our data, we can implement this standard strategy only for a subset of our data.
We are hoping that the log-time metric would make better use of memory and allow us to use the entire sample. The question is: has anybody already developed software to estimate multilevel hazard models with time-varying covariates in the log-time metric? Or can't it be done in principle? Either way, I'd be grateful for pointers.
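For readers unfamiliar with why time-varying covariates blow up memory, the standard setup splits each subject's follow-up into episodes, one row per interval over which the covariate is constant. The sketch below is a minimal, hypothetical illustration of that person-year expansion (the function name, fields, and data are mine, not from any particular package):

```python
def split_episodes(subject_id, follow_up_years, covariate_by_year):
    """Expand one subject into person-year records (episode splitting).

    Each row carries the time-varying covariate's value for that interval.
    With half a million couples over nine years, this long format is
    exactly what drives the memory demands discussed above.
    """
    rows = []
    for year in range(follow_up_years):
        rows.append({
            "id": subject_id,
            "start": year,
            "stop": year + 1,
            "covariate": covariate_by_year[year],
        })
    return rows

rows = split_episodes("couple_001", 3, {0: 0, 1: 0, 2: 1})
print(rows)
```

An accelerated failure time model in the log-time metric would be fit on data of essentially this shape; whether existing multilevel software can handle it at full scale is, of course, the open question above.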