March 2007

Authors' Committee

Matt Blackwell (Gov)
Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship



30 March 2007

"That looks cool!" versus "What does it mean?"

Every Sunday, I flip open the New York Times Magazine to the weekly social commentary, "The Way We Live Now," and I check out the accompanying data presentation graphic. First, I think, "That looks cool." Then, for the next several minutes, I wonder, "What does it mean?" I'm usually looking at an illustration like this:

I sat down to write this entry ready to argue that clarity is always more important than aesthetics when communicating with data, and that the media need to be better educated when it comes to data presentation. I still think those things. However, after a little googling, I discovered that Catalogtree (as in "Chart by Catalogtree" in the graphic above) is a Dutch design firm, not a research organization, and I started to wonder whether the Times knowingly prioritizes art over data for these graphics. Maybe communication is not the primary goal. This is, after all, a magazine, one that includes fashion spreads and a serial comic strip alongside its coverage of political and social issues.

How should a publication balance illustration and information? If I belong to a statistics department, am I allowed to say, "That looks cool!" and not point out that a chart is indecipherable? My gut reaction is that information should always win, but maybe I'm wrong - and I do like the designs. You can see some of Catalogtree's other creations for the Times here and their other work here.

Posted by Cassandra Wolos at 1:49 PM

29 March 2007

New York's anti-poverty experiment

The mayor of New York, Michael Bloomberg, announced today that the city is proceeding with its plan to fight poverty using cash incentives for school attendance, medical checkups, and the like. The first phase of the plan is an experimental test of the efficacy of the incentives. From the NY Times:

Under the program, which is based on a similar effort in Mexico but is believed to be the first of its kind in the nation, families would receive payments every two months for meeting any of 20 or so criteria per individual. The payments would range from perhaps $25 for an elementary school student’s attendance to $300 for greatly improved performance on a standardized test, officials said.

Conceived as an experiment, the program, first announced last fall and set to begin in September, is to serve 2,500 randomly selected families whose progress will be tracked against another 2,500 randomly selected families who will not receive the assistance.

Now, I think most of us in the social science statistical community would be very much in favor of this kind of evaluation. In fact, the degree to which these kinds of designs are becoming the standard for policy evaluation is an impressive change from the way projects were evaluated even twenty years ago. Gary King and several graduate students here at IQSS have been working on the evaluation of a similar project in Mexico involving the roll-out of Seguro Popular, a health insurance scheme for low-income Mexicans.
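
The design described in the article is a plain completely randomized experiment: 5,000 eligible families, half assigned to receive the incentives and half tracked as controls. A minimal sketch of that assignment step (the family IDs and seed here are hypothetical, of course):

```python
import random

rng = random.Random(2007)          # fixed seed so the assignment is reproducible
families = list(range(5000))       # stand-in IDs for the eligible families
rng.shuffle(families)

treatment = set(families[:2500])   # receive the cash incentives
control = set(families[2500:])     # tracked, but receive no payments
```

Randomizing at the family level like this is what licenses the comparison of outcomes between the two groups later on.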

On the other hand, the political scientist in me wonders if (when?) we are going to start to see pushback from those being experimented on (or, more likely, from the interest groups that purport to represent them). The image of 2,500 families randomly selected to not receive benefits probably doesn't do much to help the cause of people (like me) who would like to see more of this. How can we in the statistical community make these kinds of randomized field experiments more palatable (beyond saying, "you need to do this if you want the right answer")?

Posted by Mike Kellermann at 3:22 PM

28 March 2007

The singular of data is anecdote

Amy Perfors

This post started off as little more than some amusing wordplay brought on by the truism that "the plural of anecdote is not data". It's a sensible admonition -- you can't just exchange anecdotes and feel like that's the equivalent of actual scientific data -- but, like many truisms, it's not necessarily true. After all, the singular of data is anecdote: every individual datapoint in a scientific study constitutes an anecdote (though admittedly probably a quite boring one, depending on the nature of your study). A better truism would therefore be more like "the plural of anecdote is probably not data", which of course isn't nearly as catchy.

The post started that way, but then I got to thinking about it more and I realized that the attitude embodied by "the plural of anecdote is not data" -- while a necessary corrective in our culture, where people far more often go too far in the other direction -- isn't very useful, either.

A very important caveat first: I think it's an admirable goal -- definitely for scientists in their professional lives, but also for everyone in our personal lives -- to try, as far as possible, to make choices and draw conclusions informed not by personal anecdote(s) but rather by what "the data" show. Anecdote is notoriously unreliable: it's distorted by context and memory; because it's emotionally fraught, it's all too easy to weight anecdotes that resound with our experience more highly and discount those that don't; and, of course, the process of anecdote collection is hardly systematic or representative. For all of those reasons, my natural temptation is to distrust "reasoning by anecdote", and I think that's a very good suspicion to hone.

But... but. It would be too easy to conclude that anecdotes should be discounted entirely, or that there is no difference between anecdotes of different sorts, and that's not the case. The main thing that turns an anecdote into data is the sampling process: if attention is paid to ensuring not only that the source of the data is representative, but also that the process of data collection hasn't greatly skewed the results in some way, then it is more like data than anecdote. (There are other criteria, of course, but I think that's a main one).

That means, though, that some anecdotes are better than others. One person's anecdote about an incredibly rare situation should properly be discounted more than 1000 anecdotes from people drawn from an array of backgrounds (unless, of course, one wants to learn about that very rare situation); likewise, a collection of stories taken from the comments of a highly partisan blog where disagreement is immediately deleted -- even if there are 1000 of them -- should be discounted more than, say, a focus group of 100 people carefully chosen to be representative, led by a trained moderator.

I feel like I'm sort of belaboring the obvious, but I think it's also easy for "the obvious" to be forgotten (or ignored, or discounted) if its opposite is repeated enough.

Also, I think the tension between the "focus on data only" philosophy on one hand, and the "be informed by anecdote" philosophy on the other, is a deep and interesting one: in my opinion, it is one of the main meta-issues in cognitive science, and of course it comes up all the time in other areas (politics and policy, personal decision-making, stereotyping, etc.). The main reason it's an issue, of course, is that we don't have data about most things -- either because the question simply hasn't been studied scientifically, or because it has but, in an effort to "be scientific," the sample has been restricted enough that it's hard to know how well one can generalize beyond it. For a long time most studies in medicine used white men only as subjects; what then should one infer regarding women, or other genders? One is caught between the Scylla of using possibly inappropriate data, and the Charybdis of not using any data at all. Of course in the long term one should go out and get more data, but life can't wait for "the long term." Furthermore, if one is going to be absolutely insistent on a rigid reliance on appropriate data, there is the reductive problem that, strictly speaking, a dataset never allows you to logically draw a conclusion about anything other than itself. Unless it is the entire population, it will always be different from the population; the real question comes in deciding whether it is too different -- and as far as I can tell, aside from a few simple metrics, that decision is at least as much art as science (and is itself made partly on the basis of anecdote).

Another example, one I'm intimately familiar with, is the constant tension in psychology between ecological and external validity on the one hand, and proper scientific methodology on the other. Too often, increasing one means sacrificing the other: if you're interested in categorization, for instance, you can try to control for every possible factor by limiting your subjects to undergrad students in the same major, testing everyone in the same blank room at the same time of day, creating stimuli consisting of geometric figures with a clear number of equally-salient features, randomizing the order of presentation, etc. You can't be completely sure you've removed all possible confounds, but you've done a pretty good job. The problem is that what you're studying is now so unlike the categorization we do every day -- which is flexible, context-sensitive, influenced by many factors of the situation and ourselves, and about things that are not anything like abstract geometric pictures (unless you work in a modern art museum, I suppose) -- that it's hard to know how it applies. Every cognitive scientist I know is aware of this tension, and in my opinion the best science occurs right on the tightrope - not at the extremes.

That's why I think it's worth pointing out why the extreme -- even the extreme I tend to err on -- is best avoided, even if it seems obvious.

Posted by Amy Perfors at 10:06 AM

27 March 2007

The answer is -3.9% (plus or minus 17.4%)

The government released its report on new home sales for the month of February; here is how the story was reported by Reuters (as seen on the New York Times website):

WASHINGTON, March 26 (Reuters) — Sales of new homes unexpectedly fell in February, hitting their lowest level in nearly seven years, according to a report released on Monday. New-home sales slid 3.9 percent, to an annual rate of 848,000 units, the lowest since June 2000, from a downwardly revised pace of 882,000 in January, the Commerce Department said. Sales for November and December were revised down as well.

And here is the Census Bureau press release:

Sales of new one-family houses in February 2007 were at a seasonally adjusted annual rate of 848,000, according to estimates released jointly today by the U.S. Census Bureau and the Department of Housing and Urban Development. This is 3.9 percent (±17.4%)* below the revised January rate of 882,000 and is 18.3 percent (±12.2%) below the February 2006 estimate of 1,038,000.

There are several amazing things about this. First, with all of the resources of the federal government, we can't get better than a 17.4% half-width for a 90% confidence interval? Second, people treat these point estimates like they mean something; the DJIA dropped by about 50 points after this "news" hit the wires. And finally, why can't I get stuff published with confidence intervals that wide?
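
To spell out what that ±17.4% means: the reported drop is so imprecisely estimated that the 90% interval comfortably includes zero -- and a double-digit increase. A quick check of the arithmetic from the press release:

```python
jan_rate, feb_rate = 882_000, 848_000   # seasonally adjusted annual rates
pct_change = 100 * (feb_rate - jan_rate) / jan_rate
half_width = 17.4                       # 90% CI half-width, in percentage points

lo, hi = pct_change - half_width, pct_change + half_width
print(f"{pct_change:.1f}% (90% CI: {lo:.1f}% to {hi:.1f}%)")
# prints: -3.9% (90% CI: -21.3% to 13.5%)
```

Since the interval straddles zero, the "unexpected fall" reported by Reuters is statistically indistinguishable from no change at all.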

Posted by Mike Kellermann at 5:30 PM

26 March 2007

Judicial Drift?

In light of Jim's post below, it is worth pointing out an ongoing conversation at the Northwestern Law Review on ideological change on the Supreme Court. The discussion was prompted by a forthcoming article entitled "Ideological Drift among Supreme Court Justices: Who, When, and How Important?", authored by a who's who of empirical court scholars: Lee Epstein, Andrew Martin, Jeffrey Segal, and our own Kevin Quinn. In addition to their comments on the article, there is a response by Linda Greenhouse, who covers the Supreme Court for the New York Times. (It also got a plug in the Washington Post this morning).

I'm more sympathetic to the project of modelling judicial decisions than I take Jim to be; I think that the ideal point framework gives us a useful way of thinking about the preferences of political actors, including judges. On the other hand, his points about precedent and interference across units are well-taken. Consider the following graph, which appears in the Epstein et al. paper:

It is explained as the estimated probability of a "liberal" vote by Justice O'Connor in two of the key social policy cases decided by the Court in the past few years: Lawrence (which struck down Texas' anti-sodomy law) and Grutter (upholding the University of Michigan's law school admissions policy; the undergraduate policy was struck down in Gratz v. Bollinger). I assume that these probabilities were calculated by combining the posterior distribution of the case parameters in Lawrence and Grutter with the posterior distribution of O'Connor's ideal point in each year. Fair enough, but what does this actually mean? If Grutter had come before the Court in 1985, it would not have been Grutter. I don't say this to be flippant: the University of Michigan used different admissions policies in the 1980s (in fact, when I went to Michigan as an undergrad, I was admitted under a different policy than the procedure struck down in Gratz); Adarand, Hopwood, and related cases would not have been on the books; and so on. I just don't see how the implied counterfactual ("What is the probability that O'Connor would cast a liberal vote if Grutter had been decided in year X?") makes any sense.
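
For readers unfamiliar with how such a probability is computed: in a one-dimensional item-response (ideal point) model, the probability of a "liberal" vote is typically a normal CDF applied to a linear function of the justice's ideal point, averaged over posterior draws. A generic sketch -- the symbols and parameterization here are illustrative, not the exact ones used by Epstein et al.:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_liberal(theta_draws, alpha, beta):
    """Average over posterior draws of the justice's ideal point theta,
    given a case-specific cutpoint alpha and discrimination beta."""
    return sum(phi(alpha + beta * t) for t in theta_draws) / len(theta_draws)

# With the ideal point exactly at the cutpoint, the vote is a coin flip:
# p_liberal([0.0], alpha=0.0, beta=1.0) == 0.5
```

The counterfactual in the graph amounts to plugging each year's draws of theta into the *same* case parameters -- which is exactly the step whose meaning the paragraph above questions.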

Posted by Mike Kellermann at 3:20 PM

Applied Statistics - Spring Break

As many of you know, Harvard is on spring break this week, so the Applied Statistics Workshop will not meet. Please join us next Wednesday, April 4, for a presentation by Professor Richard Berk of the University of Pennsylvania. And for those of you at Harvard, enjoy some time off (or at least some time without students!).

Posted by Mike Kellermann at 8:19 AM

21 March 2007

Efficient Vacationing, Summer 2007

With the ice melting and the birds chirping, it's time again to plan the summer. Here are a few worthwhile reasons not to be stuck behind your desk all summer. Maybe these are not the most exotic events and locations, but at least they are 'productive' and you won't feel guilty for being away.

The Michigan Summer Institute in Survey Research Techniques runs several sessions over a total of eight weeks from June 4 to July 27. The courses are mainly about designing, writing and testing surveys, and analyzing survey data. The level of the courses differs but they have some advanced courses on sampling and analysis. Because of a modular setup, it's possible to pick and choose broadly. I've heard good things about this institute, particularly from people who want to collect their own data.

Also in Michigan is the Summer Program in Quantitative Methods of Social Research, which runs two sessions from June 25 to August 17. This program focuses on analysis and also caters to different levels of sophistication. I only know a few people who attended this program, with mixed reviews. Much seems to depend on which courses you actually take; some are great and others so-so.

The University of Chicago hosts this year's Institute on Computational Economics from July 30 to August 9. The topics are quite advanced and focus on programming approaches to economic problems. This seems to be quite worthwhile, if that's your area of interest.

Further afield is the Mannheim Empirical Research Summer School from July 8 to 20. This event focuses on the analysis of household data but also features sessions on experiment design and behavioral economics. I haven't heard reports from previous years but would be curious to find out.

There are other summer schools that don’t have a strong methods focus. Harvard, LSE and a host of other universities offer a number of courses that might provide a quick dip into some of the substantive topics.

Posted by Sebastian Bauhoff at 6:19 PM

Applied Stats slides

For anyone who couldn't make it today, the slides from the Applied Stats talk given by Ken Kleinman are now posted at the course website.

Posted by Mike Kellermann at 4:28 PM

20 March 2007

Judicial Decisions as Data Points

Empirical, particularly quantitative empirical, scholarship is all the rage these days in law schools. (By the way, as a quantitative legal empiricist, I find that a little nerve-wracking. If there's one constant in legal academia, it's that things go in and out of style as fast in law schools as they do in Milan fashion shows.)

One thing that has been bothering me lately about this next-phase, new-wave, dance-craze aspect of legal scholarship is the use of appellate cases as datapoints. It's tempting to think that one can code appellate decisions or judicial opinions pursuant to some neutral criteria, then look for trends, tease out inferences of causation, etc. Here's a note of caution: they're not i.i.d. They're probably not i.i.d. given X (whatever X is). Precedent matters. In our legal system, the fact that a previous appellate case (with a published opinion) was decided a certain way is a reason to decide a subsequent, facially similar appellate case the same way, even if the first decision might have been (arguably) wrong. Folks will argue over how much precedent matters; all I can say is that as a law clerk to an appellate judge, I participated in numerous conversations that resulted in the sentiment, "I might/would have decided the present case differently had Smith v. Jones not been on the books, but I see no grounds for departing from the reasoning of Smith v. Jones here." I.i.d. models, or analyses that assume non-interference among units, should be viewed with great caution in this setting.
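
The point can be made concrete with a toy simulation: suppose each new case simply follows the most recent precedent with some probability, and is otherwise decided fresh. Even this crude model (all parameters here are invented for illustration, not estimates of anything) produces decisions that are strongly serially dependent, so an i.i.d. model would badly misdescribe them:

```python
import random

def simulate_decisions(n, p_liberal=0.5, p_follow=0.7, seed=0):
    """Toy model of precedent: each case copies the previous outcome
    with probability p_follow, else is decided by a fresh coin flip."""
    rng = random.Random(seed)
    decisions = [int(rng.random() < p_liberal)]
    for _ in range(n - 1):
        if rng.random() < p_follow:
            decisions.append(decisions[-1])      # follow Smith v. Jones
        else:
            decisions.append(int(rng.random() < p_liberal))
    return decisions

d = simulate_decisions(10_000)
agree = sum(a == b for a, b in zip(d, d[1:])) / (len(d) - 1)
# agree comes out around 0.85, far above the 0.5 an i.i.d. model predicts
```

Coding each decision and regressing on case features would recover the marginal rate of liberal outcomes, but any inference that assumed independence across cases would be using the wrong standard errors at best.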

Posted by James Greiner at 4:40 PM

19 March 2007

Applied Statistics - Ken Kleinman

This week, the Applied Statistics Workshop will present a talk by Ken Kleinman, associate professor in the Department of Ambulatory Care and Prevention at the Harvard Medical School. Professor Kleinman received his Sc.D. from the Harvard School of Public Health. He has published widely in journals in medicine and epidemiology. His statistical research centers mainly on methods for clustered and longitudinal repeated measures data. Much of his recent work focuses on spatial surveillance for public health, with a particular interest in applications related to problems in detecting bioterrorism.

Professor Kleinman will present a talk entitled "Statistical issues (and some solutions) around spatial surveillance for public health." The presentation will be at noon on Wednesday, March 21 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided.

Posted by Mike Kellermann at 8:12 AM

18 March 2007

Three-way ties and Jeopardy: Or, Drew questions the odds

It's been in the news that a three-way tie happened on Jeopardy on Friday night. From the AP article:

The show contacted a mathematician who calculated the odds of such a three-way tie happening — one in 25 million.

I have to believe that the mathematician contacted didn't have all the facts (and the AP rushed to meet deadline), because once you're in Final Jeopardy there's little randomness about it. It's all down to game theory.

Suppose we first estimate the odds that all three players are tied at the end of Double Jeopardy. The total dollar value shared by all three is around $30,000, or about $10,000 each. Since questions have dollar values that are multiples of $200, we could reasonably assume that there are about 100 dollar values, between $0 and $20,000, where each player can end up. So the odds of a three-way tie at this stage should be around one in ten thousand -- and that's a conservative figure, since I assume the scores are equally likely, whereas a central mode around $10,000 would make a tie even more likely.
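
The arithmetic under the uniform assumption is easy to check exactly -- with roughly 100 equally likely score values, a three-way tie comes out to about 1 in 10,000 -- and the same calculation shows that a peaked distribution pushes the probability higher. The triangular weights below are a crude, purely illustrative stand-in for a central mode:

```python
def three_way_tie_prob(probs):
    """Exact probability that three independent draws from the same
    score distribution all coincide."""
    return sum(p ** 3 for p in probs)

n = 101                                  # scores 0, 200, ..., 20000
uniform = [1.0 / n] * n

# Triangular weights peaked at the middle: a crude stand-in for a
# central mode around $10,000 (illustrative only, not estimated).
raw = [min(i + 1, n - i) for i in range(n)]
peaked = [w / sum(raw) for w in raw]

p_uniform = three_way_tie_prob(uniform)  # about 9.8e-05, i.e. ~1 in 10,000
p_peaked = three_way_tie_prob(peaked)    # larger: a mode makes ties likelier
```

Either way, the score-tie step alone is orders of magnitude more likely than 1 in 25 million.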

Breaking a three-way tie with a Final Jeopardy question would then require that all three players bet the same amount, and I think the odds are considerably less than 1 in 20 against them all betting the farm, no matter the category.

But it shouldn't even get that far. The scenario on Friday night had two players tied behind the leader who didn't have a runaway. So we have somewhere around 1 in 20,000 odds that this would happen (the factor of two because the third player could be ahead or behind the tied players.)

The runners-up would both be highly likely to bet everything in order to get past the leader. And the leader, in this case, placed a tying bet for sound strategic reasons -- getting one more day against known opposition rather than taking the chance of a new superstar appearing the next day -- not to mention the chance to look magnanimous by giving away someone else's money.

Even if the leader only had a 10% chance of making that call, and given that the other two players were pressured to bet high, that's still 1 in 200,000 - over 100 times more likely with a fairly conservative estimation process.

Posted by Andrew C. Thomas at 11:14 PM

16 March 2007

March Madness

As we often say, one of the goals of this blog is to share the conversations that take place around the halls of IQSS. Well, the conversations at the Institute (along with those in just about every other office in the country) have been heavily slanted toward college basketball this week. As I've posted here before, the relationship between sports and statistics has been profitable for both sides. And so, in that spirit, here are links to some recent papers on the NCAA Men's Basketball Tournament:

Identifying and Evaluating Contrarian Strategies for NCAA Tournament Pools

These authors (biostatisticians associated with the University of Minnesota) tackle one of the most important questions surrounding March Madness: how do I maximize my chances of winning the office pool? They find that, in pools that do not reward picking upsets, strategies that maximize the expected score in the pool do not necessarily maximize the chances of winning the pool, since these brackets look too much like the brackets of other players. Too late for this year, but maybe you'll get some pointers for next year. For another paper that comes to a similar conclusion, take a look at Optimal Strategies for Sports Betting Pools.
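
The intuition behind that finding fits in a few lines: in a toy one-game pool where every opponent picks the favorite, the expected-score-maximizing pick (the favorite) is a poor pool-winning pick, because a correct call gets split with everyone who made the same choice. All the numbers here are invented for illustration:

```python
def p_win_pool(pick_favorite, n_opponents, p_fav=0.7):
    """Toy pool: one game, every opponent picks the favorite, and ties
    split the pot uniformly. Returns your probability of winning."""
    if pick_favorite:
        return p_fav / (n_opponents + 1)   # right, along with everyone else
    return 1.0 - p_fav                      # the upset pick stands alone

# In a 10-person pool: the favorite pick wins about 7% of the time,
# while the lower-expected-score upset pick wins about 30% of the time.
```

The real papers work this logic through full 64-team brackets, but the driving force is the same: you profit from being right where others are wrong, not merely from being right.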

March Madness is (NP-) Hard

Since it is too late to change your picks for this year, is there a way to tell when you don't need to pay attention anymore because you have no chance of winning? A group of computer scientists from MIT consider this question and show that the generic problem of determining whether a particular participant has been mathematically eliminated is NP-complete. "Even if a participant were omnipotent in the sense that he could control the outcome of any remaining games, he still would not be able to efficiently determine whether doing so would allow him to win the pool." Of course, in a finite tournament with a finite number of players in the pool, it is possible to determine who could still win the pool. I haven't been eliminated yet, but things aren't looking too good.
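
For an ordinary office pool you can indeed check elimination by brute force over every remaining outcome -- it just takes time exponential in the number of games left, which is exactly the cost the NP-hardness result says you cannot avoid in general. A sketch, with a deliberately simplified scoring scheme standing in for real pool rules:

```python
from itertools import product

def still_alive(picks, scores, games):
    """picks:  entrant -> {game: predicted winner}
    scores: entrant -> current points
    games:  game -> (point value, (teamA, teamB)) for unplayed games.
    Returns the entrants who top the pool (possibly tied) under at least
    one outcome of the remaining games. Runs in O(2^len(games)) time."""
    names = list(games)
    alive = set()
    for outcome in product(*[games[g][1] for g in names]):
        final = dict(scores)
        for g, winner in zip(names, outcome):
            for entrant, entrant_picks in picks.items():
                if entrant_picks.get(g) == winner:
                    final[entrant] += games[g][0]
        top = max(final.values())
        alive |= {e for e in final if final[e] == top}
    return alive
```

With one game left worth a single point, for instance, an entrant two points behind is eliminated no matter the outcome; the function reports only the leader as alive.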

Posted by Mike Kellermann at 3:00 PM

14 March 2007

Who makes a good peer reviewer?

Amy Perfors

One of the interesting things about accruing more experience in a field is that as you do so, you find yourself called upon to be a peer reviewer more and more often (as I'm discovering). But because I've never been an editor, I've often wondered what this process looks like from that perspective: how do you pick reviewers? And what kind of people tend to be the best reviewers?

A recent article in the (open-access) journal PLoS Medicine speaks to these questions. Even though it's in medicine, I found the results somewhat interesting for what they might imply or predict about other fields as well.

In a nutshell, this study looked at 306 reviewers from the journal Annals of Emergency Medicine. Each of the 2,856 reviews (of 1,484 separate manuscripts) had been rated by the editors of the journal on a five-point scale (1=worst, 5=best). The study simply tried to identify what characteristics of the reviewers could be used to predict the effectiveness of the review. The basic finding?

Multivariable analysis revealed that most variables, including academic rank, formal training in critical appraisal or statistics, or status as principal investigator of a grant, failed to predict performance of higher-quality reviews. The only significant predictors of quality were working in a university-operated hospital versus other teaching environment and relative youth (under ten years of experience after finishing training). Being on an editorial board and doing formal grant (study section) review were each predictors for only one of our two comparisons. However, the predictive power of all variables was weak.

The details of the study are somewhat helpful for interpreting these results. When I first read that younger was better, I wondered to what extent this might simply be because younger people have more time. After looking at the details, I think this interpretation, while possible, is doubtful: the youngest cohort were defined as those that had less than ten years of experience after finishing training, not those who were largely still in grad school. I'd guess that most of those were on the tenure-track, or at least still in the beginnings of their career. This is when it's probably most important to do many many things and be extremely busy: so I doubt those people have more time. Arguably, they might just be more motivated to do well precisely because they are still young and trying to make a name for themselves -- though I don't know how big of a factor it would be given the anonymity of the process: the only people you're impressing with a good review are the editors of the journals.

All in all, I'm not actually that surprised that "goodness of review" isn't correlated with things such as academic rank, training in statistics, or being a good PI: not that those things don't matter, but my guess would be that nearly everyone who's a potential reviewer (for what is, I gather, a fairly prestigious journal) would have sufficient intelligence and training to be able to do a good review. If that's the case, then the best predictors of reviewing quality would come down to more ineffable traits like general conscientiousness and motivation to do a good review... This interpretation, if true, implies that a good way to generate better reviews is not to just choose big names, but rather to make sure people are motivated to put the time and effort into those reviews. Unfortunately, given that peer review is largely uncredited and gloryless, it's difficult to see how best to motivate them.

What do you all think about the idea of making these sort of rankings public? If people could put them on their CV, I bet there would suddenly be a lot more interest in writing good reviews... at least for the people for whom the CV still mattered.

Posted by Amy Perfors at 6:45 PM

13 March 2007

Which Color for your Figure?

Ever wondered what the best colors for your graphs would be? While common in the sciences, it may be fair to say that the use of color in graphs is still under-appreciated in many social science fields. Color can be a very effective tool for visualizing data in many forms, because color is essentially a three-dimensional concept:

- hue (red, green, blue)
- value/lightness: (light vs. dark)
- saturation/chroma (dull vs. vivid)

From my limited understanding of this topic, not much scientific knowledge exists about how color is best used. However, a few general principles have emerged from the literature. For example, sequential information (an ordering) is often best indicated through distinctions in lightness. The tricky part here is that indicating sequence with colors requires the viewer to remember the color ordering, so a small number of colors should be used. One principle that is sometimes advocated is the use of a neutral color midpoint, which makes sense when there is a "natural" midpoint in the data. If so, you may want to distinguish above and below the midpoint, and use dark color1 -> light color1 -> white -> light color2 -> dark color2 (e.g., dark blue to dark red). If no natural midpoint exists, one option is to use a single hue and just vary lightness (e.g., white/pink to dark red). Another idea is that categorical distinctions are best indicated through hue (e.g., red = higher than average, blue = lower than average). Read Edward Tufte and the cites therein for more ideas on the use of color. In addition, a nice online tool that helps you choose colors in a principled way is ColorBrewer, a website definitely worth a visit. Many of the color schemes advocated there are also available in R via the RColorBrewer package. Good luck!
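
The dark-to-light-to-dark diverging scheme is easy to build by hand if your plotting tool lacks one: linearly interpolate from a dark hue up to a neutral white midpoint and back down to a second dark hue. A sketch (the endpoint colors below are arbitrary choices, not ColorBrewer's palettes):

```python
def diverging_scale(n, low=(0.0, 0.0, 0.55), high=(0.55, 0.0, 0.0)):
    """RGB triples in [0, 1] ramping dark `low` -> white -> dark `high`.
    n must be odd so the neutral white midpoint is an actual entry."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    white = (1.0, 1.0, 1.0)
    half = n // 2
    lerp = lambda a, b, t: tuple(a[i] + (b[i] - a[i]) * t for i in range(3))
    left = [lerp(low, white, i / half) for i in range(half)]
    right = [lerp(white, high, (i + 1) / half) for i in range(half)]
    return left + [white] + right

colors = diverging_scale(7)   # colors[3] is the neutral white midpoint
```

Mapping data values symmetrically around the natural midpoint onto this scale gives exactly the dark blue -> white -> dark red effect described above.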

Posted by Jens Hainmueller at 11:14 PM

12 March 2007

Applied Statistics - Christopher Zorn

This week, the Applied Statistics Workshop will present a talk by Christopher Zorn, associate professor of political science at the University of South Carolina. Professor Zorn received his Ph.D. from The Ohio State University and was on the faculty at Emory University from 1997 to 2005. He has served as program director for the NSF Program on Law and Social Science. His work has appeared in numerous journals, including the American Political Science Review and Political Analysis. While much of his work has looked at judicial politics in the United States, his interests are broad, extending from "The etiology of public support for the designated hitter rule" (joint with Jeff Gill) to “Agglomerative Clustering of Rankings Data, with an Application to Prison Rodeo Events.”

Professor Zorn will present a talk entitled "Measuring Supreme Court Ideology," which is based on joint work with Greg Caldeira. The slides are available from the workshop website. The presentation will be at noon on Wednesday, March 14 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided.

Posted by Mike Kellermann at 6:58 PM

9 March 2007

Replication is hard...

...particularly when the data keeps changing. The ability to replicate results is essential to the scientific enterprise. One of the great benefits of experimental research is that, in principle, we can repeat the experiment and generate a fresh set of data. While this is impossible for many questions in social science, at a minimum one would hope that we could replicate our original results using the same dataset. As many students in Gov 2001 can tell you, however, social science often fails to clear even that low bar.

Of course, even this type of replication is impossible if someone else has changed the dataset since the original analysis was conducted. But that would never happen, right? Maybe not. In an interesting paper, Alexander Ljungqvist, Christopher Malloy, and Felicia Marston take a look at the I/B/E/S dataset of analyst stock recommendations "made" during the period from 1993 to 2000. Here is what they found:

Comparing two snapshots of the entire historical I/B/E/S database of research analyst stock recommendations, taken in 2002 and 2004 but each covering the same time period 1993-2002, we identify tens of thousands of changes which collectively call into question the principle of replicability of empirical research. The changes are of four types: 1) The non-random removal of 19,904 analyst names from historic recommendations (“anonymizations”); 2) the addition of 19,204 new records that were not previously part of the database; 3) the removal of 4,923 records that had been in the data; and 4) alterations to 10,698 historical recommendation levels. In total, we document 54,729 ex post changes to a database originally containing 280,463 observations.

Our main contribution is to document the characteristics and effects of these pervasive changes. The academic literature on analyst stock recommendations, using I/B/E/S data, is truly vast: As of December 12, 2006, Google Scholar identifies 565 articles and working papers using the keywords “I/B/E/S”, “analysts”, and “recommendations”. Given this keen academic interest, as well as the intense scrutiny that research analysts face in the marketplace and the growing popularity of trading strategies based on analyst output, changes to the historical I/B/E/S database are of obvious interest to academics and practitioners alike. We demonstrate that the changes have a significant effect on the distribution of recommendations, both overall and for individual stocks and individual brokerage firms. Equally important, they affect trading signal classifications, back-testing inferences, the track records of individual analysts, and models of analysts’ career outcomes in the years since the changes occurred. Regrettably, none of the changes can easily be “undone” by researchers, which makes replicating extant studies difficult. Our findings thus have potentially important ramifications for existing and future empirical studies of equity analysts.

Not surprisingly, they find that these changes typically make it appear as if analysts were (a) more cautious and (b) more accurate in their predictions. The clear implication from the paper is that analysts and their employers had a vested interest in selectively editing this particular dataset; while I doubt that anyone cares enough about most questions in political science to do something similar, it is an important cautionary tale. The rest of their paper, "Rewriting History," is available from SSRN. (Hat tip: Big Picture)
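The snapshot comparison at the heart of their paper is easy to sketch in miniature. Below is a hedged illustration in Python; the record IDs, field names, and values are invented for illustration (the real I/B/E/S database is of course far larger), but the logic of classifying ex post changes into removals, additions, and alterations is the same:

```python
def diff_snapshots(old, new):
    """Classify ex post changes between two snapshots of the same
    historical database, each a dict mapping record id -> record."""
    removed = sorted(k for k in old if k not in new)
    added = sorted(k for k in new if k not in old)
    altered = sorted(k for k in old if k in new and old[k] != new[k])
    return {"removed": removed, "added": added, "altered": altered}

# Two hypothetical snapshots of the "same" historical record.
snap_2002 = {
    1: {"analyst": "A. Smith", "rec": 2},
    2: {"analyst": "B. Jones", "rec": 4},
    3: {"analyst": "C. Lee",   "rec": 3},
}
snap_2004 = {
    1: {"analyst": "A. Smith", "rec": 2},  # unchanged
    2: {"analyst": None,       "rec": 4},  # "anonymized" -> altered
    4: {"analyst": "D. Kim",   "rec": 1},  # added record
}                                          # record 3 removed

changes = diff_snapshots(snap_2002, snap_2004)
print(changes)  # {'removed': [3], 'added': [4], 'altered': [2]}
```

The point of the exercise: any researcher holding only the 2004 snapshot has no way to recover the 2002 history, which is exactly why the authors argue the changes cannot easily be "undone."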

Posted by Mike Kellermann at 4:01 PM

8 March 2007

Is there a "best diet"?

Janet Rosenbaum (guest blogger)

The public often reports being confused by contradictory diet studies, and there is some effort to find the "best diet", but is that the right question to be asking? A study released today in JAMA compared four common diets in 311 overweight or obese women over a 12-month period.

Most weight loss occurred within the first two months, with no visible change for three of the four diets between months 2 and 6. Given the amount of weight that these women could lose, some have commented that the effect sizes seem fairly small.

While I'm of course a fan of randomized controlled trials, I'm not sure that an RCT answers the most salient question here. An RCT tells us how much weight people will lose, on average, on each diet. While understanding average behavior may have implications for our understanding of human biology, in practice the most important question for an overweight person and their health care provider is which diet will be best for them, given their assessment of why they are overweight, which diets have worked for them in the past, and their personal tastes.

People may differ substantially across these factors. Someone who eats 100 calories too much at every meal may need to employ different strategies than someone who eats a 500-calorie snack every other day, even though they have the same calorie surplus. Likewise, someone with a tendency to eat too much of a given food category needs to know whether moderation or total abstinence is the better long-term strategy. My sense of the research is that there is quite a lot of psychological research on strategies for good short-term outcomes, but no RCTs focus on the medical questions of long-term outcomes.

Weight loss plans employ different strategies --- for instance, Weight Watchers tries for moderation, while Atkins advocates abstinence --- but studying the individual plans confounds the question of which strategies are best with the other characteristics across which the plans differ, and it averages effects over groups of individuals with heterogeneous reasons for being overweight.

It seems to me that weight loss research needs to determine whether there are in fact distinct groups of overweight people, and to focus studies more narrowly on those groups.

Studying more homogeneous groups on a more limited set of questions would answer the questions that are most relevant for clinicians and individuals, although it would be more expensive.

Posted by Mike Kellermann at 11:53 AM

7 March 2007

More on Cheating

In my last post, I solicited comments on ways to cheat when using a design-before-analysis framework for analyzing observational studies. My claim was that if one does the hard work of distinguishing intermediate outcomes from covariates (usually followed by discarding the former) and of balancing the covariates (often done by discarding non-comparable observations) without access to the outcome variable, it should be hard(er) to cheat. Felix suggested one way that should work but that should also be fairly easy to spot: temporarily substitute a "good" (meaning highly predictive of the outcome variable) covariate for the outcome and find a design that achieves the desired result, then use this design with the "real" outcome. In a comment, Mike suggested another way: do honest observational studies, but don't tell anyone about those that don't come to the desired results.

Here's my thought: in many observational settings, we have a strong prior that there is either an effect in a particular direction or no effect at all. In an anti-discrimination lawsuit, for example, the issue is whether the plaintiff class is suffering from discrimination. There is usually little chance (or worry) that the plaintiff class is in fact benefiting from discrimination. Thus, the key issue is whether the estimated causal effect is statistically (and practically/legally) significant. With that in mind, it seems like a researcher might be able to manipulate the distance metric essential to any balancing process. When balancing, we have to define (a) a usually one-dimensional distance metric to decide how close observations are to one another, and (b) a cutoff point beyond which we say observations are too far from one another to risk inference, in which case we discard the offending observations. If one side of a debate (e.g., the defendant) has an interest in results that are not statistically significant, that side can insist on distance metrics and cutoff points that result in discarding (as too far away from their peers) a great many observations. A smaller number of observations generally means less precision and a lower likelihood of a significant result. The other side can, of course, do the opposite.
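To make the mechanics concrete, here is a hedged sketch in Python. The greedy nearest-neighbor matcher and the covariate values are invented for illustration (real matching would typically use a propensity or Mahalanobis distance over many covariates), but it shows the lever: tightening the caliper discards more observations and shrinks the matched sample.

```python
def caliper_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbor matching on a scalar covariate.
    Treated units with no unused control within the caliper are
    discarded, so the sample shrinks as the caliper tightens."""
    pool = list(control)
    pairs = []
    for t in treated:
        if not pool:
            break
        best = min(pool, key=lambda c: abs(c - t))
        if abs(best - t) <= caliper:
            pairs.append((t, best))
            pool.remove(best)
    return pairs

# Hypothetical covariate values for treated and control units.
treated = [0.10, 0.50, 0.90, 1.50]
control = [0.12, 0.53, 1.40, 2.00]

loose = caliper_match(treated, control, caliper=0.20)
tight = caliper_match(treated, control, caliper=0.05)
print(len(loose), len(tight))  # prints: 3 2
```

The defendant's expert, preferring an insignificant result, argues for the 0.05 caliper; the plaintiff's expert argues for 0.20. Both choices are defensible on their face, which is what makes the manipulation hard to detect.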

I still think we're way better off in this world than in the model-snooping of regression. What do people think?

Posted by James Greiner at 4:53 PM

6 March 2007

More Tools for Research

It's been a while since Jens and I summarized some useful tools for research. Since then, more productivity tools have appeared that make life easier for researchers. Some of the following may work only for Harvard affiliates, but your institution may offer something similar.

First, Harvard offers a table of contents service. After signing up, you can request to receive the table of contents of most journals that the Harvard Libraries carry. The handy part is a "Find it @ Harvard" button next to each article; clicking it takes you to the article through the library's account so that you have full access. This service also lets you manage all journal subscriptions through a single account. (It's best to have the service email you the TOC as an attachment, since in-text tables occasionally get cut off. Also, your spam filter might intercept those emails, so check there if you don't receive anything.)

Second, Harvard provides a new toolbar for the Firefox browser called LibX (see here). It provides quick links to Harvard's e-tools (citation index, e-resources, etc.), lets you search the Hollis catalog, and provides a drag-and-drop field for Google Scholar. If you're on a journal website without having gone through the Harvard libraries, LibX lets you reload the restricted page via Harvard to access the full-text sources. Another nice feature is that LibX embeds cues in webpages. For example, if you have the tool installed and are looking at a book on Amazon, you will notice a little Harvard shield on the page. Clicking it takes you straight to the book's entry in Hollis. LibX also provides automatic links to print and e-resources for ISBNs, DOIs, and other identifiers.

There are other useful tools for Firefox. I recently discovered the ScrapBook add-on, which essentially works like bookmarks but allows you to store only the part of a web page you're interested in. Simply select the part and store it in your scrapbook. You can then access it offline and also comment on or highlight it. You can sort and import/export items too. A further useful built-in feature is Firefox's search keywords, which let you access the search box on any website through a user-defined keyword. For example, you can define "gs" as the keyword for the search box on the Google Scholar website. Then entering "gs" and a search term in the location bar takes you straight to the search results for that term. If you use Google Scholar through your library, you'll even get full access to the articles straight away.

Posted by Sebastian Bauhoff at 7:07 PM

5 March 2007

Applied Statistics - Anna Mikusheva

This week, the Applied Statistics Workshop will present a talk by Anna Mikusheva, a Ph.D. candidate in the Economics Department at Harvard. Before joining the graduate program at Harvard, she received a Ph.D. in mathematics from Moscow State University. She will present a talk entitled "Uniform inferences in autoregressive processes." The paper is available from the workshop website. The presentation will be at noon on Wednesday, March 7 in Room N354, CGIS North, 1737 Cambridge St. As always, lunch will be provided. An abstract of the paper follows after the jump:

Anna Mikusheva


The purpose of this paper is to provide theoretical justification for some existing methods of constructing confidence intervals for the sum of coefficients in autoregressive models. We show that the methods of Stock (1991), Andrews (1993), and Hansen (1999) provide asymptotically valid confidence intervals, whereas the subsampling method of Romano and Wolf (2001) does not. In addition, we generalize the three valid methods to a larger class of statistics. We also clarify the difference between uniform and point-wise asymptotic approximations, and show that a point-wise convergence of coverage probabilities for all values of the parameter does not guarantee the validity of the confidence set.

Posted by Mike Kellermann at 4:17 PM