29 March 2011
I just happened across this paper on Steven Levitt's website entitled "What Does Performance in Graduate School Predict? Graduate Economics Education and Student Outcomes." In addition to learning all kinds of fascinating things about the economics profession and professionalization process, I was struck by the non-causal use of the word "effect" when discussing the results of their statistical (er, econometric) models.
I counted 8 uses in all, none of which were actually a believable effect of any kind. To wit:
When admissions rank is excluded from the model, the math GRE has a statistically significant effect on micro, macro, and metrics grades, and the verbal GRE has a statistically significant effect on macro and metrics grades. (page 514, second column).
It's pretty hard for me to believe that GRE scores actually affect grades apart from their effect on grad school admissions (which affects a student's ability to get grades at all). Clearly, they don't mean it causally, especially since they are dropping a post-treatment variable in and out of the model as they talk about it.
Ok, I get it that not everyone is as apoplectic (er, concerned) as I am about using the word "effect" to denote non-causal relationships. I realize that the casual (non-causal) lingo of "effects" can just be short-hand among people who know better. And sure, it's just kind of a fun paper. But I would have expected this particular group of economists to have the catechisms of causal inference so well memorized that writing this kind of sentence would give them hives.
28 March 2011
We hope that you can join us for the Applied Statistics Workshop this Wednesday, March 30th, 2011, when we will be happy to have Alisdair McKay from the Department of Economics at Boston University. You will find an abstract for the paper below. As always, we will serve a light lunch and the talk will begin around 12:15p.
"Rational Inattention to Discrete Choices: A New Foundation for the Multinomial Logit Model"
Department of Economics, Boston University
CGIS K354 (1737 Cambridge St.)
Wednesday, March 30th, 2011, 12 noon
Often, individuals must choose among discrete alternatives with imperfect information about their values, such as selecting a job candidate, a vehicle or a university. Before choosing, they may have an opportunity to study the options, but doing so is costly. This costly information acquisition creates new choices such as the number of and types of questions to ask the job candidates. We model these situations using the tools of the rational inattention approach to information frictions (Sims, 2003). We find that the decision maker's optimal strategy results in choosing probabilistically exactly in line with the multinomial logit model. This provides a new interpretation for a workhorse model of discrete choice theory. We also study cases for which the multinomial logit is not applicable, in particular when two options are duplicates. In such cases, our model generates a generalization of the logit formula, which is free of the limitations of the standard logit.
21 March 2011
We hope you can join us at the Applied Statistics Workshop this Wednesday, March 23rd, when we are excited to have Eric Chaney from the Department of Economics here at Harvard. Eric will be presenting his paper entitled “Revolt on the Nile: Economic Shocks, Religion and Political Influence.” You’ll find an abstract below. As usual, we will begin at 12 noon with a light lunch and wrap up by 1:30pm.
“Revolt on the Nile: Economic Shocks, Religion and Political Influence”
Department of Economics, Harvard University
Wednesday, March 23rd, 12 noon
CGIS Knafel 354 (1737 Cambridge St)
Can religious leaders use their popular influence to political ends? This paper explores this question using over 700 years of Nile flood data. Results show that deviant Nile floods were related to significant decreases in the probability of change of the highest-ranking religious authority. Qualitative evidence suggests this decrease reflects an increase in political power stemming from famine-induced surges in the religious authority’s control over popular support. Additional empirical results support this interpretation by linking the observed probability decrease to the number of individuals a religious authority could influence. The paper concludes that the results provide empirical support for theories suggesting religion as a determinant of institutional outcomes.
7 March 2011
A well-known social scientist once confessed to me that, after decades of doing social research, he still couldn't remember the difference between Type I and Type II errors. Since I suspect that many others share this problem, I thought I would pass along a mnemonic I learned from a statistics professor. Recall that a Type I error occurs when the null hypothesis is rejected when it is in fact true, while a Type II error occurs when a null hypothesis is not rejected when it is actually false. Many people, of course, find this distinction difficult to remember.
So here's the mnemonic: first, a Type I error can be viewed as a "false alarm" and a Type II error as a "missed detection"; second, note that the phrase "false alarm" has fewer letters than "missed detection," and analogously the numeral 1 (for Type I error) is smaller than 2 (for Type II error). Since learning this mnemonic, I have not forgotten the difference between Type I and Type II errors!
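The "false alarm" versus "missed detection" framing is easy to check by simulation. Here is a small sketch (my own illustration, in Python rather than R; the sample size, effect size, and cutoff are arbitrary choices, not anything from the post): we run many one-sided z-tests, first with the null true to count false alarms (Type I), then with the null false to count missed detections (Type II).

```python
import random
import statistics

random.seed(0)

def one_sided_z_reject(sample, mu0=0.0, sigma=1.0, z_crit=1.645):
    """Reject H0: mu = mu0 when the standardized sample mean exceeds z_crit
    (z_crit = 1.645 corresponds to a one-sided test at alpha = 0.05)."""
    n = len(sample)
    z = (statistics.fmean(sample) - mu0) / (sigma / n ** 0.5)
    return z > z_crit

n, trials = 25, 4000

# Type I error ("false alarm"): H0 is true (mu = 0) but we reject it anyway.
false_alarms = sum(
    one_sided_z_reject([random.gauss(0, 1) for _ in range(n)])
    for _ in range(trials)
)

# Type II error ("missed detection"): H0 is false (mu = 0.5) but we fail to reject.
misses = sum(
    not one_sided_z_reject([random.gauss(0.5, 1) for _ in range(n)])
    for _ in range(trials)
)

print(f"Type I rate (false alarm):        {false_alarms / trials:.3f}")
print(f"Type II rate (missed detection):  {misses / trials:.3f}")
```

The estimated Type I rate should hover near the nominal alpha of 0.05, while the Type II rate depends on the true effect size and sample size.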
We hope you can join us at the Applied Statistics Workshop this Wednesday, March 9th, when we are excited to have Don Rubin, the John L. Loeb Professor of Statistics here at Harvard University, who will be presenting recent work on job-training programs. You will find an abstract below. As usual, we will begin with a light lunch at 12 noon, with the presentation starting at 12:15p and wrapping up by 1:30p.
“Are Job-Training Programs Effective?”
John L. Loeb Professor of Statistics, Harvard University
Wednesday, March 9th, 12:00pm - 1:30pm
CGIS Knafel K354 (1737 Cambridge St)
In recent years, job-training programs have become more important in many developed countries with rising unemployment. It is widely accepted that the best way to evaluate such programs is to conduct randomized experiments. With these, among a group of people who indicate that they want job-training, some are randomly assigned to be offered the training and the others are denied such offers, at least initially. Then, according to a well-defined protocol, outcomes, such as employment statuses or wages for those who are employed, are measured for those who were offered the training and compared to the same outcomes for those who were not offered the training. Despite the high cost of these experiments, their results can be difficult to interpret because of inevitable complications when doing experiments with humans. In particular, some people do not comply with their assigned treatment, others drop out of the experiment before outcomes can be measured, and others who stay in the experiment are not employed, and thus their wages are not cleanly defined. Statistical analyses of such data can lead to important policy decisions, and yet the analyses typically deal with only one or two of these complications, which may obfuscate subtle effects. An analysis that simultaneously deals with all three complications generally provides more accurate conclusions, which may affect policy decisions. A specific example will be used to illustrate essential ideas that need to be considered when examining such data. Mathematical details will not be pursued.
The following links point to a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms.
These include classification algorithms such as decision trees, neural nets, Bayesian classifiers, Support Vector Machines and case-based (aka non-parametric) learning. They include regression algorithms such as multivariate polynomial regression, MARS, Locally Weighted Regression, GMDH and neural nets. And they include other data mining operations such as clustering (mixture models, k-means and hierarchical), Bayesian networks and Reinforcement Learning.
There is a bit of modesty in that description. The slides that I have looked at do a great job of motivating the methods using intuition, which is often sorely lacking in such materials.
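To give a flavor of one of the algorithms covered in the tutorials, here is a minimal k-means sketch. To be clear, this toy implementation is my own illustration (in Python), not code from the tutorials: it just alternates the two steps the slides motivate, assigning each point to its nearest centroid and then recomputing each centroid as the mean of its cluster.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # update step: each centroid becomes the coordinate-wise mean of its cluster
        # (an empty cluster keeps its old centroid)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# two well-separated blobs of points
pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
       (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
print(sorted(kmeans(pts, 2)))
```

On data like this, the two recovered centroids land near the centers of the two blobs; the tutorials' treatment of mixture models generalizes this hard assignment to a probabilistic one.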
2 March 2011
I've been waiting for this kind of R book for a while. Packt Publishing, which releases technical and information technology books, has just published The R Graph Cookbook. The premise is simple: there is a need for a book that clearly presents "recipes" for R graphs in one comprehensive volume. Indeed, many researchers switch to R (from Stata or SAS) in part because of the enormous flexibility and power of R in creating graphs.
This book is perhaps most useful for beginners, but even experienced R users should find the clarity of the presentation and discussion of advanced graphics informative. In particular, I found the presentation of how to create heatmaps and geographic maps useful. I'll certainly use these examples when teaching data visualization. Another enormous benefit of the book is that the author has released all the R code used to create the graphs. You can download the R code here.
I have two quibbles, however. First, while the use of color in the graphs is pretty, I would've liked more examples with black-and-white templates. Color graphs may well become the norm many decades from now (when most research might conceivably be published exclusively online), but currently most research appears in journals that do not print color. Second, like nearly all books I've seen on graphics using statistical packages, the author doesn't present graphics for regression coefficients and cross-tabs. (For information on graphing these, I recommend the excellent article on using graphs instead of tables, published in Perspectives on Politics.) Nonetheless, these are minor issues, and most R users, regardless of skill level, should find this book very useful for teaching and reference.
1 March 2011
We hope that you can join us for the Applied Statistics Workshop tomorrow, March 2nd, when we will be happy to have Jean-Baptiste Michel (Postdoctoral Fellow, Department of Psychology) and Erez Lieberman Aiden (Harvard Society of Fellows). You will find an abstract below. As always, we will serve a light lunch and the talk will begin around 12:15p.
“Quantitative Analysis of Culture Using Millions of Digitized Books”
Jean-Baptiste Michel and Erez Lieberman Aiden
CGIS K354 (1737 Cambridge St.)
Wednesday, March 2nd, 12 noon
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
I’ve been a fairly long-time emacs/ESS user, but there’s a new IDE for R called RStudio that has a lot of potential. At the very least, it is a huge improvement over the standard R GUI (on both the Mac and Windows). Strangely, though, some emacs-like commands (such as C-k, which cuts everything from the cursor to the end of the line) are available in the Console but not in the source editor. The organization of the figures, help, workspace, and history panes is just great, though.