November 15, 2007
From Andrew Gelman, I saw a link to an interesting "art exhibit" that's actually all about statistics and language. In some ways it reminded me of this other art exhibit that's actually all about statistics -- in this case, the meaning of some of the very large numbers we read about all the time, but find difficult to grasp on an intuitive level.
Both are worth checking out online. And if you live somewhere that you can visit either, lucky you!
October 31, 2007
There's an interesting article at Salon today about racial perception. As is normally the case for scientific articles reported in the mainstream media, I have mixed feelings about it.
1) First, a pet peeve: just because something can be localized in the brain using fMRI or similar techniques does not mean it's innate. This drives me craaazy. Everything that we conceptualize or do is represented in the brain somehow (unless you're a dualist, and that has its own major logical flaws). For instance, trained musicians devote more of their auditory processing regions to listening to piano music, and have a larger auditory cortex and larger areas devoted to motor control of the fingers used to play their instrument. [cite]. This is (naturally, reasonably) not interpreted as meaning that playing an instrument is innate, but that the brain can "tune itself" as it learns. [These differences are linked to amount of musical training, and are larger the younger the training began, which all supports such an interpretation]. The point is, localization in the brain != innateness. Aarrgh.
2) The article talks about what agent-based modeling has shown us, which is interesting:
Using this technique, University of Michigan political scientist Robert Axelrod and his colleague Ross Hammond of the Brookings Institution in Washington, D.C., have studied how ethnocentric behavior may have evolved even in the absence of any initial bias or prejudice. To make the model as simple as possible, they made each agent one of four possible colors. None of the colors was given any positive or negative ranking with respect to the other colors; in the beginning, all colors were created equal. The agents were then provided with instructions (simple algorithms) as to possible ways to respond when encountering another agent. One algorithm specified whether or not the agent cooperated when meeting someone of its own color. The other algorithm specified whether or not the agent cooperated with agents of a different color.
The scientists defined an ethnocentric strategy as one in which an agent cooperated only with other agents of its own color, and not with agents of other colors. The other strategies were to cooperate with everyone, cooperate with no one and cooperate only with agents of a different color. Since only one of the four possible strategies is ethnocentric and all were equally likely, random interactions would result in a 25 percent rate of ethnocentric behavior. Yet their studies consistently demonstrated that greater than three-fourths of the agents eventually adopted an ethnocentric strategy. In short, although the agents weren't programmed to have any initial bias for or against any color, they gradually evolved an ethnocentric preference for one's own color at the expense of those of another color.
Axelrod and Hammond don't claim that their studies duplicate the real-world complexities of prejudice and discrimination. But it is hard to ignore that an initially meaningless trait morphed into a trigger for group bias. Contrary to how most of us see bigotry and prejudice as arising out of faulty education and early-childhood indoctrination, Axelrod's model doesn't begin with preconceived notions about the relative values of different colors, nor is it associated with any underlying negative emotional state such as envy, frustration or animosity. Detection of a difference, no matter how innocent, is enough to result in ethnocentric strategies.
As I understand it, the general reason these experiments work the way they do is that the other strategies do worse given the dynamics of the game (single-interaction Prisoner's Dilemma): (a) cooperating with everyone leaves one open to being "suckered" by more people; (b) cooperating with nobody leaves one open to being hurt disproportionately by never getting the benefits of cooperation; and (c) cooperating with different colors is less likely to lead to a stable state.
Why is this last observation -- the critical one -- true? Let's say we have a red, orange, and yellow agent sitting next to each other, and all of them decide to cooperate with a different color. This is good, and leads to an increased probability of all of them being able to reproduce, and the next generation has two red, two yellow, and two orange agents. Now the problem is apparent: each of the agents is now next to an agent (i.e., the other one of its own color) that it is not going to cooperate with, which will hurt its chances of being able to survive and reproduce. By contrast, subsequent generations of agents that favor their own color won't have this problem. And in fact, if you remove "local reproduction" -- if an agent's children aren't likely to end up next to it -- then you don't get the rise of ethnocentrism... but you don't get much cooperation, either. (Again, this is sensible: the key is for agents to be able to essentially adapt to local conditions in such a way that they can rely on the other agents close to them, and they can't do that if reproduction isn't local). I would imagine that if one's cooperation strategy didn't tend to resemble the cooperation strategy of one's parents, you wouldn't see ethnocentrism (or much cooperation) either.
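To make the dynamics concrete, here's a minimal sketch in Python of this kind of model. This is not Axelrod and Hammond's actual code -- the payoff, mutation, and death numbers are borrowed from their published description, but the grid size, run length, and all implementation details are my own simplification, so take the exact percentages with a grain of salt. Each agent has a color and two strategy bits (help own color? help other colors?), plays one-shot helping games with its grid neighbors, and reproduces locally.

```python
import random

random.seed(0)  # deterministic for the demo

SIZE, COLORS = 20, 4          # torus grid side length, number of "colors"
COST, BENEFIT = 0.01, 0.03    # helping costs the giver, pays the receiver
BASE_PTR, MUT, DEATH = 0.12, 0.005, 0.10  # reproduction, mutation, death rates

def neighbors(x, y):
    # von Neumann neighborhood on a torus
    return [((x + dx) % SIZE, (y + dy) % SIZE)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]

def random_agent():
    # (color, cooperate-with-same-color?, cooperate-with-other-color?)
    return (random.randrange(COLORS), random.random() < 0.5, random.random() < 0.5)

def step(grid):
    # 1. immigration: one random agent lands on an empty cell
    empty = [c for c, a in grid.items() if a is None]
    if empty:
        grid[random.choice(empty)] = random_agent()
    # 2. interaction: each agent decides whether to help each neighbor,
    #    adjusting each agent's potential to reproduce (PTR)
    ptr = {c: BASE_PTR for c, a in grid.items() if a is not None}
    for cell in ptr:
        agent = grid[cell]
        for n in neighbors(*cell):
            other = grid[n]
            if other is None:
                continue
            same = agent[0] == other[0]
            if (same and agent[1]) or (not same and agent[2]):
                ptr[cell] -= COST
                ptr[n] += BENEFIT
    # 3. reproduction: clone into a random empty neighbor, with mutation
    for cell in random.sample(list(ptr), len(ptr)):
        if random.random() < ptr[cell]:
            open_cells = [n for n in neighbors(*cell) if grid[n] is None]
            if open_cells:
                child = [trait if random.random() > MUT else fresh
                         for trait, fresh in zip(grid[cell], random_agent())]
                grid[random.choice(open_cells)] = tuple(child)
    # 4. death
    for cell, agent in list(grid.items()):
        if agent is not None and random.random() < DEATH:
            grid[cell] = None

grid = {(x, y): None for x in range(SIZE) for y in range(SIZE)}
for _ in range(500):
    step(grid)

agents = [a for a in grid.values() if a is not None]
ethno = sum(1 for a in agents if a[1] and not a[2])
print(f"ethnocentric agents: {ethno} of {len(agents)}")
```

The key ingredient is step 3: children appear next to their parents with (mostly) the parent's strategy, which is exactly the "local reproduction" discussed above. Remove that locality and, as the authors report, the ethnocentric clustering largely disappears.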
3) One thing the article didn't talk about, but I think is very important, is how much racial perception may have to do with our strategies of categorization in general. There's a rich literature studying categorization, and one of the basic findings is of boundary sharpening and within-category blurring. (Rob Goldstone has been doing lots of interesting work in this area, for instance). Boundary sharpening refers to the tendency, once you've categorized X and Y as different things, to exaggerate their differences: if the categories containing X and Y are defined by differences in size, you would perceive the size difference between X and Y to be greater than it actually is. Within-category blurring refers to the opposite effect: the tendency to minimize the differences of objects within the same category -- so you might see two X's as being closer in size than they really are. This is a sensible strategy, since the more you do so, the better you'll be able to correctly categorize the boundary cases. However, it results in something that looks very much like stereotyping.
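A toy numerical illustration of these two effects (my own made-up model, not anything from the literature): suppose perception pulls each stimulus partway toward the prototype of whatever category it falls into. Then the same objective difference shrinks inside a category and grows across the boundary.

```python
def perceived(size, prototypes, w=0.4):
    """Toy model: perception is a weighted blend of the actual stimulus
    and the prototype of its nearest category."""
    proto = min(prototypes, key=lambda p: abs(p - size))
    return (1 - w) * size + w * proto

protos = [10.0, 20.0]   # two size categories; the boundary falls at 15

# two items in the same category: a 2-unit difference shrinks...
a, b = perceived(12, protos), perceived(14, protos)
print(round(b - a, 2))   # 1.2 -- within-category blurring

# ...while the same 2-unit difference straddling the boundary grows
x, y = perceived(14, protos), perceived(16, protos)
print(round(y - x, 2))   # 5.2 -- boundary sharpening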
Research along these lines is just beginning, and it's too early to go from this observation to conclude that part of the reason for stereotyping is that it emerges from the way we categorize things, but I think it's a possibility. (There also might be an interaction with the cognitive capacity of the learning agent, or its preference for a "simpler" explanation -- the more the agent can't remember subtle distinctions, and the more the agent favors an underlying categorization with few groups or few subtleties between or within groups, the more these effects occur).
All of which doesn't mean, of course, that stereotyping or different in-group/out-group responses are justified or rational in today's situations and contexts. But figuring out why we think this way is a good way to start to understand how to avoid thinking this way when we need to.
[*] Axelrod and Hammond's paper can be found here.
October 4, 2007
On Tuesday I went to a talk by Terrence Fine from Cornell University. It was one of those talks that's worth going to, if nothing else because it makes you re-visit and re-question the sort of basic assumptions that are so easy to not even notice that you're making. In this case, that basic assumption was that the mathematics of probability theory, which views probability as a real number between 0 and 1, is equally applicable to any domain where we want to reason about statistics.
Is this a sensible assumption?
As I understand it, Fine made the point that in many applied fields, what you do is start from the phenomenon to be modeled and then use the mathematical/modeling framework that is appropriate to it. In other words, you go from the applied "meaning" to the framework: e.g., if you're modeling dynamical systems, then you decide to use differential equations. What's odd in applications of probability theory, he said, is that you basically go from the mathematical theory to the meaning: we interpret the same underlying math as having different potential meanings, depending on the application and the domain.
He discussed four different applications, which are typically interpreted in different ways: physically-determined probability (e.g., statistical mechanics or quantum mechanics); frequentist probability (i.e., more data driven); subjective probability (in which probability is interpreted as degree of belief); and epistemic/logical (in which probability is used to characterize inductive reasoning in a formal language). Though I broadly agree with these distinctions, I confess I'm not getting the exact subtleties he must be making: for instance, it seems to me the interpretation of probability in statistical mechanics is arguably very different from that in quantum mechanics, and the two should therefore not be lumped together: in statistical mechanics, the statistics of flow arise from some underlying variables (i.e., the movements of individual particles), while in quantum mechanics, as I understand it, there aren't any "hidden variables" determining the probabilities at all.
But that technicality aside, the main point he made is that depending on the interpretation of probability and the application we are using it for, our standard mathematical framework -- in which we reason about probabilities using real numbers between 0 and 1 -- may be inappropriate because it is either more or less expressive than necessary. For instance, in the domain of (say) IQ, numerical probability is probably too expressive -- it is not sensible or meaningful to divide IQs by each other; all we really want is an ordering (and maybe even a partial ordering, if, as seems likely, the precision of an IQ test is low enough that small distinctions aren't meaningful). So a mathematics of probability which views it in that way, Fine argues, would be more appropriate than the standard "numerical" view.
Another example would be in quantum mechanics, where some of the classical axioms of probability actually break down. For instance, the distributive law relating conjunction and disjunction fails for quantum events, and with it the familiar rule P(A or B) = P(A) + P(B) - P(A and B). This is an obvious place where one would want to use a different mathematical framework, but since (as far as I know) people in quantum mechanics actually do use such a framework, I'm not sure what his point was. Other than it's a good example of the overall moral, I guess?
Anyway, the talk was interesting and thought-provoking, and I think it's a good idea to keep this point in the back of one's mind. That said, although I can see why he's arguing that different underlying mathematics might be more appropriate in some cases, I'm not convinced yet that we can conclude that using a different underlying mathematics (in the case of IQ, say) would therefore lead to new insight or help us avoid misconceptions. One of the reasons numerical probability is used so widely -- in addition to whatever historical entrenchment there is -- is that it is an indispensable tool for doing inference, reasoning about distributions, etc. It seems like replacing it with a different sort of underlying math might result in losing some of these tools (or, at the very least, require us to spend decades re-inventing new ones).
Of course, other mathematical approaches might be worth it, but at this point I don't know how well-worked out they are, and -- speaking as someone interested in the applications -- I don't know if they'd be worth the work required to find out. (They might be; I just don't know... and, of course, a pure mathematician wouldn't care about this concern, which is all to the good). Fine gave a quick sketch of some of these alternative approaches, and I got the sense that he was working on developing them but they weren't that well developed yet -- but I could be totally wrong. If anyone knows any better, or knows of good references on this sort of thing, please let us know in comments. I couldn't find anything obvious on his web page.
I really really do not want to get into a debate about whether and to what extent IQ in general is meaningful -- that question is really tangential to the point of this post, and I use IQ as illustration only. (I use it rather than something perhaps less inflammatory because it's the example Fine used).
May 9, 2007
This may not be new to anybody but me, but recent news at UNC brought the so-called "Achievement Index" to my attention. The Achievement Index is a way of calculating GPA that takes into account not only how well one performs in a class, but also how hard the class is relative to others in the institution. It was first suggested by Valen Johnson, a professor of statistics at Duke University, in a paper in Statistical Science titled "An Alternative to Traditional GPA for Evaluating Student Performance." (The paper is available on his website; you can also find a more accessible pdf description here).
This seems like a great idea to me. The model, which is Bayesian, calculates "achievement index" scores for each student as latent variables that best explain the grade cutoffs for each class in the university. As a result, it captures several phenomena: (a) if a class is hard and full of very good students, then a high grade is more indicative of ability (and a low grade less indicative of lack of ability); (b) if a class is easy and full of poor students, then a high grade doesn't mean much; (c) if a certain instructor always gives As then the grade isn't that meaningful -- though it's more meaningful if the only people who take the class in the first place are the extremely bright, hard-working students. Your "achievement index" score thus reflects your actual grades as well as the difficulty level of the classes you have chosen.
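Johnson's actual model is a Bayesian ordinal model fit with MCMC -- far too much machinery for a blog post -- but the core intuition, that student ability and class difficulty have to be estimated jointly in terms of each other, can be sketched with a crude alternating-update approximation. To be clear: this is NOT Johnson's estimator, and all the names and grades below are made up for illustration.

```python
def achievement_index(grades, iters=200):
    """Crude sketch of a Johnson-style adjustment: alternately estimate
    each class's difficulty (how far classmates' estimated abilities exceed
    their raw grades there) and each student's ability (the average of
    their difficulty-adjusted grades)."""
    students = list(grades)
    classes = {c for taken in grades.values() for c in taken}
    # pin the scale to the overall mean grade so results read like a GPA
    target = (sum(g for taken in grades.values() for g in taken.values())
              / sum(len(taken) for taken in grades.values()))
    ability = {s: target for s in students}
    for _ in range(iters):
        difficulty = {}
        for c in classes:
            pairs = [(ability[s], grades[s][c]) for s in students if c in grades[s]]
            difficulty[c] = sum(a - g for a, g in pairs) / len(pairs)
        for s in students:
            taken = grades[s]
            ability[s] = sum(g + difficulty[c] for c, g in taken.items()) / len(taken)
        shift = target - sum(ability.values()) / len(ability)
        for s in students:
            ability[s] += shift
    return ability

# made-up data: bob and carol take both classes, and their grades reveal
# that hard301 grades run lower than easy101 grades for the same students
grades = {
    "alice": {"hard301": 3.5},
    "dan":   {"easy101": 3.5},
    "bob":   {"hard301": 3.8, "easy101": 4.0},
    "carol": {"hard301": 3.9, "easy101": 4.0},
}
ai = achievement_index(grades)
# alice's 3.5 in the hard class outranks dan's identical 3.5 in the easy one
print(ai["alice"] > ai["dan"])   # True
```

Notice that alice and dan have identical raw GPAs, and the model separates them only because the students they overlap with (bob and carol) did worse in hard301 than in easy101. That cross-linking through shared classes is what makes the whole-university joint estimate work.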
Why isn't this a standard measure of student performance? 10 years ago it was proposed at Duke but failed to pass, and at UNC they are currently debating it -- but what about other universities? The Achievement Index addresses multiple problems. There would be less pressure toward grade inflation, for one thing. For another, it would address the unfortunate tendency of students to avoid "hard" classes for fear of hurting their GPA. Students in hard majors or taking hard classes also wouldn't be penalized in university-wide, GPA-based awards.
One might argue that students shouldn't avoid hard classes simply because of their potential grade, and I tend to agree that they shouldn't -- it was a glorious moment in my own college career when I finally said "to heck with it" and took the classes that interested me, even if they seemed really hard. But it's not necessarily irrational for a student to care about GPA, especially if important things -- many of which I didn't have to worry about -- hinge on it: things like scholarships or admission to medical school. Similarly, instructors shouldn't inflate grades and create easy classes, but it is often strictly "rational" to do so: giving higher grades can often mean better evaluations and less stress due to students whinging for a higher grade, and easier classes are also easier to teach. Why not try to create a system where the rational thing to do within that system is also the one that's beneficial for the university and the student in the long run? It seems like the only ones who benefit from the current system are the teachers who inflate their grades and teach "gimme" courses and the students who take those easy courses. The ones who pay are the teachers who really seek to challenge and teach their students, and the students who want to learn, who are intellectually curious and daring enough to take courses that challenge them. Shouldn't the incentive structure be the opposite?
I found a petition against the Achievement Index online, and I'm not very persuaded by their arguments. One problem they have is that it's not transparent how it works, which I could possibly see being a concern... but there are two kinds of transparency, and I think only one really matters. If it's not transparent because it's biased or subjective, then that's bad; but if it's not transparent simply because it's complicated (as this is), yet is in fact totally objective and its workings are published -- then, well, it's much less problematic. Sometimes complicated is better. And other things that matter a great deal for our academic success -- such as SATs and GREs -- aren't all that transparent either, and they are still very valuable. The petition also argues that using the AI system will make students more competitive with each other, but I confess I don't understand this argument at all: how will it increase competition above and beyond the standard GPA?
Anyway, it might seem like I'm being fairly dogmatic about the greatness of the Achievement Index, but I don't intend to be. I have no particular bone to pick, and I got interested in this issue originally mainly just because I wanted to understand the model. It's simply that I don't really see any true disadvantages and I wonder what I'm missing. Why don't more universities try to implement it? Can anyone enlighten me?
April 11, 2007
I've posted before about the various ways that the mass media of today interacts badly with cognitive heuristics people use, in such a way as to create apparently irrational behavior. Spending a fair amount of time recently standing in long security lines at airports crystallized another one to me.
The availability heuristic describes people's tendency to judge that events that are really emotionally salient or memorable are more probable than events that aren't, even if the ones that aren't are actually statistically more likely. One classic place you see this is in estimates of risk of dying in a terrorist attack: even though the odds are exceedingly low of dying this way (if you live in most countries, at least), we tend to spend far more resources, proportionally, fighting terror than in dealing with more prosaic dangers like automobile accidents or poverty. There might be other valid reasons to spend disproportionately -- e.g., terrorism is part of a web of other foreign-policy issues that we need to focus on for more long-term benefits; or people don't want to sacrifice the freedoms that would be necessary (like more restrictive speed limits) to make cars safer; or it's not very clear how to solve some problems (like poverty) -- and I really don't want to get into those debates -- the point is just that I think most everyone would agree that in all of those cases, at least part of the reason for the disproportionate attention is because dying in a terrorist attack is much more vivid and sensational than dying an early death because of the accumulated woes of living in poverty. And there's plenty of actual research showing that the availability heuristic plays a role in many aspects of prediction.
There's been a lot of debate about whether this heuristic is necessarily irrational. Evolutionarily speaking, it might make a lot of sense to pay more attention to the more salient information. To steal an example from Gerd Gigerenzer, if you live on the banks of a river and for 1000 days there have been no crocodile sightings there, but yesterday there was, you'd be well-advised to disregard the "overall statistics" and keep your kids from playing near the river today. It's a bit of a just-so story, but a sensible one, from which we might infer two possible morals: (a) as Steven Pinker pointed out, since events have causal structure, it might make sense to pay more attention to more recent ones (which tend to be more salient); and (b) it also might make sense to pay more attention to emotionally vivid ones, which give a good indication of the "costs" of being wrong.
However, I think the problem is that when we're talking about information that comes from mass media, neither of these reasons applies as well. Why? Well, if your information doesn't come from mass media, to a good approximation you can assume that the events are statistically representative of the events that you might be likely to encounter. If you get your information from mass media, you cannot assume this. Mass media reports on events from all over the world in such a way that they can have the same vividness and impact as if they were in the next town over. And while it might be rational to worry a lot about crime if you consistently have shootings in your neighborhood, it doesn't make as much sense to worry about it if there are multiple shootings in cities hundreds of miles away. Similarly, because mass media reports on news -- i.e., statistically rare occurrences -- it is easy to get the dual impression that (a) rare events are less rare than they actually are; and (b) that there is a "recent trend" that needs to be paid attention to.
In other words, while it might be rational to keep your kids in if there were crocodile attacks at the nearby river yesterday, it's pretty irrational to keep them in if there were attacks at the river a hundred miles away. Our "thinking" brains know this, but if we see those attacks as rapidly and as vividly as if they were right here -- i.e., if we watch them on the nightly news -- then it's very hard to listen to the thinking brain... even if you know about the dangers. And cable TV news, with its constant repetition, makes this even harder.
The source of the problem is due to the sampling structure of mass media, but it's of course far worse if the medium makes the message more emotional and vivid. So there's probably much less of a problem if you get most of your news from written sources -- especially multiple different ones -- than TV news. That's what I would guess, at least, though I don't know if anyone has actually done the research.
March 28, 2007
This post started off as little more than some amusing wordplay brought on by the truism that "the plural of anecdote is not data". It's a sensible admonition -- you can't just exchange anecdotes and feel like that's the equivalent of actual scientific data -- but, like many truisms, it's not necessarily true. After all, the singular of data is anecdote: every individual datapoint in a scientific study constitutes an anecdote (though admittedly probably a quite boring one, depending on the nature of your study). A better truism would therefore be more like "the plural of anecdote is probably not data", which of course isn't nearly as catchy.
The post started that way, but then I got to thinking about it more and I realized that the attitude embodied by "the plural of anecdote is not data" -- while a necessary corrective in our culture, where people far more often go too far in the other direction -- isn't very useful, either.
A very important caveat first: I think it's an admirable goal -- definitely for scientists in their professional lives, but also for everyone in our personal lives -- to as far as possible try to make choices and draw conclusions informed not by personal anecdote(s) but rather by what "the data" shows. Anecdote is notoriously unreliable; it's distorted by context and memory; because it's emotionally fraught it's all too easy to weight anecdotes that resound with our experience more highly and discount those that don't; and, of course, the process of anecdote collection is hardly systematic or representative. For all of those reasons, it's my natural temptation to distrust "reasoning by anecdote", and I think that's a very good suspicion to hone.
But... but. It would be too easy to conclude that anecdotes should be discounted entirely, or that there is no difference between anecdotes of different sorts, and that's not the case. The main thing that turns an anecdote into data is the sampling process: if attention is paid to ensuring not only that the source of the data is representative, but also that the process of data collection hasn't greatly skewed the results in some way, then it is more like data than anecdote. (There are other criteria, of course, but I think that's a main one).
That means, though, that some anecdotes are better than others. One person's anecdote about an incredibly rare situation should properly be discounted more than 1000 anecdotes from people drawn from an array of backgrounds (unless, of course, one wants to learn about that very rare situation); likewise, a collection of stories taken from the comments of a highly partisan blog where disagreement is immediately deleted -- even if there are 1000 of them -- should be discounted more than, say, a focus group of 100 people carefully chosen to be representative, led by a trained moderator.
I feel like I'm sort of belaboring the obvious, but I think it's also easy for "the obvious" to be forgotten (or ignored, or discounted) if its opposite is repeated enough.
Also, I think the tension between the "focus on data only" philosophy on one hand, and "be informed by anecdote" philosophy on the other, is a deep and interesting one: in my opinion, it is one of the main meta-issues in cognitive science, and of course comes up all the time in other areas (politics and policy, personal decision-making, stereotyping, etc). The main reason it's an issue, of course, is that we don't have data about most things -- either because the question simply hasn't been studied scientifically, or because it has but in an effort to "be scientific" the sample has been restricted enough that it's hard to know how well one can generalize beyond it. For a long time most studies in medicine used white men only as subjects; what then should one infer regarding women, or other genders? One is caught between the Scylla of using possibly inappropriate data, and the Charybdis of not using any data at all. Of course in the long term one should go out and get more data, but life can't wait for "the long term." Furthermore, if one is going to be absolutely insistent on a rigid reliance on appropriate data, there is the reductive problem that, strictly speaking, a dataset never allows you to logically draw a conclusion about anything other than itself. Unless it is the entire population, it will always be different from the population; the real question comes in deciding whether it is too different -- and as far as I can tell, aside from a few simple metrics, that decision is at least as much art as science (and is itself made partly on the basis of anecdote).
Another example, one I'm intimately familiar with, is the constant tension in psychology between ecological and external validity on the one hand, and proper scientific methodology on the other. Too often, increasing one means sacrificing the other: if you're interested in categorization, for instance, you can try to control for every possible factor by limiting your subjects to undergrad students in the same major, testing everyone in the same blank room at the same time of day, creating stimuli consisting of geometric figures with a clear number of equally-salient features, randomizing the order of presentation, etc. You can't be completely sure you've removed all possible confounds, but you've done a pretty good job. The problem is that what you're studying is now so unlike the categorization we do every day -- which is flexible, context-sensitive, influenced by many factors of the situation and ourselves, and about things that are not anything like abstract geometric pictures (unless you work in a modern art museum, I suppose) -- that it's hard to know how it applies. Every cognitive scientist I know is aware of this tension, and in my opinion the best science occurs right on the tightrope - not at the extremes.
That's why I think it's worth pointing out why the extreme -- even the extreme I tend to err on -- is best avoided, even if it seems obvious.
March 14, 2007
One of the interesting things about accruing more experience in a field is that as you do so, you find yourself called upon to be a peer reviewer more and more often (as I'm discovering). But because I've never been an editor, I've often wondered what this process looks like from that perspective: how do you pick reviewers? And what kind of people tend to be the best reviewers?
A recent article in the (open-access) journal PLoS Medicine speaks to these questions. Even though it's in medicine, I found the results somewhat interesting for what they might imply or predict about other fields as well.
In a nutshell, this study looked at 306 reviewers from the journal Annals of Emergency Medicine. Each of the 2,856 reviews (of 1,484 separate manuscripts) had been rated by the editors of the journal on a five-point scale (1=worst, 5=best). The study simply tried to identify what characteristics of the reviewers could be used to predict the effectiveness of the review. The basic finding?
Multivariable analysis revealed that most variables, including academic rank, formal training in critical appraisal or statistics, or status as principal investigator of a grant, failed to predict performance of higher-quality reviews. The only significant predictors of quality were working in a university-operated hospital versus other teaching environment and relative youth (under ten years of experience after finishing training). Being on an editorial board and doing formal grant (study section) review were each predictors for only one of our two comparisons. However, the predictive power of all variables was weak.
The details of the study are somewhat helpful for interpreting these results. When I first read that younger was better, I wondered to what extent this might simply be because younger people have more time. After looking at the details, I think this interpretation, while possible, is doubtful: the youngest cohort were defined as those that had less than ten years of experience after finishing training, not those who were largely still in grad school. I'd guess that most of those were on the tenure-track, or at least still in the beginnings of their career. This is when it's probably most important to do many many things and be extremely busy: so I doubt those people have more time. Arguably, they might just be more motivated to do well precisely because they are still young and trying to make a name for themselves -- though I don't know how big of a factor it would be given the anonymity of the process: the only people you're impressing with a good review are the editors of the journals.
All in all, I'm not actually that surprised that "goodness of review" isn't correlated with things such as academic rank, training in statistics, or being a good PI: not that those things don't matter, but my guess would be that nearly everyone who's a potential reviewer (for what is, I gather, a fairly prestigious journal) would have sufficient intelligence and training to be able to do a good review. If that's the case, then the best predictors of reviewing quality would come down to more ineffable traits like general conscientiousness and motivation to do a good review... This interpretation, if true, implies that a good way to generate better reviews is not to just choose big names, but rather to make sure people are motivated to put the time and effort into those reviews. Unfortunately, given that peer review is largely uncredited and gloryless, it's difficult to see how best to motivate them.
What do you all think about the idea of making this sort of ranking public? If people could put their rankings on their CVs, I bet there would suddenly be a lot more interest in writing good reviews... at least among the people for whom the CV still mattered.
February 14, 2007
A friend of mine pointed me to this website, Many Eyes. Basically, any random person can upload any sort of dataset, visualize the dataset in any number of ways, and then make the results publicly available so that anyone can see them.
The negative, of course, is much the same as with anything that "just anyone" can contribute to: there is a lot of useless stuff, and (if the source of the dataset is uncited) you don't know for sure how valid the dataset itself is. There may be a lot of positives, though: the volume of data alone is like a fantastic dream for many a social scientist; it's a great tool for getting "ordinary people" interested in doing their own research or analysis of their lives (for instance, I noticed some people graphing changes in their own sports performance over time); many of the interesting datasets have ongoing conversations about them; and only time will tell, but I imagine there is at least a chance this could end up being Wikipedia-like in its usefulness.
It may also serve as a template for data-sharing among scientists. Wouldn't it be nice if, every time you published, you had to make your dataset (or code) publicly available? We might already be trending in that direction, but some centralized location for scientific data-sharing sure would speed it along.
January 31, 2007
Most of us are aware of various distortions in reasoning that people are vulnerable to, mainly because of the heuristics we use to make decisions easier. I recently came across an article in Psychological Science called Choosing an inferior alternative that demonstrates a technique that will cause people to choose an alternative that they themselves have previously acknowledged to be inferior. This is interesting for two reasons: first, exactly how and why it works tells us something about the process by which our brains update (at least some sorts of) information; and second, because I expect advertisers, politicians, and master manipulators to start using these techniques any day now, and maybe if we know about them in advance we'll be more resistant. One can hope, anyway.
So what's the idea?
It's been known for a while that decision makers tend to slightly bias their evaluations of new data to support whatever alternative is currently leading. For instance, if I'm trying to choose between alternatives A, B, and C -- let's say they are restaurants and I'm trying to decide where to go eat -- when I learn about one attribute, say price, I'll tentatively rank them and decide that (for now) A is the best option. If I then learn about another attribute, say variety, I'll rerank them, but not in the same way I would have if I'd seen those two attributes at the same time: I'll actually bias it somewhat so that the second attribute favors A more than it otherwise would have. This effect is generally only slight, so if restaurant B is much better on variety and only slightly worse on price, I'll still end up choosing restaurant B: but if A and B were objectively about equal, or B was even slightly better, then I might choose A anyway.
Well, you can see where this is going. These researchers presented subjects with a set of restaurants and attributes in order to determine their objective "favorite." Then, two weeks later, they brought the same subjects in again and presented them with the same restaurants. This time, though, they had determined -- individually, for each subject -- the order of attributes that would most favor choosing the inferior alternative. (It gets a little more complicated than this, because in order to ensure that the subjects didn't recognize their choices from before, they combined nine attributes into six, but that's the essential idea.) Basically, what they did was pick the attribute that most favored the inferior choice and put it first, hoping to establish the inferior choice as the early leader. The attribute that second-most favored the inferior choice came last, to take advantage of recency effects. The other attributes were presented in pairs, specifically chosen so that the ones that most favored the superior alternative were paired with neutral or less favorable ones (thus, hopefully, "drowning them out").
The results were that when presented with the information in this order, 61% of people chose the inferior alternative. The good news, I guess, is that it wasn't more than 61% -- some people were not fooled -- but it was robustly different from chance, and definitely more than you'd expect (since, after all, it was the inferior alternative, and one would hope people would choose it less often). Moreover, people didn't realize they were doing this at all: they were actually more confident in their choice when they had picked the inferior alternative. Even when told about the effect and asked if they thought they themselves had shown it, they tended not to think so (and the participants who showed it most were no more likely to think they had than the ones who didn't).
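For concreteness, here's a minimal sketch of how that kind of leader-biased updating can produce an order effect. The scoring scheme, the attribute values, and the BIAS constant are all made up for illustration; this is not the authors' actual model.

```python
# A made-up scoring scheme for illustration; not the authors' model.
BIAS = 0.3  # hypothetical bonus given to whichever option currently leads

def choose(attribute_scores):
    """attribute_scores: list of (score_A, score_B) pairs, seen in this order."""
    total_a = total_b = 0.0
    for a, b in attribute_scores:
        if total_a > total_b:        # A currently leads: its new score gets inflated
            a += BIAS
        elif total_b > total_a:      # B currently leads: same distortion, other way
            b += BIAS
        total_a += a
        total_b += b
    return "A" if total_a > total_b else "B"

# B is objectively better: its scores sum to 2.9 vs. A's 2.7.
scores = [(0.9, 0.5),   # price: A's one strong attribute
          (0.6, 0.8),   # variety
          (0.6, 0.8),   # quality
          (0.6, 0.8)]   # service

print(choose(scores))        # A's strong attribute shown first: A wins
print(choose(scores[::-1]))  # same attributes in reverse order: B wins
```

Same evidence, different order, different "winner" -- which is the whole trick.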
I always get kind of depressed at this sort of result, mainly because I become convinced that this sort of knowledge is then used by unscrupulous people to manipulate others. I mean, it's probably always been used somewhat subconsciously that way, but making it explicit makes it potentially more powerful. On the plus side, it really does imply interesting things for how we process and update information -- and raises the question of why we bias the leading alternative, given that it's demonstrably vulnerable to order effects. Just to make ourselves feel better about our current choice? But why would this biasing do that - wouldn't we feel best of all if we knew we were being utterly rational the whole time? It's a puzzle.
January 17, 2007
I saw a thought-provoking post at John Baez's diary the other day pointing out an interesting analogy between natural selection and Bayesian inference, and I can't decide if I should classify it as just "neat" or as "neat, and potentially deep" (which is where I'm leaning). Because it's a rather lengthy post, I'll just quote the relevant bits:
The analogy is mathematically precise, and fascinating. In rough terms, it says that the process of natural selection resembles the process of Bayesian inference. A population of organisms can be thought of as having various "hypotheses" about how to survive - each hypothesis corresponding to a different allele. (Roughly, an allele is one of several alternative versions of a gene.) In each successive generation, the process of natural selection modifies the proportion of organisms having each hypothesis, according to Bayes' law!
Now let's be more precise:
Bayes' law says if we start with a "prior probability" for some hypothesis to be true, divide it by the probability that some observation is made, then multiply by the "conditional probability" that this observation will be made given that the hypothesis is true, we'll get the "posterior probability" that the hypothesis is true given that the observation is made.
Formally, the exact same equation shows up in population genetics! In fact, Chris showed it to me - it's equation 9.2 on page 30 of this book:
* R. Bürger, The Mathematical Theory of Selection, Recombination and Mutation, section I.9: Selection at a single locus, Wiley, 2000.
But, now all the terms in the equation have different meanings!
Now, instead of a "prior probability" for a hypothesis to be true, we have the frequency of occurrence of some allele in some generation of a population. Instead of the probability that we make some observation, we have the expected number of offspring of an organism. Instead of the "conditional probability" of making the observation, we have the expected number of offspring of an organism given that it has this allele. And, instead of the "posterior probability" of our hypothesis, we have the frequency of occurrence of that allele in the next generation.
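To make the correspondence concrete, here's a tiny numerical check of the shared equation (my own illustration, with made-up numbers -- not taken from Baez or the Bürger book): the next-generation frequency of allele i is p_i * w_i / w̄, which is exactly prior × likelihood / evidence.

```python
# Allele frequencies as "priors", fitnesses as "likelihoods" (made-up numbers).
freqs   = [0.5, 0.3, 0.2]   # frequency of each allele this generation
fitness = [1.0, 1.5, 0.5]   # expected offspring given each allele

# Mean fitness plays the role of the "probability of the observation".
mean_fitness = sum(p * w for p, w in zip(freqs, fitness))

# Bayes' rule / the selection equation: posterior = prior * likelihood / evidence.
next_gen = [p * w / mean_fitness for p, w in zip(freqs, fitness)]

print(next_gen)  # a proper distribution, shifted toward the fitter alleles
```

The new frequencies still sum to one, and selection has done exactly what Bayesian updating does: reweighted each "hypothesis" by how well it predicts offspring.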
Baez goes on to wonder, as I do, if people doing work on genetic programming or Bayesian approaches to machine learning have noticed this relationship. I feel like I would have remembered if I'd seen something like this (at least recently), and I don't remember anything, but that doesn't mean it's not there -- any pointers, anyone? [The closest I can think of is an interesting chapter (pdf) by David MacKay called "Why have sex? Information acquisition and evolution", but it's mainly about how one can use information theory to quantify the argument for why recombination (sex) is a better way to spread useful mutations and clear less-useful ones].
Also, re: the conceptual deepness of this point... I've long thought (and I'm sure I'm not alone in this) that it's useful to see natural selection as a guided search over genotype (or phenotype) space; Bayesian inference, i.e., searching over "problem space" so as to maximize posterior probability, seems to be a valuable and useful thing to do in machine learning and cognitive science. [Incidentally, I've also found it to be a useful rhetorical tool in discussing evolution with creationists -- the idea that computers can do intelligent searches over large spaces and find things with small "chance" probability is one that many of them can accept, and from there it's not so much of a leap to think that evolution might be analogous; it also helps them to understand how "natural selection" is not "random chance", which seems to be the common misunderstanding]. Anyway, in that superficial sense, it's perhaps not surprising that this analogy exists; on the other hand, the analogy goes deeper than "they are both searches over a space" -- it's more along the lines of "they are both trying to, essentially, maximize the same equation (posterior probability)."
Anyway, I'm now speculating on things I know very little about, and I should go read the Bürger book (which has been duly added to my ever-expanding reading list). But I thought I'd throw out these speculations right now anyway, since you all might find them interesting. And if anyone has any other references, I'd love to see them.
December 7, 2006
I've just spent this week at the annual NIPS conference; though its main focus seems to be machine learning, there are always interesting papers on the intersection of computational/mathematical methods in cognitive science and neuroscience. I thought it might be interesting to mention the highlights of the conference for me - which obviously tends to focus heavily on the cognitive science end of things. (Be aware that links (pdf) are to the paper pre-proceedings, not final versions, which haven't been released yet).
From Daniel Navarro and Tom Griffiths, we have A Nonparametric Bayesian Method for Inferring Features from Similarity Judgments. The problem, in a nutshell, is that if you're given a set of similarity ratings about a group of objects, you'd like to be able to infer the features of the objects from them. Additive clustering assumes that similarity is well-approximated by a weighted linear combination of common features. However, the actual inference problem -- actually finding the features -- has always been difficult. This paper presents a method for inferring the features (as well as figuring out how many features there are) that handles the empirical data well, and might even be useful for figuring out what sorts of information (i.e., what sorts of features) we humans represent and use.
From Mozer et al. comes Context Effects in Category Learning: An Investigation of Four Probabilistic Models. Some interesting phenomena in human categorization are the so-called push and pull effects: when shown an example from a target category, the prototype gets "pulled" closer to that example, and the prototypes of other related categories get pushed away. It has proven difficult to explain this computationally, and this paper considers four obvious candidate models. The best one uses a distributed representation and a maximum-likelihood learning rule (and thus tries to find the prototypes that maximize the probability of being able to identify the category given the example); it's interesting to speculate about what this might imply about humans. The main shortcoming of this paper, to my mind, is that they use very idealized categories; but it's probably a necessary simplification to begin with, and future work can extend it to categories with a richer representation.
The next is work from my own lab (though not me): Kemp et al. present an account of Combining causal and similarity-based reasoning. The central point is that people have developed accounts of reasoning about causal relationships between properties (say, having wings causes one to be able to fly) and accounts of reasoning about objects on the basis of similarity (say, if a monkey has some gene, an ape is more likely to have it than a duck is). But many real-world inferences rely on both: if a duck has gene X, and gene X causes enzyme Y to be expressed, it is likely that a goose has enzyme Y. This paper presents a model that intelligently combines causal- and similarity-based reasoning, and is thus able to predict human judgments more accurately than either of them alone.
Roger Levy and T. Florian Jaeger have a paper called Speakers optimize information density through syntactic reduction. They explore the (intuitively sensible, but hard to study) idea that people -- if they are rational -- should try to communicate in the information-theoretically optimal way: they should give more information at highly ambiguous points in a sentence, but not bother doing so at less ambiguous points (since adding information has the undesirable side effect of making utterances longer). They examine the use of reduced relative clauses (saying, e.g., "How big is the family you cook for" rather than "How big is the family THAT you cook for" -- the word "that" is extra information which reduces the ambiguity of the subsequent word "you"). The finding is that speakers choose to reduce the relative clause -- to say the first type of sentence -- when the subsequent word is relatively unambiguous; in other words, their choices are correlated with information density. One of the reasons this is interesting to me is that it motivates the question of why exactly speakers do this: is it a conscious adaptation to try to make things easier for the listener, or a more automatic/unconscious strategy of some sort?
There are a number of other papers that I found interesting -- Chemudugunta et al. on Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model; Roy et al. on Learning Annotated Hierarchies from Relational Data; and Greedy Layer-wise Training of Deep Networks by Bengio et al., to name a few -- so if this sort of thing interests you, I suggest checking out the NIPS proceedings when they come out. And if any of you went to NIPS also, I'd be curious what you really liked and think I should have included on this list!
November 17, 2006
Andrew Gelman has a link to a study that just came out in Nature Neuroscience whose author, Alex Pouget at the University of Rochester, suggests that "the cortex appears wired at its foundation to run Bayesian computations as efficiently as can be possible." I haven't read the paper yet, so I don't have much in the way of intelligent commentary, but I'll try to take a look at it soon. In the meantime, here is a link to the press release so you can read something about it even if you don't have access to Nature Neuroscience. From the blurb, it sounds pretty neat, especially if you (like me) are at all interested in the psychological plausibility of Bayesian models as applied to human cognition.
November 16, 2006
Since writing my last post (The cognitive style of better powerpoint), I noticed that two other bloggers wrote rather recently on the same topic. The first, from Dave Munger at Cognitive Daily, actually proposes a bit of an experiment to compare the efficacy of text vs. powerpoint - results to be posted Friday. The second, from Chad Orzel at Uncertain Principles, offers a list of "rules of thumb" for doing a good PowerPoint talk.
Given all this, you'd think I wouldn't have anything to add, right? Well, never underestimate my willingness to blather on and on about something. I actually think there's one thing neither they nor I discuss much, and that is presenting mathematical, technical, or statistical information. Both Orzel and I recommend, as much as possible, avoiding equations and math in your slides. And that's all well and good, but sometimes you just have to include some (especially if you're a math teacher and the talk in question is a lecture). For me, this issue crops up whenever I need to describe a computational model -- you need to give enough detail that it doesn't look like the results just come out of thin air, because if you don't, nobody will care about what you've done. And often "enough detail" means equations.
So, for whatever it's worth, here are my suggestions for how to present math in the most painless and effective way possible:
Abandon slideware. This isn't always feasible (for instance, if the conference doesn't have blackboards), nor even necessarily a good idea if the equation count is low enough and the "pretty picture" count is high enough, but I think slideware is sometimes overused, especially if you're a teacher. When you do the work on the blackboard, the students do it with you; when you do it on slideware, they watch. It is almost impossible to be engaged (or keep up) when rows of equations appear on slides; when the teacher works out the math on the spot, it is hard not to. (Okay, harder).
If you can't abandon slideware:
1. Include an intuitive explanation of what the equation means. (This is a good test to make sure you understand it yourself!). Obviously you should always do this verbally, but I find it very useful to write that part in text on the slide also. It's helpful for people to refer to as they try to match it with the equation and puzzle out how it works and what it means -- or, for the people who aren't very math-literate, to still get the gist of the talk without understanding the equation at all.
2. Decompose the equation into its parts. This is really, really useful. One effective way to do this is to present the entire thing at once, and then go through each term piece by piece, visually "minimizing" the others as you do so (either grey them out or make them smaller). As a trivial example, consider the equation z = x/y. You might first grey out y and talk about x. Then grey out x and talk about y: you might note that y is the denominator, so that as y gets larger the result gets smaller, etc. My example is totally lame, but this sort of thing can be tremendously useful when the equations get more complicated. People obviously know what numerators and denominators are, but it's still valuable to explicitly point out in a talk how the behavior of your equation depends on its component parts -- people could probably figure it out given enough time, but they don't have that time, particularly when it's all presented in the context of loads of other new information. And if the equation is important enough to put up, it's important enough to make sure people understand all of its parts.
3. As Orzel mentioned, define your terms. When you go through the parts of the equation you should verbally do this anyway, but a little "cheat sheet" there on the slide is invaluable. I find it quite helpful sometimes to have a line next to the equation that translates the equation into pseudo-English by replacing the math with the terms. Using my silly example, that would be something like "understanding (z) = clarity of images (x) / number of equations (y)". This can't always be done without cluttering things too much, but when you can, it's great.
4. Show some graphs exploring the behavior of your equation. ("Notice that when you hold x steady, increasing y results in smaller z"). This may not be necessary if the equation is simple enough, but if it's simple enough maybe you shouldn't present it, and just mention it verbally or in English. If what you're presenting is an algorithm, try to display pictorially what it looks like to implement the algorithm. Also, step through it on a very simple dataset. People remember and understand pictures far better than equations most of the time.
5. When referring back to your equation later, speak English. By this I mean that if you have a variable y whose rough English meaning is "number of equations", whenever you talk about it later, refer to it as "number of equations", not y. Half of the people won't remember what y is after you move on, and you'll lose them. If you feel you must use the variable name, at least try to periodically give reminders about what it stands for.
6. Use LaTeX where possible. LaTeX produces equations that are clean and easy to read, unlike PowerPoint (even with lots of tweaking). You don't necessarily have to do the entire talk in LaTeX if you don't want to, but at least typeset the equations in LaTeX, screen-capture them, save them as images, and paste them into PowerPoint. They are much, much easier to read.
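For instance, a minimal document like the following (the standalone class is just one convenient option; any stripped-down LaTeX document you compile and crop works just as well) produces a single clean equation image for a slide, with the "cheat sheet" labels from tip 3 built right in:

```latex
% One equation per document; compile, then crop/screen-capture for the slide.
\documentclass[border=2pt]{standalone}
\usepackage{amsmath}
\begin{document}
$\displaystyle
  \underbrace{z}_{\text{understanding}} =
  \frac{\overbrace{x}^{\text{clarity of images}}}
       {\underbrace{y}_{\text{number of equations}}}
$
\end{document}
```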
Obviously, these points become more or less important depending on the mathematical sophistication of your audience, but I think it's far, far easier to make mathematical talks too difficult than too simple. This is because it's not a matter (or not mainly a matter) of sophistication -- some of the most egregious violators of these suggestions that I've seen have been at NIPS, a machine learning conference -- it's a matter of how much information your audience can process in a short amount of time. No matter how mathematically capable your listeners are, it takes a while (and a fair amount of concentration) to see the ramifications and implications of an equation or algorithm while simultaneously fitting it in with the rest of your talk, keeping track of your overall point, and thinking of how all of this fits in with their own research. The easier you can make that process, the more successful the talk will be.
Any agreements, disagreements, or further suggestions, I'm all ears.
November 9, 2006
While at the BUCLD conference this last weekend, I found myself thinking about the cognitive effects of using PowerPoint presentations. If you haven't read Edward Tufte's Cognitive Style of PowerPoint, I highly recommend it. His thesis is that powerpoint is "costly to both content and audience", basically because of the cognitive style that standard default PPT presentations embody: hierarchical path structure for organizing ideas, emphasis on format over content, and low information resolution chief among them.
Many of these negative results -- though not all -- occur because of a "dumb" use of the default templates. What about good powerpoint, that is, powerpoint that isn't forced into the hierarchical path-structure of organization, that doesn't use hideous, low-detail graphs? [Of course, this definition includes other forms of slide presentation, like LaTeX; I'll use the word "slideware" to mean all of these]. What are the cognitive implications of using slideware, as opposed to other types of presentation (transparencies, blackboard, speech)?
Here are my musings, unsubstantiated by any actual research:
I'd bet that the reliance on slideware actually improves the worst talks: whatever its faults, it at least imposes organization of a sort. And it at least gives a hapless audience something to write down and later try to puzzle over, which is harder to do if the talk is a rambling monologue or involves scribbled, messy handwriting on a blackboard.
Perhaps more controversially, I also would guess that slideware improves the best talks - or, at least, that the best talks with slideware can be as good as the best talks using other media. The PowerPoint Gettysburg Address is a funny spoof, but seriously, can you imagine a two-hour long, $23-million-gross movie of someone speaking in front of a blackboard or making a speech? An Inconvenient Truth was a great example of a presentation that was enhanced immeasurably by the well-organized and well-displayed visual content (and, notably, it did not use any templates that I could tell!). In general, because people are such visual learners, it makes sense that a presentation that can incorporate that information in the "right" way will be improved by doing so.
However, I think that for mid-range quality presenters (which most people are) slideware is still problematic. Here are some things I've noticed:
1. Adding slides is so simple and tempting that it's easy to mismanage your time. I've seen too many presentations where the last 10 minutes are spent hastily running through slide after slide, so the audience loses all the content in the disorganized mess the talk has become.
2. Relatedly, slideware creates the tendency to present information faster than it can be absorbed. This is most obvious when the talk involves math -- which I might discuss in a post of its own -- but the problem occurs with graphs, charts, diagrams, or any other high-content slides (which are otherwise great to have). Some try to solve the problem by creating handouts, but the problem isn't just that the audience doesn't have time to copy down the content -- they don't have the time to process it. Talks without slideware, by forcing you to present content at about the pace of writing, give the audience more time to think about the details and implications of what you're saying. Besides, the act of copying it down itself can do wonders for one's understanding and retention.
3. Most critically, slideware makes it easier to give a talk without really understanding the content or having thought through all the implications. If you can talk about something on an ad hoc basis, without the crutch of having everything written out for you, then you really understand it. This isn't to say that giving a slideware presentation means you don't really understand your content; just that it's easier to get away with not knowing it.
4. Also, Tufte mentioned that slideware forces you to package your ideas into bullet-point size units. This is less of a problem if you don't slavishly follow templates, but even if you don't, you're limited by the size of the slide and font. So, yeah, what he said.
That all said, I think slideware is here to stay; plus, it has many advantages over other types of presentation. So my advice isn't to avoid slideware (except, perhaps, for math-intensive talks). Just keep these problems in mind when making your talks.
October 31, 2006
In a previous post about the Gerber & Malhotra paper about publication bias in political science, I rather optimistically opined that the findings -- that there were more significant results than would be predicted by chance, and that many of those were suspiciously close to 0.05 -- were probably not deeply worrisome, at least for those fields in which experimenters could vary the number of subjects run based on the significance level achieved thus far.
Well, I now disagree with myself.
This change of mind comes as a result of reading about the Jeffreys-Lindley paradox (Lindley, 1957), a Bayes-inspired critique of significance testing in classical statistics. It says, roughly, that with a large enough sample size, a p-value can be arbitrarily close to zero even though the posterior probability of the null hypothesis is high (i.e., very close to one). In other words, a classical statistical test might reject the null hypothesis at an arbitrarily low p-value, even though the evidence that it should be accepted is overwhelming. [A discussion of the paradox can be found here.]
When I learned about this result a few years ago, it astonished me, and I still haven't fully figured out how to deal with all of the implications. (This is obvious, since I forgot about it when writing the previous post!) As I understand the paradox, the intuitive idea is that, with a larger sample size, you will naturally get some data that appear unlikely (and the more data you collect, the more likely you are to see some really unlikely data). If you forget to compare the probability of the data under the null hypothesis with its probability under the alternative hypotheses, then you might get an arbitrarily low p-value (indicating that the data are unlikely under the null hypothesis) even if the data are still more unlikely under every alternative. Thus, if you look only at the p-value, without taking into account effect size, sample size, or the comparative posterior probability of each hypothesis under consideration, you are liable to wrongly reject the null hypothesis even when it is the most likely of all the possibilities.
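Here's a rough numerical sketch of the paradox (my own, using the standard textbook setup of a normal mean with a normal prior under the alternative; the numbers are illustrative): hold the observed z-statistic fixed at 2.5, so p is about 0.012 regardless of n, and watch the Bayes factor swing toward the null as n grows.

```python
import math

# H0: mu = 0; H1: mu ~ N(0, tau2); data x_i ~ N(mu, 1), so xbar = z / sqrt(n).
def bf01(z, n, tau2=1.0):
    """Bayes factor in favor of H0 over H1 for an observed z-statistic."""
    shrink = n * tau2 / (1.0 + n * tau2)
    return math.sqrt(1.0 + n * tau2) * math.exp(-0.5 * z * z * shrink)

for n in (10, 1000, 100000):
    print(n, round(bf01(2.5, n), 2))
# p stays ~0.012 at every n, but the Bayes factor moves from favoring H1
# (BF01 < 1) to strongly favoring H0 (BF01 >> 1) as n grows.
```

Same p-value, wildly different evidential meaning depending on sample size -- which is exactly why looking only at the p-value misleads.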
The tie-in with my post before, of course, is that it implies that it isn't necessarily "okay" practice to keep increasing sample size until you achieve statistical significance. Of course, in practice, sample sizes rarely get larger than 24 or 32 -- at the absolute outside, 50 to 100 -- which is much smaller than infinity. Does this practical consideration, then, mean that the practice is okay? As far as I can tell, it is fairly standard (but then, so is the reliance on p-values to the exclusion of effect sizes, confidence intervals, etc., so "common" doesn't mean "okay"). Is this practice a bad idea only if your sample gets extremely large?
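As a quick illustration of why the "keep adding subjects until p < 0.05" practice is worrisome even at those modest sample sizes, here's a toy simulation (my own, with made-up parameters): the null hypothesis is true, and we test after every new subject from n = 10 up to n = 50, stopping as soon as p < 0.05.

```python
import math
import random

def peeking_experiment(max_n=50, min_n=10, z_crit=1.96):
    """The null is TRUE (data ~ N(0,1)); test after each new subject."""
    xs = []
    for _ in range(max_n):
        xs.append(random.gauss(0.0, 1.0))
        n = len(xs)
        if n < min_n:
            continue
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)
        z = mean / math.sqrt(var / n)
        if abs(z) > z_crit:    # crude z-test in place of a t-test
            return True        # declared "significant": a false positive
    return False

random.seed(0)
trials = 2000
rate = sum(peeking_experiment() for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05
```

Even capped at n = 50, repeatedly peeking inflates the false-positive rate to roughly two or three times the nominal 5%, so the practice is problematic well before samples get "extremely large."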
Lindley, D. V. (1957). A statistical paradox. Biometrika, 44, 187-192.
October 10, 2006
I can't resist chiming in and contributing post VI on causation and manipulation, but coming at it from a rather different angle: rather than asking what we as researchers should do, the cognitive science question is what people and children do do -- what they assume and know about causal inference and understanding.
You might think that people would (for lack of a better term) suck at this, given other well-known difficulties in reasoning, anecdotal reports from educators everywhere, etc, etc. However, there's a fair amount of evidence that people -- both adults and children -- can be quite sophisticated causal reasoners. The literature on this is vast and growing, so let me just point out one quite interesting finding, and maybe I'll return to the topic in later posts.
One question is whether children are capable of using the difference between evidence from observation and evidence from intervention (manipulation) to infer different causal structures. The well-named "theory theory" of development suggests that children are like small scientists and should therefore be quite sophisticated causal reasoners at an early age. To test this, Schulz, Kushnir, & Gopnik [pdf] showed preschool children a special "stickball machine" consisting of a box out of which two sticks (X and Y) rose vertically. The children were told that some sticks were "special" and could cause other sticks to move, and some weren't. In the control condition, children saw X and Y move together on their own three times; the experimenter then intervened to pull on Y, causing it to move while X failed to move. In the experimental condition, the experimenter pulled on one stick (X) and both X and Y moved, three times; a fourth time the experimenter pulled on Y, but only it moved (X was stationary).
The probability of each stickball moving, conditioned on the other's moving, is the same in both conditions; however, if the children reason about causal interventions, then the experimental group -- but not the control group -- should infer that X might cause Y to move (but not vice versa). And indeed, this was the case.
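To see why the observational statistics alone can't distinguish the conditions, here's a toy encoding of the trials (my own reconstruction of the design as described above, not the authors' data or analysis):

```python
# Each trial: (intervened_on, x_moved, y_moved). The trial counts follow
# the description above; this is a reconstruction, not the paper's data.
control      = [("none", 1, 1)] * 3 + [("Y", 0, 1)]
experimental = [("X", 1, 1)] * 3 + [("Y", 0, 1)]

def p_moves_given_other(trials, target, given):
    """P(target stick moves | the other stick moves), ignoring interventions."""
    relevant = [t for t in trials if t[given]]
    return sum(t[target] for t in relevant) / len(relevant)

for name, trials in (("control", control), ("experimental", experimental)):
    print(name,
          p_moves_given_other(trials, target=1, given=2),   # P(X moves | Y moves)
          p_moves_given_other(trials, target=2, given=1))   # P(Y moves | X moves)
# Both conditions yield 0.75 and 1.0: the conditional probabilities are
# identical, and only the intervention labels can tell the two apart.
```

A purely correlational learner sees the same evidence in both conditions; an interventionist learner notices that pulling X made Y move while pulling Y did not make X move -- which appears to be the information the children in the experimental group exploit.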
Children are also good at detecting interventions that are obviously confounded, overriding prior knowledge, and taking base rate into account (at least somewhat). As I said, this is a huge (and exciting) literature, and understanding people's natural propensities and abilities to do causal reasoning might even help us address the knotty philosophical problems of what a cause is in the first place.
September 26, 2006
I'm a little late into the game with this, but it's interesting enough that I'll post anyway. Several folks have commented on this paper by Gerber and Malhotra (which they linked to) about publication bias in political science. G&M looked at how many articles were published with significant (p<0.05) vs. non-significant results, and found -- not surprisingly -- that there were more papers with significant results than would be predicted by chance; and, secondly, that many of the significant results were suspiciously close to 0.05.
I guess this is indeed "publication bias" in the sense of "there is something causing articles with different statistical significance to be published differentially." But I just can't see this as something to be worried about. Why?
Well, first of all, there's plenty of good reason to be wary of publishing null results. I can't speak for political science, but in psychology, a result can be non-significant for many, many more boring reasons than that there is genuinely no effect. (And I can't imagine why this would be different in poli sci). For instance, suppose you want to prove that there is no relation between 12-month-olds' abilities in task A and task B. It's not sufficient to show a null result. Maybe your sample size wasn't large enough. Maybe you're not actually succeeding in measuring their abilities in one or both of the tasks (this is notoriously difficult with babies, but it's no picnic with adults either). Maybe A and B are related, but the relation is mediated by some other factor that you happen to have controlled for. Et cetera. Now, this is not to say that no null results are meaningful or that null results should never be published, but a researcher -- quite rightly -- needs to do a lot more work to make a null result pass the smell test. And so it's a good thing, not a bad thing, that there are fewer null results published.
Secondly, I'm not even worried about the large number of studies that are just over significance. Maybe I'm young and naive, but I think it's probably less an indication of fudging data than a reflection of (quite reasonable) resource allocation. Take those same 12-month-old babies. If I get significant results with N=12, then I'm not going to run more babies in order to get more significant results. Since, rightly or wrongly, the gold standard is the p<0.05 value (which is another debate entirely), it makes little sense to waste time and other resources running superfluous subjects. Similarly, if I've run, say, 16 babies and my result is almost p<0.05, I'm not going to stop; I'll run 4 more. Obviously there is an upper limit on the number of subjects, but -- given the essential arbitrariness of the 0.05 value -- I can't see this as a bad thing either.
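The upshot of this "run a few more" strategy is easy to simulate. The sketch below is my own toy setup (a z-test with known variance and a modest true effect, numbers invented for illustration): it compares a fixed N=16 design with one that adds 4 subjects whenever the result is nearly significant, exactly the practice described above.

```python
import math
import random

random.seed(1)

def p_value(sample):
    """Two-sided z-test of mean zero, treating sigma=1 as known
    (a simplification so the sketch needs only the standard library)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

sig_fixed = 0    # significant results with a fixed N of 16
sig_topped = 0   # significant results when near-misses get 4 more subjects
for _ in range(2000):
    sample = [random.gauss(0.4, 1) for _ in range(16)]  # a modest true effect
    p = p_value(sample)
    if p < 0.05:
        sig_fixed += 1
    elif p < 0.15:   # "almost significant": run a few more subjects
        sample += [random.gauss(0.4, 1) for _ in range(4)]
        p = p_value(sample)
    if p < 0.05:
        sig_topped += 1

# Topping up converts some near-misses into (honest) significant results,
# which lands extra p-values just under the 0.05 threshold.
print(sig_fixed, sig_topped)
```

Note that nothing here is data fudging: the extra subjects are real draws from the same distribution. The procedure simply concentrates some published p-values near the threshold, consistent with the pattern G&M observed.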
September 15, 2006
Ah, the beginning of fall term -- bringing with it the first anniversary of this blog (yay!), a return to our daily posting schedule (starting soon), and a question for you, our readers:
Do you have any feedback for us? Specifically, are there topics, issues, or themes you would like us to cover more (or less) than we do? Would you like to see more discussion of specific content and papers? More posts on higher-level, recurring issues in each of our fields (or across fields)? More musings about teaching, academia, or the sociology of science? Obviously the main factor in what we write about comes down to our whims and interests, but it's always nice to write things that people actually want to read.
In my specific case, I know that I try not to blog about many cognitive science and psychology topics that I think about if they aren't directly related to statistics or statistical methods in some way: I fear that it wouldn't be of interest to readers who come here for a blog about "Social Science Statistics". However, maybe I've been needlessly restrictive...?
So, readers, what are your opinions?
June 18, 2006
A friend emailed this to me; apparently the teaching assistants at the University of Oregon have creative as well as statistical talents. It's pretty funny. Perhaps every intro to statistics class could begin with a showing... video here
May 20, 2006
It's the end of the term for both Harvard and MIT... so in view of the fact that we on the authors committee are about to embark on summers of tireless dedication to research while scattered to the far reaches of the planet, posting to this blog will be reduced until fall.
A special thanks to the loyal readers and commenters of this blog -- you folks have made this year a really rewarding experience for us. We won't stop posting entirely, so we do hope you'll still stop by occasionally and will still be with us when we resume a full schedule at the end of the summer.
May 15, 2006
Google has just come out with a new tool, Google Trends, which compares the frequencies of different web searches and thus provides hours of entertainment to language and statistics geeks like myself. In honor of that -- and, okay, because it's nearing the end of the term and I'm just in the mood -- here's a rather frivolous post dedicated to the tireless folks at Google, for entertaining me today.
1) One thing that is interesting (though in hindsight not surprising) is that Google Trends seems like a decent tool for identifying how marked a form is. The basic idea is that a default term is unmarked (and often unsaid), but the marked term must be used in order to communicate that concept. For instance, in many sociological domains, "female" is marked more than "male" is -- hence people refer to "female Presidents" a lot more than they refer to "male Presidents", even though there are many more of the latter: the adjective "male" is unnecessary because it just feels redundant. In contrast, you much more often say "male nurse" than "female nurse", because masculinity is marked in the nursing context.
Anyway, I noticed that for many sets of words, the term that is searched for most often is the marked term, even though the unmarked term probably occurs numerically more often. For instance, Blacks, whites indicates far more queries for "blacks"; Gay, straight many more for "gay"; and Rich, poor, middle class the most for rich, followed by poor, and least of all middle class.
I have two hypotheses to explain this: (a) people generally google for information, and seek information about what they don't know; the non-default, usually numerically smaller category is exactly the one people know less about. And, (b) since the unmarked term doesn't need to be used, it's not really a surprise that people don't use it. Still, I thought it was interesting. And clearly this phenomenon, if real at all, is at most only one of many factors affecting query frequency: for instance, Christian, atheist, muslim indicates far more hits for "Christian", and those disproportionately from very Christian areas.
2) Another observation: the first five numbers seem to have search frequencies that drop by half with each consecutive number. Is this interesting for cognitive reasons? I have no idea.
3) As far as I can tell, no search occurs more often than "sex." If anyone can find something with greater frequency, I'd love to hear it. On the one hand, it may say good things for our species that "love" beats out "hate", but that may just mean more people are searching for love than hate. And "war" beats out "peace", sadly enough.
4) "Hate bush" peaked right before the 2004 election, "love bush" about six months before that. I have no idea what that's all about.
5) It's amazing to me how many people clearly must use incredibly unspecific searches: who searches for "one"? Or "book"? Though there is no indication of numbers (a y axis on these graphs would be incredibly handy), a search needs a minimum number of queries otherwise it won't show up, so somebody must be making these.
6) In conclusion, I note that Harvard has more queries than MIT. Does this mean that MIT is the "default"? Or that Harvard generates more interest? Since I'm an MIT student but writing for a Harvard blog, I plead conflict of interest...
April 28, 2006
I've posted before about the "irrational" reasoning people use in some contexts, and how it might stem from applying cognitive heuristics to situations they were not evolved to cover. Lest we fall into the depths of despair about human irrationality, I thought I'd talk about another view on this issue, this time showing that people may be more rational than those results suggest.
In Simple heuristics that make us smart, Gigerenzer et al. argue that, contrary to popular belief, many of the cognitive heuristics people use are actually very rational given the constraints on memory and time that we face. One strand of their research suggests that people are far better at reasoning about probabilities when they are presented as natural frequencies rather than as percentages or probabilities (as most studies present them). Thus, for instance, if people see pictures of, say, 100 cars, 90 of which are blue, they are less likely to "forget" this base rate than if they are just told that 90% of cars are blue.
A recent paper in the journal Cognition (vol 98, 287-308) expands on this theme. Zhu & Gigerenzer found that children steadily gain in the ability to reason about probabilities, as long as the information is presented using natural frequencies. Children were told a story such as the following:
Pingping goes to a small village to ask for directions. In this village, the probability that the person he meets will lie is 10%. If a person lies, the probability that he/she has a red nose is 80%. If a person doesn't lie, the probability that he/she also has a red nose is 10%. Imagine that Pingping meets someone in the village with a red nose. What is the probability that the person will lie?
Another version of the story gave natural frequencies instead of conditional probabilities, for instance "of the 10 people who lie, 8 have a red nose." None of the fourth-grade through sixth-grade children could answer the conditional probability question correctly, but sixth graders approached the performance of adult controls for the equivalent natural frequency question: 53% of them matched the correct Bayesian posterior probability. The fact that none of the kids could handle the probability question is not surprising -- they had not yet been taught the mathematical concepts of probability and percentage. What is striking is how well they did when the very same information was presented as natural frequencies.
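For the record, here is the arithmetic behind the Pingping problem, done both ways; the 100-villager breakdown is my own rendering of the natural-frequency version.

```python
# Conditional-probability version: Bayes' rule with the story's numbers.
p_lie = 0.10
p_red_given_lie = 0.80
p_red_given_honest = 0.10

posterior = (p_lie * p_red_given_lie) / (
    p_lie * p_red_given_lie + (1 - p_lie) * p_red_given_honest)

# Natural-frequency version: imagine 100 villagers.  10 lie, and 8 of
# those have red noses; 90 don't lie, and 9 of those have red noses.
red_nosed_liars = 8
red_nosed_honest = 9
posterior_freq = red_nosed_liars / (red_nosed_liars + red_nosed_honest)

print(round(posterior, 2), round(posterior_freq, 2))  # prints: 0.47 0.47
```

Both routes give 8/17, about 47% -- but in the frequency format the answer is just "8 red-nosed liars out of 17 red noses," which even a sixth grader can count.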
The most interesting part of this research, for me, is less about the question of whether people "are Bayesian" (whatever that means) than about the very important message it highlights: representation matters. When information is presented using a representation that is natural, we find it a lot easier to reason about it correctly. I wonder how many of our apparent limitations reveal less about problems with our reasoning, and more about the choice of representation or the nature of the task.
April 21, 2006
Since the days of Kahneman & Tversky, researchers have been finding evidence showing that people do not reason about probabilities as they would if they were "fully rational." For instance, base-rate neglect -- in which people ignore the frequency of different environmental alternatives when making probability judgments about them -- is a common problem. People are also often insensitive to sample size and to the prior probability of various outcomes. (this page offers some examples of what each of these mean).
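To make base-rate neglect concrete, here is a worked version of Kahneman & Tversky's classic cab problem (the numbers come from that classic example, not from anything in this post).

```python
# Base rates: 85% of cabs are Green, 15% are Blue.  A witness who is
# correct 80% of the time testifies that the cab in an accident was Blue.
p_blue = 0.15
p_say_blue_given_blue = 0.80
p_say_blue_given_green = 0.20

p_blue_given_testimony = (p_blue * p_say_blue_given_blue) / (
    p_blue * p_say_blue_given_blue + (1 - p_blue) * p_say_blue_given_green)

# Respondents typically answer ~0.80 -- the witness's reliability alone,
# with the base rate neglected.  Bayes' rule says it's well under half:
print(round(p_blue_given_testimony, 2))  # prints: 0.41
```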
A common explanation is that these "errors" arise as the result of using certain heuristics that usually serve us well, but lead to this sort of error in certain circumstances. Thus, base-rate neglect arises due to the representativeness heuristic, in which people assume that each case is representative of its class. So, for instance, people watching a taped interview with a prison guard with extreme views will draw conclusions about the entire prison system based on this one interview -- even if they were told in advance that his views were extreme and unusual, and that most guards were quite different. The prison guard was taken to be representative of prison guards in general.
In many circumstances, a heuristic of this sort is sensible: after all, it's statistically unlikely to meet up with someone or something that is, uh, statistically unlikely -- so it makes sense to usually assume that whatever you interact with is representative of things of that type. The problem is -- and here I'm harking back to a theme I touched on in an earlier post -- that this assumption no longer works in today's media-saturated environment. Things make it into the news precisely because they are unusual, so treating what we see there as representative gets matters exactly backwards.
March 27, 2006
This week is spring break for both Harvard and MIT, so as per usual, we will be posting less this week. Enjoy the (sort of) spring sunshine!
March 24, 2006
Gary's posts about teaching and breakfast cereal reminded me of a teaching experience I had once while teaching in the Peace Corps in Mozambique -- this time regarding the scientific method and hypothesis testing. It might be nothing particularly exciting to those of you who habitually teach pre-college level science, but I was surprised at how well it worked.
My (secondary-level) students were extremely good at memorizing facts, but they had a very hard time learning and applying the scientific method (as many do, I think). Since I see the method as the root of what makes science actually scientific and I didn't want them to have the view that science was just a disconnected collection of trivia, this was deeply problematic -- all the more frustrating to me because I could see that, in real life, they used the scientific method routinely. We all do, whenever we try to explain people's behavior or solve any of the everyday puzzles that confront us. The trick was to demystify it, to make them see that as well.
The next day I brought in an empty coke bottle. It's not vital that this be done with a coke bottle; in fact I imagine if you have more choice of materials than I had in Africa, you could find something even better. Basically I wanted something that was very familiar to them, to underscore the point that scientific reasoning is something they did all the time.
I held up the empty coke bottle. "What do you suppose had been in it?" I asked. This was the PROBLEM. "Coke!" everyone replied. "Okay," said I, "but I could have used it after the coke was gone for something else, right? What else could it have held?" Once again, people had no trouble suggesting possibilities -- water, gasoline, tea, other kinds of soda. I pointed out that they had just GENERATED HYPOTHESES, and wrote them on the board, along with coke.
Now, I asked them, how could you find out if your hypothesis was correct? They'd ask me, they said, and I pointed out that this was one way of TESTING the hypothesis. But suppose I wasn't around, or lied to them - what else could they do? One student suggested smelling it, and another (thinking about the gasoline hypothesis) suggested throwing a match in and seeing if it caught fire. "Both of these are good tests," I said, "and you'll notice that each of them is good for certain specific hypotheses; the match one wouldn't tell the difference between tea and other kinds of soda, for instance, and smelling it wouldn't help if it were water."
Then I asked a volunteer to come up and actually perform the test - to smell it, since we didn't have any matches. He did, and reported back that it smelled like Fanta even though it was a coke bottle. This, I said, was the RESULT, and it enabled the class to draw the CONCLUSION - that I had put Fanta in the bottle after drinking all of the original coke.
The best part of this demo came when a student, seeking to "trap" me, pointed out that I could still have had water or tea in the bottle, just long enough ago that the Fanta smell was stronger. "Exactly!" I replied. This points out the two limitations of the scientific method -- the validity of your conclusion depends on your hypotheses and on how good your methods of testing are. There are always a potentially infinite number of hypotheses you haven't ruled out, and therefore we cannot draw any conclusion with 100% certainty. Plus, if our test can't tell the difference between two hypotheses, then we can't decide between those two. For this reason it's very important to have hypotheses that you can test, and to work to develop better methods of testing so that you can rule out more of the plausible hypotheses.
This led to a good discussion about the pros and cons of the scientific method and how it compared to other ways of understanding the world. If I had had more time, equipment, or room, I had hoped to make it more interactive, with stations where they had to apply the method to lots of simple real-world problems; but even as it was, it was valuable.
I was surprised at how well this demo worked... not only did they immediately understand how to apply the scientific method, but they also understood its limitations in a way that I think many people don't, even by college age. As the semester advanced, I found myself referring back to the lesson often ("remember the empty coke bottle") when I'd try to explain how we knew what we knew. And I think it was very freeing for them to realize that science wasn't some mysterious system of rules passed down from on high, but rather the best explanation we had so far (and the best way we knew of how to get that explanation). My favorite result of this demo was their realization that scientists were people just like themselves, and that they too could do it -- in fact, they already were.
March 16, 2006
In my last post I talked about computational vs. algorithmic level descriptions of human behavior, and I argued that most Bayesian models of reasoning are examples of the former -- and thus make no claims about whether and to what extent the brain physically implements them.
A common statement at this point is that "of course your models don't say anything about the brain -- they are so complicated, how could they? Do people really do all that math?" I share the intuition: the models do look complex, and I am certainly not aware of doing anything like this when I think, but I don't think the possibility can be rejected out of hand. In other words, while it's certainly possible that human brains do nothing like, say, MCMC [insert complicated computational technique here], it's not a priori obvious. Why?
I have three reasons. First of all, we really don't have any good conception of what the brain is capable of computationally -- it has billions of neurons, each of which has thousands of connections, and (unlike modern computers) it is a massively parallel computing device. State-of-the-art techniques like MCMC look complicated when written out as mathematical equations -- particularly to those who don't come from that background -- but that doesn't mean, necessarily, that they are complicated in the brain.
Secondly, every model I've seen generally gets its results after running for at most a week, usually for only a few minutes -- much less time than a human has to go about and form theories of the world. If you are studying how long-term theories or models of the world form, it's not at all clear how to compare the time a computer takes to the time a human takes: not only are the scales really different, so is the data they get (models generally have cleaner data, but far less) and so is the speed of processing (computers are arguably faster, but if a human can do in parallel what a computer does serially, this might mean nothing). The point is that comparing a computer after 5 minutes to a human over a lifetime might not be so silly after all.
Thirdly, both the strength and weakness of studying cognitive science is that we have clear intuitions about what cognition and thinking are. It's a strength in that it helps us judge hypotheses and have good intuitions -- but it's a weakness in that it causes us to accept or reject ideas based on these intuitions when maybe we really shouldn't. There's a big difference between conscious and unconscious reasoning, and most (if not all) of our intuitions are based on how we see ourselves reason consciously. But just because we aren't aware of, say, doing Hebbian learning doesn't mean we aren't. It's striking to me that people who make Bayesian models of vision rarely have to deal with questions like "but people don't do that! it's so complicated!" This in spite of the fact that it's the same brain. I think this is probably because we don't have conscious awareness of the process of vision, and so don't think we know how it works. But to the extent that higher cognition is unconscious, the same point applies. It's just easy to forget.
Anyway, I'd be delighted to hear objections to any of these three reasons. As I said in the last post, I'm still sorting out these issues to myself, so I'm not really dogmatically arguing any of this.
March 10, 2006
Anyone who is interested in Bayesian models of human cognition has to wrestle with the issue of whether people use the same sort of reasoning (and, if so, to what extent this is true, and how our brains do that). I'll be doing a series of posts exploring what I think about this issue (which isn't really set in stone yet -- so think of this as "musing out loud" rather than saying "this is the way it is").
First: what does it mean to say that people are (or are not) Bayesian?
In many ways the question of whether people do the "same thing" as the model is a red herring: I use Bayesian models of human cognition in order to provide computational-level explanations of behavior, not algorithmic-level explanations. What's the difference? A computational-level explanation seeks to explain a system in terms of the goals of the system, the constraints involved, and the way those various factors play out. An algorithmic-level explanation seeks to explain how the brain physically does this. So any single computational explanation might have a number of possible different algorithmic implementations. Ultimately, of course, we would like to understand both: but I think most phenomena in cognitive science are not well enough understood on the computational level to make understanding on the algorithmic level very realistic, at least not at this stage.
To illustrate the difference between computational and algorithmic, I'll give an example. People given a list of words to memorize show certain regular types of mistakes. If the list contains many words with the same theme - say, all having to do with sports, but never the specific word "sport" - people will nevertheless often incorrectly "remember" seeing "sport". One possible computational-level explanation of what is going on might suggest, say, that the function of memory is to use the past to predict the future. It might further say that there are constraints on memory deriving from limited capacity and limited ability to encode everything in time, and that as a result the mind seeks to "compress" information by encoding the meaning of words rather than their exact form. Thus, it is more likely to "false positive" on words with similar meanings but very different forms.
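To see how a compression account like this produces the false "memory", here is a deliberately toy sketch: the feature vectors are entirely made up, and memory is caricatured as keeping only the average ("gist") of the studied words.

```python
import math

def cosine(a, b):
    """Similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical semantic features: [athletic, competitive, outdoor, edible]
studied = {
    "ball":    [0.9, 0.8, 0.6, 0.0],
    "referee": [0.7, 0.9, 0.5, 0.0],
    "goal":    [0.8, 0.9, 0.7, 0.0],
}

# Memory keeps only the compressed gist: the average of what was studied.
gist = [sum(v[i] for v in studied.values()) / len(studied) for i in range(4)]

lures = {"sport": [0.9, 0.9, 0.6, 0.0],   # semantically central, never shown
         "bread": [0.0, 0.0, 0.1, 0.9]}   # unrelated control

# Recognition = similarity to the gist.  "sport" matches the gist as well
# as any studied word, while "bread" does not -- hence the false memory.
for word, vec in {**studied, **lures}.items():
    print(word, round(cosine(vec, gist), 3))
```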
That's one of many possible computational-level explanations of this specific memory phenomenon. The huge value of Bayesian models (and computational models in general) is that they make this type of explanation rigorous and testable - we can quantify "limited capacity" and what is meant by "prediction" and explore how they interact with each other, so we're not just throwing words around. There is no claim in most computational cognitive science, implicit or explicit, that people actually implement the same computations our models do.
There is still the open question of what is going on algorithmically. Quite frankly, I don't know. That said, in my next post I'll talk about why I don't think we can reject out of hand the idea that our brains are implementing something (on the algorithmic level) that might be similar to the computations our computers are doing. And then in another post or two I'll wrap up with an exploration of the other possibility: that people are adopting heuristics that approximate our models, at least under some conditions. All this, of course, is only true to the extent that the models are good matches to human behavior -- which is probably variable given the domain and the situation.
March 3, 2006
Jim's entry about the use of the word "parameter" got me thinking about a related issue I wrestle with all the time: communicating the importance and value of computational models in psychology to traditional psychologists.
There is a certain subset of the cognitive science community that is interested in computational/statistical models of human reasoning, an interest dating back to the 1970s and '80s, first with Strong AI and then with the rise of connectionism. Nowadays, I think more people are becoming interested in Bayesian models, though admittedly it's hard to tell how big this trend is because of sampling bias: since it's what my lab does, I don't have a clear sense of how many people don't know or care about this approach -- they are the very people I'm least apt to converse with.
Nevertheless, I think I can say with some confidence that a not inconsequential number of psychologists just don't see the value of computational models. Though I think some of that is for good reasons (some of which I share), I'm ever more convinced that a lot of this is because we, the computational and quantitative people, do such a lousy job of explaining why they are important, in terms that a non-computationally trained person can understand.
Part of it is word choice: as Jim says, we have absorbed jargon to the point that it is second-nature to us, and we don't even realize how jargony it might be ("parameters", "model", "Bayesian", "process", "generative", "frequentist", "likelihood" - and I've deliberately tried to put on this list some of the least-jargony terms we habitually use). But I think it also relates to deeper levels of conceptualization -- we have trained ourselves to the point that when something is described mathematically, we can access the intuition fairly easily, and thus forget that the mathematical description doesn't have the same effect for other people. I was recently at a talk geared toward traditional psychologists in which the speaker described what a model was doing in terms of coin flipping and mutation processes. It was perfectly accurate and certainly less vague than the corresponding intuition, but I think he lost a few people right there: since they couldn't capture the intuition rapidly enough, the model felt both arbitrary and too complicated to them. I don't think it's a coincidence that arbitrariness and "too much" complexity are two of the most common criticisms leveled at computational modelers by non-modelers.
The point? Though we shouldn't sacrifice accuracy in order to make vague, handwavy statements, it's key to accompany accurate statistical descriptions with the corresponding intuitions that they capture. It's a skill that takes practice to develop (learning this is one of the reasons I blog, in fact), and it requires being constantly aware of what might be specialized knowledge that your listener might not know. But it's absolutely vital if we want quantitative approaches to be taken seriously by more non-quantitative folks.
February 21, 2006
A recent study by Shane Frederick at MIT, published in the Journal of Economic Perspectives [pdf], has gotten press attention in the last few weeks for its claim that performance on a simple math test predicted risk-taking behavior. I'm a bit skeptical about the conclusions Frederick draws (and I'll explain why), but regardless, the study itself is quite interesting.
The study begins by asking subjects to take the Cognitive Reflection Test (CRT), which consists of three simple math questions:
1. A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?
2. If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?
3. In a lake, there is a patch of lily pads. Every day the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half the lake?
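Each question has a tempting but wrong intuitive answer; for reference, here is the correct reasoning worked out (my own annotation, though the answers themselves are standard).

```python
# 1. Bat and ball: ball + (ball + 1.00) = 1.10  =>  ball = 0.05, not 0.10
ball = round((1.10 - 1.00) / 2, 2)

# 2. Widgets: one machine makes one widget in 5 minutes, so 100 machines
#    make 100 widgets in those same 5 minutes, not in 100 minutes
minutes = 5

# 3. Lily pads: the patch doubles daily, so it covers half the lake the
#    day before it covers all of it -- day 47, not day 24
days = 48 - 1

print(ball, minutes, days)  # prints: 0.05 5 47
```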
Then subjects are asked two other types of questions:
(a) Would you rather have $3400 now or $3800 in two weeks?
(b) Would you rather have a guaranteed $1000, or a 90% chance of $5000?
Questions of type (a) provide some measure of your "time preference" - how patient you are when it comes to money matters - while questions of type (b) provide a measure of your degree of risk-taking; people who prefer the more certain but lower-expected-value item are more risk-averse than those who choose the opposite. Interestingly, Frederick found that subjects who scored well on the CRT also tended to be more "patient" on questions like (a) and more risk-taking on questions like (b). Much of the discussion in the paper is centered around why and to what extent cognitive abilities, as measured by the CRT, would have an impact on these two things.
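The risk-taking measure in (b) can be illustrated with a toy expected-utility calculation. The exponential utility function and the "risk tolerance" parameter r below are my own illustrative choices, not anything from Frederick's paper.

```python
import math

def prefers_sure_thing(r, sure=1000, prize=5000, p=0.9):
    """Does an agent with utility u(x) = 1 - exp(-x/r) take the certain
    amount over a p-chance at the prize?  Smaller r = more risk-averse."""
    u = lambda x: 1 - math.exp(-x / r)
    return u(sure) > p * u(prize)

# The gamble's expected value ($4500) dwarfs the sure $1000, so a mildly
# risk-averse chooser takes the gamble...
print(prefers_sure_thing(r=2000))  # prints: False (takes the 90% shot at $5000)
# ...but a sufficiently risk-averse one locks in the certain $1000.
print(prefers_sure_thing(r=300))   # prints: True
```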
It's fascinating work, except it seems to me that there's an alternative explanation for these results that has little to do with cognitive abilities. One strand of such an explanation (which Frederick mentions himself) is that, in addition to mathematical skills, the test measures the ability to overcome impulsive answers. Each of the questions has an "obvious" answer (10 cents, 100 minutes, 24 days) that is incorrect; high-scorers thus need to be able to inhibit the wrong answer as well as calculate the correct one; they tend to be more patient and methodical as well as better at math. It's easy to see how these abilities, not cognitive ability per se, might account for the differential performance on questions like (a).
The deeper problem is that the study failed to control for socioeconomic differences between subjects. The high-performing subjects were taken from universities like Harvard, MIT, and Princeton; the lower-performing subjects were taken from universities like University of Michigan and Bowling Green. People at the latter universities are likely to be in a far more precarious financial situation than those at the former. Why does this matter? One of the principal findings of Kahneman & Tversky's prospect theory is that as you have less money, you become more risk averse. Thus it seems entirely possible to me that the difference between subjects was because of differences in their financial situation, and had nothing to do with cognitive abilities at all (except possibly indirectly, as mediated through socioeconomic factors). I'd be interested in seeing if this finding still holds up even when SES is controlled for.
February 9, 2006
Since Martin Luther King Day was somewhat recent (okay - a month ago; still...), I thought I'd blog about human statistical learning and its possible implications for racism. Some of this is a bit speculative (and I'm no sociologist) but it's a fascinating exploration of how cutting-edge research in cognitive science has implications for deep real-world problems.
In today's society racism is rarely so blatant as it was 50 or 100 years ago. More often it refers to subtle but ubiquitous inconsistencies in how minorities are treated (or, sometimes, perceive themselves to be treated). Different situations are probably different mixtures of the two. Racism might often be small effects that the person doing the treating might not even notice -- down to slight differences in body language and tone of voice -- that could nevertheless have large impacts on the outcome of a job interview or the likelihood of being suspected of a crime.
One of the things that studying statistical learning teaches us is that almost everyone has subtly different, usually more negative, attitudes toward minorities than toward whites - even minorities themselves. Don't believe me? Check out the online Implicit Association Test, which measures the degree of subconscious connection you make between different races and concepts. The premise is simple and has been validated over and over in psychology: if two concepts are strongly linked in our minds, we are faster to say so than if they are only weakly associated. For instance, you're faster to say that "nurse" and "female" go together than "nurse" and "male", even though men can be nurses too. I'm oversimplifying here, but in the IAT you are essentially asked to link pictures of people of different races with descriptors like good/bad or dangerous/nice. Horrifyingly, even knowing what the experiment measures, even taking it over and over again, most people are faster to link white faces with "good" words and black faces with "bad" ones.
Malcolm Gladwell's book "Blink" has an excellent chapter describing this, and it's worth quoting one of his paragraphs in detail: "The disturbing thing about this test is that it shows that our unconscious attitudes may be utterly incompatible with our stated values. As it turns out, for example, of the fifty thousand African Americans who have taken the Race IAT so far, about half of them, like me, have stronger associations with whites than with blacks. How could we not? We live in North America, where we are surrounded every day by cultural messages linking white with good." (85)
I think this is yet another example of where learning mechanisms that are usually helpful -- it makes sense to be sensitive to the statistical correlations in the environment, after all -- can go devastatingly awry in today's world. Because the media and gossip and stories are a very skewed reflection of "the real world", our perceptions formed by those sources (our culture, in other words) are also skewed.
What can we do? Two things, I think. #1: Constant vigilance! Our associations may be unconscious, but our actions aren't. If we know about our unconscious associations, we're more likely to watch ourselves vigilantly to make sure they don't come out in our actions; as enough people do that, slowly, the stereotypes and associations themselves may change. #2: This is the speculative part, but it may be possible to actually change our unconscious associations: not consciously or through sheer willpower, but by changing the input our brain receives. The best way to do that, I would guess, is to get to know people of the minority group in question. Suddenly your brain is receiving lots of very salient information about specific individuals with wholly different associations than the stereotypes; enough of this and the stereotype itself might change, or at least grow weaker. I would love to see this tested -- or, if someone has done so, to know what the results were.
February 2, 2006
Bayesian vs. frequentist - it's an old debate. The Bayesian approach views probabilities as degrees of belief in a proposition, while the frequentist says that a probability refers to a set of events, i.e., is derived from observed or imaginary frequency distributions. In order to avoid the well-trod ground comparing these two approaches in pure statistics, I'll consider instead how the debate changes when applied to cognitive science.
One of the main arguments made against using Bayesian probability in statistics is that it's ill-grounded and subjective. If probability is just "degree of belief", then even a question like "what is the probability this coin comes up heads" can have a different answer depending on who is asking the question and what their prior beliefs about coins are. Suddenly there is no "objective standard", and that's nerve-wracking. For this reason, most statistical tests in most disciplines rely on frequentist notions like confidence intervals rather than Bayesian notions like the relative probability of two hypotheses. However, there are drawbacks to doing this, even in non-cogsci areas. To begin with, many things we want to express statistical knowledge about don't make sense in terms of reference sets, e.g., the probability that it will rain tomorrow (since it will only rain once). For another, some argue that the seeming objectivity of the frequentist approach is illusory, since we can't ever be sure that our sampling process hasn't biased or distorted the data. At least with a Bayesian approach, we can deal with that explicitly and try to correct for it.
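To make the "degree of belief" idea concrete, here's a minimal sketch in Python (my own; the Beta prior and all the numbers are illustrative assumptions, not anyone's real analysis): two observers with different priors disagree about the same coin after a few flips, but converge once the data piles up.

```python
# Two observers with different Beta priors update on the same coin flips.
# With little data their priors dominate; with lots of data they converge.

def posterior_mean(prior_heads, prior_tails, heads, tails):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior."""
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)

# A skeptic who suspects a heads bias vs. an observer with a uniform prior:
skeptic = posterior_mean(10, 2, heads=3, tails=3)   # strong prior toward heads
neutral = posterior_mean(1, 1, heads=3, tails=3)    # uniform prior

# With only six flips, the priors dominate and the two disagree...
assert abs(neutral - 0.5) < 1e-9
assert skeptic > 0.7
# ...but with enough data, both converge toward the observed frequency.
lots = posterior_mean(10, 2, heads=500, tails=500)
assert abs(lots - 0.5) < 0.01
```

The skeptic's prior dominating at first is exactly the "subjectivity" the frequentist objects to; the convergence with more data is the standard Bayesian reply.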
But it's in trying to model the mind that we can really see the power of Bayesian probability. Unlike for other social scientists, this sort of subjectivity isn't a problem for us: cognitive scientists are interested in degree of belief. In a sense, we study subjectivity. In making models of human reasoning, then, an approach that incorporates subjectivity is a benefit, not a problem.
Furthermore, unlike many statistical models, the brain generally doesn't just want to capture the statistical properties of the world correctly. Its main goal is generalization -- prediction, not just estimation -- and one of the things people excel at is generalizing from very little data. Incorporating the Bayesian notion of prior beliefs, which constrain generalization in ways that go beyond the actual data, allows us to study this formally in ways we couldn't if we stuck to frequentist notions of probability.
January 27, 2006
It's a common truism, familiar to most people by now thanks to advertising and politics, that repeating things makes them more believable -- regardless of whether they're true or not. In fact, even if they know at the time that the information is false, people will still be more likely to believe something the more they hear it. This phenomenon, sometimes called the reiteration effect, is well-studied and well-documented. Nevertheless, from a statistical learning point of view, it is extremely counter-intuitive: shouldn't halfway decent learners learn to discount information they know is false, not "learn" from it?
One of the explanations for the repetition effect is related to source confusion -- the fact that, after a long enough delay, people are generally much better at remembering what they learned rather than where they learned it. Since a great deal of knowing that something is false means knowing that its source is unreliable, forgetting the source often means forgetting that it's not true.
Repetition increases the impact of source confusion for two reasons. First, the more often you hear something, the more sources there are to remember, and the more likely you are to forget at least some of them. I've had this experience myself - trying to judge the truth of some tidbit of information, actually remembering that I first read it somewhere that I didn't trust, knowing that I've read it somewhere else (but not remembering the details) and concluding that since there was some chance that this somewhere else was trustworthy, it might be true.
The second reason is that the more sources there are, the more unlikely it seems that all of them would believe something false. This strategy makes some evolutionary and statistical sense: hearing (or experiencing) something from two independent sources (or in two independent events) makes it more reasonable to generalize from it than if you had experienced it only once. This is the logic behind large sample sizes: as long as the samples are independent, more samples means more evidence. Unfortunately, in the mass media today few sources of information are independent. Most media outlets get their stories from the AP wire services, and most people get their information from the same media outlets, so even if you hear item X in 20 completely different contexts, chances are that all 20 stem from the same one or two original reports. If you've ever been the source of national press yourself, you will have experienced this firsthand.
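To see how badly treating correlated reports as independent can inflate confidence, here's a hedged back-of-the-envelope sketch (mine; the likelihood ratio of 2 and the prior of 0.1 are invented purely for illustration):

```python
# Suppose each *independent* source is twice as likely to report a claim
# if it's true (likelihood ratio 2). Then by Bayes' rule in odds form:
# posterior odds = prior odds * likelihood_ratio^(number of independent sources).

def posterior_prob(prior, likelihood_ratio, n_independent_sources):
    odds = (prior / (1 - prior)) * likelihood_ratio ** n_independent_sources
    return odds / (1 + odds)

prior = 0.1  # start out skeptical
naive = posterior_prob(prior, 2.0, 20)   # 20 reports naively treated as independent
actual = posterior_prob(prior, 2.0, 2)   # really just 2 original wire reports

assert naive > 0.999   # near-certainty from "20 sources"
assert actual < 0.5    # still more likely false than true
```

Twenty correlated retellings feel like twenty pieces of evidence, but on this toy model they're worth only two.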
I tried to think of a way to end this entry on a positive note, but I'm having a hard time. The reiteration effect is a largely unconscious byproduct of how our implicit statistical learning mechanisms operate, so even being aware of it is only somewhat useful: we can consciously remind ourselves not to trust things simply because we've heard them often, but so much of this happens unconsciously that it's hard to fight. Education about the effect is therefore worthwhile, but better still would be solutions encouraging a more heterogeneous media with more truly independent sources.
January 12, 2006
Two of the most enduring debates in cognitive science can be summarized baldly as the "rules vs statistics" debate and the "language: innate or not?" debate. (I think these simple dichotomies are not only too simple to capture the state of the field and current thought, but also actively harmful in some ways; nevertheless, they are still a good first approximation for blogging purposes). One of the talks at the BUCLD conference, by Gary Marcus at NYU, leapt squarely into both debates by examining simple rule-learning in seven-month-old babies and arguing that the subjects could only do this type of learning when the input was linguistic.
Marcus built on some earlier studies of his (e.g., pdf here) in which he familiarized seven-month-old infants with a list of nonsense "words" like latala or gofifi. Importantly, all of the words heard by any one infant had the same structure, such as A-B-B ("gofifi") or A-B-A ("latala"). The infants heard two minutes of these types of words, and then were presented with a new set of words using different syllables, half of which followed the same pattern as before and half of which followed a new pattern. Marcus found that infants listened longer and paid more attention to the words with the unfamiliar structure, which they could have done only if they had successfully abstracted that structure (not just the statistical relationships between particular syllables). Thus, for instance, an infant who heard many examples of words like "gofifi" and "bupapa" would be more surprised to hear "wofewo" than "wofefe"; they have abstracted the underlying rule. (The question of how and to what extent they abstract the rule is much debated, and I'm not going to address it here).
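To illustrate what "abstracting the structure" means (this is my own toy illustration, not Marcus's analysis), here are a few lines of Python that map a word to its abstract pattern, independent of the particular syllables involved:

```python
# Classify three-syllable words by structural pattern (ABB, ABA, ...),
# labeling each distinct syllable A, B, C... in order of first appearance.

def pattern(word_syllables):
    """Map a sequence of syllables to its abstract structure, e.g. ABB or ABA."""
    labels = {}
    out = []
    for s in word_syllables:
        if s not in labels:
            labels[s] = chr(ord('A') + len(labels))
        out.append(labels[s])
    return ''.join(out)

# "gofifi" and "bupapa" share a structure despite having no syllables in common:
assert pattern(["go", "fi", "fi"]) == "ABB"
assert pattern(["bu", "pa", "pa"]) == "ABB"
assert pattern(["la", "ta", "la"]) == "ABA"
# A rule-learner familiarized with ABB words should find "wofewo" (ABA) surprising:
assert pattern(["wo", "fe", "wo"]) != "ABB"
```

A learner tracking only transitions between particular syllables could never treat "gofifi" and "bupapa" as the same; the structural mapping is what makes them equivalent.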
The BUCLD talk focused instead on another question: did it matter at all that the stimuli the infants heard were linguistic rather than, say, tone sequences? To answer this question, Marcus ran the same experiment with sequences of tones varying in pitch and timbre in place of syllables (e.g. "blatt blatt honk" instead of "gofifi"). His finding? Infants did not respond differently in testing to the structures they had heard - that is, they didn't seem to be abstracting the underlying rule this time.
There is an unfortunately large confounding factor, however: infants have a great deal more practice and exposure to language than they do to random tones. Perhaps the failure was instead one of discrimination: they didn't actually perceive the different tones to be that different, and therefore of course could not abstract the rule. To test this, Marcus trained infants on syllables but tested them on tones. His reasoning was that if it was a complete failure of discrimination, they shouldn't be able to perceive the pattern in tones presented at test any more than in tones presented during training. To his surprise, they did respond differently to the tones in testing, as long as they were trained on syllables. His conclusion? Not only can infants do cross-modal rule transfer, but they can only learn rules when the stimuli are presented linguistically, though they can then apply those rules to other domains. Marcus argued that this was probably due to an innate tendency tied to language, not a learnt effect.
It's fascinating work, though rather counterintuitive. And, quite honestly, I remain unconvinced (at least about the innate tendency part). Research on analogical mapping has shown that people who have a hard time perceiving underlying structure in one domain can nevertheless succeed in perceiving it if they learn about the same structure in another and map it over by analogy. (This is not news to good teachers!) It's entirely possible - and indeed a much simpler hypothesis - that babies trained on tones lack the experience they have with language and hence find it more difficult to pick up on the differences between the tones and therefore the structural rule they embody. But when first trained on language - which they do have plenty of practice hearing - they can learn the structure more easily; and then when hearing the tones, they know "what to listen for" and can thus pick out the structure there, too. It's still rule learning, and even still biased to be easier for linguistically presented things; but that bias is due to practice rather than some innate tendency.
January 3, 2006
An issue inherent in studying language acquisition is the sheer difficulty of acquiring enough accurate naturalistic data. In particular, since many questions hinge on what language input kids hear - and what language mistakes and capabilities kids show - it's important to have an accurate way of measuring both of these things. Unfortunately, short of following a child around all day with a tape recorder (which people have done!), it's hard to get enough data to have an accurate record of low-frequency items and productions; it's also hard to know what would be enough. Typically, researchers will record a child for a few hours at a time for a few weeks and then hope that this represents a good "sample" of their linguistic knowledge.
A paper by Caroline Rowland at the University of Liverpool, presented at the BUCLD conference in early November, attempts to assess the reliability of this sort of naturalistic data by comparing it to diary data. Diary data is obtained by having the caregiver write down every single utterance produced by the child over a period of time; as you can imagine, this is difficult to persuade someone to do! There are clear drawbacks to diary data, of course, not least of which is that as the child speaks more and more it becomes less and less accurate. But because it has a much better likelihood of incorporating low-frequency utterances, it provides a good baseline comparison in that respect to naturalistic, tape-recorded data.
What Rowland and her coauthor found is perfectly in line with what is known about statistical sampling. As the subsets of tape-recorded conversations got smaller, estimates of low-frequency terms became increasingly unreliable, and single segments of less than three hours were nearly completely useless (as they said in the talk, they were "rubbish." Oh how I love British English!). It is also more accurate to use, say, four one-hour chunks from different conversations rather than one four-hour segment, as the former avoids the "burstiness effects" that come from particular conversations and settings predisposing speakers to certain topics.
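As a rough illustration of the sampling point (my own simulation with invented numbers, not Rowland's data): suppose a child produces some construction only once per 500 utterances on average. A short recording will routinely contain no instances of it at all.

```python
import random

# Monte Carlo estimate of how often a sample of n utterances completely
# misses an item produced with probability `rate` per utterance.
# Utterance counts per hour are invented for illustration.

random.seed(0)

def prob_missed(rate, n_utterances, trials=2000):
    """Estimate the probability that a sample contains zero low-frequency items."""
    missed = 0
    for _ in range(trials):
        if not any(random.random() < rate for _ in range(n_utterances)):
            missed += 1
    return missed / trials

small = prob_missed(1 / 500, 300)    # roughly one hour of recording
large = prob_missed(1 / 500, 3000)   # roughly ten hours of recording

# The small sample misses the rare item more than 40% of the time;
# the large sample almost never does.
assert small > 0.4
assert large < 0.01
```

The closed form is simply (1 - rate)^n, but the simulation makes vivid how unreliable a single short recording is for anything rare.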
Though this result isn't a surprise from a statistical sampling point of view, it is nice for the field to have some estimates of how little is "too little" (though of course how little depends somewhat on what you are looking for). And the paper highlights important methodological issues for those of us who can't trail after small children with our notebooks 24 hours a day.
December 20, 2005
The annual Boston University Conference on Language Development (BUCLD), this year held on November 4-6th, consistently offers a glimpse into the state of the art in language development. The highlight this year for me was a lunchtime symposium titled "Statistical learning in language development: what is it, what is its potential, and what are its limitations?" It featured a dialogue between three of the biggest names in this area: Jeff Elman at UCSD, who studies connectionist models of many aspects of language development; Mark Johnson at Brown, a computational linguist who applies insights from machine learning and Bayesian reasoning to study human language understanding; and Lou-Ann Gerken at the University of Arizona, who studies infants' sensitivity to statistical aspects of linguistic structure.
I was most interested in the dialogue between Elman and Johnson. Elman focused on a number of phenomena in language acquisition that connectionist models capture. One of them is "the importance of starting small," the argument that beginning with limited capacities of memory and perception may actually help in learning ultimately very complex things, because it "forces" the learning mechanism to notice only the broad, consistent generalizations first and not to be led astray by local ambiguities and complications too soon. Johnson seconded that argument, and pointed out that models that learn using Expectation Maximization embody it just as well as neural networks do. Another key insight of Johnson's was that statistical models implicitly extract more information from input than purely logical or rule-based models. This is because statistical models generally assume some underlying distributional form, so when you don't see data predicted by that distribution, the absence itself is a valuable form of negative evidence. Because there are a number of areas in which people appear to receive little explicit negative evidence, they must either exploit statistical assumptions of this kind or be innately biased toward the "right" answer.
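Johnson's point about implicit negative evidence can be sketched with the Bayesian "size principle" (my example, not from the talk): a hypothesis that permits fewer outcomes assigns each observed outcome higher probability, so never seeing the extra outcomes counts against the broader hypothesis, even though nobody explicitly said "that's wrong."

```python
# Each observation is assumed uniform over the outcomes a hypothesis allows,
# so a hypothesis allowing fewer outcomes gives each one more probability.

def likelihood(hypothesis_size, n_observations):
    """Likelihood of the data under a hypothesis allowing `hypothesis_size` outcomes."""
    return (1 / hypothesis_size) ** n_observations

narrow = likelihood(10, n_observations=20)    # allows 10 outcomes
broad = likelihood(100, n_observations=20)    # allows 100 outcomes

# After 20 observations consistent with both hypotheses, the narrow one is
# overwhelmingly favored (assuming equal priors) -- a factor of 10^20:
assert narrow / broad > 1e19
```

No negative example ever appears in the data; the broad hypothesis loses purely because it kept predicting things that never showed up.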
The most valuable aspect of the symposium, however, was the clarification of many of the issues, in statistical learning and in cognitive science in general, that statistical learning can help to answer. Some of these important questions: in any given problem, what are the units of generalization that human learners (and hence our models) should and do use? [e.g., sentences, bigram frequencies, words, part-of-speech frequencies, phoneme transitions, etc.] What is the range of computations that the human brain is capable of (possibly changing at different stages of development)? What statistical and computational models capture these? What is the nature of the input (the data) that human learners see; to what extent does this depend on factors external to them (the world) and to what extent is it due to internal factors (attentional biases, mental capacities, etc.)?
If we can answer these questions, we will have answered a great many of the difficult questions in cognitive science. If we can't, I'd be very surprised if we make much real progress on them.
December 6, 2005
There are two ways of thinking about almost anything. Consider family and kinship. On the one hand, we all know certain rules about how people can be related to each other -- that your father's brother is your uncle, that your mother cannot be younger than you. But you can also do probabilistic reasoning about families -- for instance, that grandfathers tend to have white hair, that it is extremely unlikely (but possible) for your mother to also be your aunt, or that people are usually younger than their uncles (but not always). These aren't logical inferences; they are statistical generalizations based on the attributes of families you have experienced in the world.
Though the statistics-rule dichotomy still persists in a diluted form, today many cognitive scientists are not only recognizing that people can do both types of reasoning much of the time but also beginning to develop behavioral methods and statistical and computational models that can clarify exactly how they do it and what that means. The BLOG inference engine, whose prototype was released very recently by Stuart Russell's computer science group at Berkeley, is one of the more promising computational developments for this goal.
BLOG (which stands for Bayesian LOGic, alas, not our kind of blog!) is a logical language for generating objects and structures, then doing probabilistic inference over those structures. So for instance, you could specify objects, such as people, with rules for how those objects could be generated (perhaps a new person (a child) is generated with some probability from two opposite-gender parents), as well as how attributes of these objects vary. For example, you could specify that certain attributes of people depend probabilistically on family structure - if you have a parent with that attribute, you're more likely to have that attribute yourself. Other attributes might also be probabilistically distributed, but not based on family structure: we know that 50% of people are male and 50% are female regardless of the nature of their parents.
The power of BLOG is that it allows you both to specify quite complex generative models and interesting logical rules and to do probabilistic inference given the rules you've set up. Using BLOG, for instance, you could ask things such as the following. If I find a person with Bill's eyes, what is the probability that this person is Bill's child? Is it possible for Bill's son to also be his daughter?
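To give a flavor of that kind of query (not actual BLOG syntax -- just a hedged Python sketch with probabilities invented for illustration), here is a tiny generative model and the "Bill's eyes" question, answered by forward sampling:

```python
import random

# A toy generative model: people either are or aren't Bill's children, and
# an eye trait depends probabilistically on that. We answer the query
# P(Bill's child | has Bill's eyes) by forward sampling. All numbers invented.

random.seed(0)

def sample_person(is_bills_child):
    # Bill's children inherit his eye trait with prob 0.5;
    # unrelated people have it at a base rate of 0.05.
    p_eyes = 0.5 if is_bills_child else 0.05
    return random.random() < p_eyes

def prob_child_given_eyes(p_child=0.01, trials=200000):
    child_and_eyes = eyes = 0
    for _ in range(trials):
        is_child = random.random() < p_child
        if sample_person(is_child):
            eyes += 1
            child_and_eyes += is_child
    return child_and_eyes / eyes

estimate = prob_child_given_eyes()
# Analytically: 0.01*0.5 / (0.01*0.5 + 0.99*0.05), about 0.09.
assert 0.05 < estimate < 0.15
```

BLOG itself does this kind of inference much more cleverly than brute-force sampling, but the shape of the question -- logical structure generated probabilistically, then queried -- is the same.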
Though a few things are unexpectedly difficult in BLOG - reasoning about symmetric relations like "friend," for instance - I think it promises to be a tremendously valuable tool for anyone interested in how people do probabilistic reasoning over structures/rules, or in doing it themselves.
November 21, 2005
I'm fascinated by the ongoing evolution controversy in America. Part of this is because as a scientist I realize how important it is to defend rational, scientific thinking -- meaning reliance on evidence, reasoning based on logic rather than emotion, and creating falsifiable hypotheses. I also recognize how deeply important it is that our students are not crippled educationally by not being taught how to think this way.
But from the cognitive science perspective, it's also interesting to try to understand why evolution is so unbelievable and creationism so logical and reasonable to many fairly intelligent laypeople. (I doubt it's just ignorance or mendacity!) What cognitive heuristics and ways of thinking cause this widespread misunderstanding?
There are probably a number of things. Two I'm not going to talk about include emotional reasons for wanting not to believe in evolution, as well as the tendency for people who don't know much about either side of an issue to think the fair thing to do is "split the middle" and "teach both sides." The thing I do want to talk about today -- the one that's relevant to a statistical social science blog -- concerns people's notions of simplicity and complexity. My hypothesis is that laypeople and scientists apply Occam's Razor to the question of evolution in very different ways, which is part of what leads to such divergent views.
[Caveat: this is speculation; I don't study this myself. Second caveat: I am neither saying that it's scientifically okay to believe in creationism, nor that people who do are stupid; this post is about explaining, not justifying, the cognitive heuristics we use that make evolution so difficult to intuitively grasp].
Occam's Razor is a reasoning heuristic that says, roughly, that if two hypotheses both explain the data fairly well, the simpler is likely to be better. Simpler hypotheses, generally formalized as those with fewer free parameters, don't "overfit" the data and thus generalize better to new data. Simpler models are also better because they make strong predictions. Such models are therefore falsifiable (one can easily find something they don't predict, and see if it is true) and, in probabilistic terms, put a lot of the "probability mass" or "likelihood" on a few specific phenomena. Thus, when such a specific phenomenon does occur, a simple model explains it better than a more complex theory, which spreads the probability mass over more possibilities. In other words, a model with many free parameters -- a complicated one -- will be compatible with many different types of data if you just tweak the parameters. This is bad because it then doesn't "explain" much of anything, since anything is consistent with it.
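A concrete, hedged illustration of the probability-mass point (my own example): compare a "simple" model that says a coin is exactly fair with a "complex" one that leaves the bias as a free parameter, uniform over [0,1].

```python
from math import comb

# Model S (simple): the coin is exactly fair, p = 0.5. Concentrated predictions.
# Model C (complex): bias p unknown, uniform over [0,1]. Spread-out predictions.
# Under C, integrating C(n,k) p^k (1-p)^(n-k) over p in [0,1] gives 1/(n+1).

def likelihood_simple(k, n):
    """Probability of k heads in n flips under a fair coin."""
    return comb(n, k) * 0.5 ** n

def likelihood_complex(k, n):
    """Marginal probability of k heads in n flips with a uniform prior on bias."""
    return 1 / (n + 1)  # same for every k: the mass is spread evenly

# Unremarkable data (10 heads in 20 flips) favors the simple model:
assert likelihood_simple(10, 20) > likelihood_complex(10, 20)
# Extreme data (19 heads in 20 flips) favors the flexible model:
assert likelihood_simple(19, 20) < likelihood_complex(19, 20)
```

The concentrated model wins when the data land where it bet its probability mass; the flexible model, compatible with anything, explains typical data worse -- which is the Bayesian version of Occam's Razor.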
When it comes to evolution and creationism, I think that scientists and laypeople often make exactly the opposite judgments about which hypothesis is simple and which is complex; therefore their invocation of Occam's Razor results in opposite conclusions. For the scientist, the "God" hypothesis (um, I mean, "Intelligent Designer") is almost the prototypical example of a hypothesis so complex it's worthless scientifically. You can literally explain anything by invoking God (and if you can't, you just say "God works in mysterious ways" and feel like you've explained it), and thus God scientifically explains nothing. [I feel constrained to point out that God is perfectly fine in a religious or spiritual context where you're not seeking to explain the world scientifically!] This is why ID is not accepted by scientists: not because it's wrong, but because it's not falsifiable -- the hypothesis of an Intelligent Designer is consistent with any data whatsoever, and thus as theories go ... well, it isn't one, really.
But if you look at "simplicity" in terms of something like number of free parameters, you can see why a naive view would favor ID over evolution. On a superficial inspection, the ID hypothesis seems to have only one free parameter (God/ID exists, or not); this is the essence of a simple hypothesis. By contrast, evolution is complicated - though the basic idea of natural selection is fairly straightforward, even that is more complicated than a binary choice, and there are many interesting and complicated phenomena arising in the application of basic evolutionary theory (sympatric vs. allopatric speciation, the role of migration and bottlenecks, asexual vs. sexual reproduction, different mating styles, recessive genes, junk DNA, environmental and hormonal effects on genes, accumulated effects over time, group selection, canalization, etc.). The layperson either vaguely knows about all of this, or else tries to imagine how you could get something as complicated as a human out of "random accidents" and concludes that you could only do so if the world were just one specific way (i.e. if you set many free parameters in exactly one way). Thus they conclude that evolution is an exceedingly complex hypothesis, and that by Occam's Razor one should favor the "simpler" ID hypothesis. And then when they hear that scientists not only believe this apparently unbelievable thing, but refuse to consider ID as a scientific alternative, they logically conclude that it's all just competing dogma and you might as well teach both.
This is a logical train of reasoning on the layperson's part. (Doesn't mean it's true, but it's logical given what they know). The reason it doesn't work is twofold: (a) a misunderstanding of evolution as "randomness"; seeing it as a search over the space of possible organisms is both more accurate and more illuminating, I think; and (b) misunderstanding the "God" hypothesis as the simple one.
If I'm right that these are among the fundamental errors the layperson makes in reasoning about evolution, then the best way to reach the non-mendacious, intelligent creationist is by pointing out these flaws. I don't know if anybody has studied whether this hunch is correct, but it sure would be fascinating to find out what sorts of arguments work best -- not just because it would help us argue effectively on a national level, but also because it would reveal interesting things about how people tend to use Occam's Razor in real-life problems.