28 May 2010
David Brooks has a column in today's New York Times about our difficulties assessing risk, in light of the oil leak in the Gulf of Mexico. One tendency he highlights is our excessive faith in devices aimed at minimizing risk. Although his general point is probably correct, I think he makes a statistical error in the example he uses to illustrate this tendency. Brooks seems to confuse numbers with rates.
He writes: "More pedestrians die in crosswalks than when jay-walking. That's because they have a false sense of security in crosswalks and are less likely to look both ways." I think that more pedestrians die in crosswalks than when jaywalking because more pedestrians cross the street in crosswalks than anywhere else. If the rate of dying is higher for pedestrians in crosswalks than when jaywalking, we might attribute the difference to our overconfidence in safety devices. However, the rate of dying need not be higher in crosswalks than elsewhere for more pedestrians to die in crosswalks. As long as many more people cross in crosswalks, it is likely that more will die in crosswalks. Brooks' point about risk assessment would have been stronger if he had considered not only the numerator (the number of deaths) but also the denominator (the number of people at risk of death) of the rate of pedestrian deaths that he's interested in.
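To make the numerator/denominator distinction concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they are not real pedestrian statistics), and the function name is hypothetical. It shows that more deaths can occur in crosswalks even when the rate of dying per crossing is lower there.

```python
# Hypothetical counts, invented for illustration only (not real data).
crossings = {"crosswalk": 9_000_000, "jaywalking": 1_000_000}  # denominator: people at risk
deaths = {"crosswalk": 90, "jaywalking": 20}                   # numerator: number of deaths


def deaths_per_million_crossings(deaths, crossings):
    """Convert raw death counts into rates per one million crossings."""
    return {place: deaths[place] * 1_000_000 / crossings[place] for place in deaths}


rates = deaths_per_million_crossings(deaths, crossings)

for place in deaths:
    print(f"{place}: {deaths[place]} deaths, {rates[place]:.1f} per million crossings")
```

With these made-up numbers, crosswalks account for more deaths in total, yet the death rate per crossing is half that of jaywalking. The count alone therefore says nothing about which way of crossing is riskier.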
19 May 2010
Measuring the extent to which our peers influence our behavior is hard for many reasons: one of the most basic is the difficulty of measuring who is a peer.
Manski catalyzed an econometric literature on how to identify three types of peer effects: "(a) endogenous effects, wherein the propensity of an individual to behave in some way varies with the behavior of the group; (b) exogenous (contextual) effects, wherein the propensity of an individual to behave in some way varies with the exogenous characteristics of the group and (c) correlated effects, wherein individuals in the same group tend to behave similarly because they have similar individual characteristics or face similar institutional environments." Bramoullé, Djebbari, and Fortin have a nice paper in which they show that when there are no unobserved correlated effects, you can use directed social network data to identify endogenous and exogenous effects (as long as the population is not partitioned into groups in which everyone in a certain group is influenced by everyone else in that group and no one outside that group). The intuition is that we can instrument for our friend's actions with the actions of our friend's friends who are not our friends. Identification thus relies on the presence of intransitive triads. Personally I think this is a really neat idea. Certainly it seems reasonable (and is empirically regular) to observe such triads, in which A is friends with B and B is friends with C but A is not friends with C. However, we also know that transitive triads occur frequently.
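As a rough illustration of what an intransitive triad looks like in directed network data, the sketch below lists every triple in which A names B, B names C, and A does not name C. The function name and the toy edge list are hypothetical, and this only finds the triads; it is not the estimator from the paper.

```python
def intransitive_triads(edges):
    """Return sorted (a, b, c) triples with a -> b and b -> c but no a -> c link."""
    edge_set = set(edges)

    # Map each person to the set of people they name as friends.
    friends = {}
    for a, b in edge_set:
        friends.setdefault(a, set()).add(b)

    triads = []
    for a, named_by_a in friends.items():
        for b in named_by_a:
            for c in friends.get(b, ()):
                # c is a friend of a's friend; keep it only if a has no direct link to c.
                if c != a and (a, c) not in edge_set:
                    triads.append((a, b, c))
    return sorted(triads)


# Toy directed network (hypothetical): A names B and D; B names C and D.
toy_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "D")]
print(intransitive_triads(toy_edges))
```

In the toy network, A-B-C is intransitive (A does not name C), while A-B-D is transitive because A also names D. Only the first kind of triple would supply an instrument under the intuition described above, which is why unobserved A-to-C links matter.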
In all studies of social network effects, we rely on the network being accurately measured. In reality, there is a lot of room for measurement error in network data (Marsden has an overview). If you actually are friends with your friend's friends even though in the observed network data there are no direct links to indicate these friendships, then the identification strategy suggested by Bramoullé, Djebbari, and Fortin is problematic. In observed network data, we usually don't know for sure whether the absence of a link indicates that no relationship exists between two people or that a relationship exists but we did not observe it. However, it is possible to simulate network data and then ask, given that we observe an indirect link between two people, how likely is it that they have a direct link?
I used igraph to generate a series of random graphs to determine, at least in a few cases, the probability that two nodes have a direct link given that they have a path length of three or less. I used two network models: the (directed) Erdős-Rényi model, in which the connection probability between any two nodes is constant, and the (directed) Barabási-Albert model, in which the connection probability is proportional to the number of links a node already has. The amount of "preferential attachment" is tuned by the power parameter. I examine graphs of different sizes and different link probabilities (for Erdős-Rényi) or powers (for Barabási-Albert). The results are shown in the above figure. Each point represents the average probability of a direct link between two nodes in a graph given a path length of three or less between the two nodes, where the average is taken over 1,000 simulations of the graph.
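For readers who want to try something similar, here is a minimal sketch of the Erdős-Rényi half of this kind of exercise. It is not the original simulation code: it uses only the Python standard library rather than igraph, omits the Barabási-Albert model, and defaults to far fewer simulations. It also assumes one particular reading of the quantity of interest, namely the share of ordered node pairs joined by a directed path of length three or less that are also joined by a direct link. All function names and parameter values are illustrative.

```python
import math
import random
from collections import deque


def random_digraph(n, p, rng):
    """Directed Erdős-Rényi graph: each ordered pair (i, j), i != j, is linked with probability p."""
    return {i: {j for j in range(n) if j != i and rng.random() < p} for i in range(n)}


def reachable_within(adj, source, max_len):
    """Nodes reachable from source via a directed path of at most max_len links (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_len:
            continue  # do not expand beyond the path-length limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]
    return set(dist)


def prob_direct_given_close(adj, max_len=3):
    """Share of ordered pairs within max_len steps of each other that have a direct link."""
    close = direct = 0
    for i in adj:
        close += len(reachable_within(adj, i, max_len))
        direct += len(adj[i])  # direct neighbours are always within max_len >= 1
    return direct / close if close else float("nan")


def average_over_simulations(n, p, n_sims=200, seed=0):
    """Average the conditional probability over n_sims random graphs, skipping empty graphs."""
    rng = random.Random(seed)
    values = [prob_direct_given_close(random_digraph(n, p, rng)) for _ in range(n_sims)]
    values = [v for v in values if not math.isnan(v)]
    return sum(values) / len(values)


if __name__ == "__main__":
    for p in (0.02, 0.05, 0.10):
        print(p, round(average_over_simulations(n=30, p=p), 3))
```

Varying `n` and `p` in the final loop gives a rough sense of how network size and sparsity affect the result under this definition.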
While these results may be sensitive to the particular parameters I've chosen, a few patterns stand out. The size of the network seems to matter less than its sparsity. Unsurprisingly, as tie probabilities increase, the probability of a direct link given that we observe an indirect link (path length three or less) also increases. Perhaps most notable is the "baseline": even when link probabilities are quite low, there is a non-trivial chance that an observed indirect link is actually a direct link (and this is particularly true in the somewhat more realistic Barabási-Albert model).
What do you think, should we worry about measurement error in networks? Do you know of good ways of handling such error?
7 May 2010
James Montgomery, a sociologist and economist at the University of Wisconsin, has a draft of a new book on mathematical sociology available for download. Unlike James Coleman's classic 1964 text, Montgomery's comes complete with MATLAB code, a sign of 45 years of progress :)