31 May 2008
I'm grateful for the strong response to my original query for quality, free PDF annotation for Linux. In general, there seem to be a few categories.
-Windows-based editors, adaptable through emulators: PDF X-change, Foxit (free version), primopdf
-Linux editors with non-portable annotations: Okular, which has hidden XML files for its annotations (skim, for OS X, has the same scheme)
-early, incomplete solutions that will eventually be good: GNU's PDF project, Xournal
-early, incomplete solutions that aren't user-friendly: pdfedit, Cabaret Stage
-early solutions that are still in progress: evince
Of all of these options, I like Okular the best, mainly because integrating its XML-saved annotations into the PDF is but one plugin away (which might already exist, for all I know), and it's theoretically portable to Windows by installing qt4 binaries. Using an emulator like wine is a hassle big enough that I've avoided it, for the same reason I don't use cygwin on Windows systems.
So we're close to a (more) universal free editing environment. But I'm still not a fan of doing all my work on a screen, and also not willing to print. So I'm trying a middle road.
I bought an iLiad e-paper reader this past week, and so far I'm impressed with how it handles (though its price tag, $600 for the model I bought, definitely isn't for everyone, and was almost not for me). The screen is easily readable, the battery lasts, and I can zoom in and rotate documents to get a half-page display with larger text. More importantly, the device runs Linux and iRex has made a point to try and use open source software as much as possible, in contrast to Amazon and the Kindle (which is half the size, can't read PDFs and can't edit books.)
However, as the project is still in its relative infancy, there are a few functions it has yet to incorporate that I really would like, and they're the same ones I want in a computer-based annotator: highlighting multiple-column text, for example, so that I can extract passages I want later at the push of a button. And like Okular, the annotations made on the iLiad are saved in a companion XML file rather than the original PDF, but the company offers a free program to do the merging.
I'm going to continue to explore what the iLiad can do as far as editing, but it's definitely reassuring that everyone who's seen me used it has oohed and aahed at it.
To sum up, I've now got a free platform for reading, editing and annotating PDFs on a Linux machine, and an auxiliary paper-free method for reading them later which is admittedly not free. And I have more needs as well, but I can at least see them being met soon. What else do people want in paperless work we haven't covered yet?
P.S. If the people from iRex are reading this and want me to shill for them for real, they can let me know directly.
26 May 2008
I'm a Linux user in need of a quality PDF reader with basic annotation tools, and I need it to be available for free. Think I'm asking for too much?
We're at a point where the level of content available online dwarfs our ability to print it all onto paper for examination and notation. As academics, we're expected to sort through volumes of other people's work in order to verify that our own is original, as well as comment, annotate, and on occasion make corrections or forward-references to later works.
But despite a boom in computational power and information bandwidth, the software to do this without resorting to printed or copied matter isn't accessible to most students without paying through the nose. Full software suites like Adobe Acrobat aren't necessary for the kind of work academics need to do. There are a few functions that are essential to the task, currently available in commercial software:
-Adding and reading notes, whether free-floating or attached to highlighted text
-The ability to select and copy multi-column text (none of the free ones seem to be able to get this one right)
-I'd like that when LaTeX creates a link to a footnote or citation, hovering over the displayed link should cause a pop-up box to display the information.
I'm a man with big ideas but no time, and more importantly, no budget, to motivate and drive the development and use of a free PDF reader with mild annotation capabilities. I can't resort to the for-pay software available from the school website because I'm running Linux, and I shouldn't have to go to a virtual machine or another computer to do this kind of annotation. Likewise, others shouldn't have to spend hundreds for software where they only need a few simple functions.
I suppose the issue is that everyone has their own toys they want included in a PDF editor, which is why the commercial package makes sense. But as academics, wouldn't we be happy with "the basics plus"?
22 May 2008
Professor Nicholas Christakis and Professor James Fowler's study on social network and smoking cessation is featured in the New York Times, which is also going to appear in the New England Journal of Medicine this Thursday. Congratulations to them!
Their basic findings are that smokers are likely to quit in groups (As Nicholas said, "Whole constellations are blinking off at once.") and that the remaining smokers tend to be socially marginalized.
One interesting question I have for their study is that, if friends tend to quit smoking together, will this partly contribute to the simultaneous weight gains among friends, a result Nicholas and James have found last year using the same dataset? In other words, I totally accept that social ties have important impacts on individuals' wellbeing, but if you try to research a certain outcome of wellbeing and do not control for the "contaminating" effects from other outcomes, the estimation of the social network effects on the former outcome could be biased. For example, the weight gains among friends, from this point of view, could be partially resulted from their simultaneous quitting from smoking. Of course, if smokers only consist of a very small fraction of the participants in the studied sample and their weight changes are not too extreme, the bias of the estimation should not invoke a serious problem.
See the following link for a glimpse of their study.
Study Finds Big Social Factor in Quitting Smoking
Sorry for the duplicate if you have noticed this news.
20 May 2008
Jointly with Dave Kane, an IQSS fellow and head of Kane Capital, I've been working on applying causal inference techniques to the financial problem of performance evaluation. We have a draft on SSRN up here.
The problem: how do you evaluate a stock portfolio's performance? This is usually done by comparing the returns on the manager's portfolio against those of a counterfactual portfolio of investments the manager could have chosen, but did not. A common choice is a passive portfolio like the S&P 500. If a manager can't perform at least as well as a passive benchmark like this, why not just invest in the S&P 500? But this may not be a fair comparison, since the S&P 500 contains only large-cap stocks, while the manager may actually have considered a wider universe of possibilities. Any difference in returns could be due to the portfolio's smaller capitalization rather than the manager's stock-picking ability.
Dave Kane and I view performance evaluation as a causal inference problem. We consider the treatment to be the manager's claimed advantage. Does he time the market? Does he pick hot sectors? Most commonly a manager claims an ability to pick stocks. Then the covariates are the set of confounding factors: observable characteristics of stocks, such as their capitalization, sector and country.
To get a better benchmark, we propose forming a matching portfolio of stocks with similar characteristics, but which are not held in the portfolio. In the leftmost figure above, the black dots represent the characteristics of holdings in a particular portfolio we considered (an equal-weighted portfolio based on the StarMine indicator). The gray dots represent non-holdings. We form the matching portfolio by matching each black dot to a nearby gray dot, using a propensity score method. When we're done, we end up with a well-matched portfolio -- the exposures are compared in the second figure, and they line up nicely. Notice from the figure that there are several possible matched portfolios -- we consider a random set of 100 of them, matching within a thin caliper, as part of our benchmark.
Finally, we compare the realized portfolio return against the returns of the matched portfolios. When we do that, we obtain the histogram below. The portfolio outperforms 75% of the matched portfolios, suggesting there's a moderate but not overwhelming amount of evidence for the stock-picking ability of the StarMine indicator.
In the paper we consider several extensions of this framework to situations with non-equal portfolio weights and to long-short portfolios. We employ the generalized propensity score of Imai and Imbens to form the matching portfolios in this case, treating the portfolio weights of the stocks as a continuous treatment.
We welcome any comments, thoughts or reactions to these ideas! The SSRN draft is linked above, and an accompanying R package is available here if you want to reproduce the computations.
19 May 2008
Mark Blumenthal from pollster.com has been posting interviews with scholars at the 2008 AAPOR conference, including two with our very own Sunshine Hillygus and Chase Harrison from the Program on Survey Research:
15 May 2008
I just finished reading an interesting paper on placebo effects in drug trials by Anup Malani. Malani noticed that participants in high probability trials know that they more likely to get active treatment (because of informed consent prior to the trial). They have higher expectations and hence should have higher placebo effects than patients in low probability trials. Malani compares outcomes across trials with different assignment probabilities and finds evidence for placebo effects. A related finding is that the control group in high probability trials reports more side effects.
The paper discusses some potential implications of placebo effects, e.g. that patients who are optimistic about the outcome might change their behavior and hence get better even without the active drug. It makes me wonder how this might translate into non-medical settings and whether there are studies of placebo effects in the social sciences. Also, if placebo drugs can improve health outcomes, maybe ineffective social programs would still work as long as participants don’t know whether the program works or doesn’t? Maybe this is the role of politics. But what about the side-effects?
Malani, A (2006) “Identifying Placebo Effects with Data from Clinical Trials” Journal of Political Economy, Vol. 114, pp. 236-256. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=901838
A medical treatment is said to have placebo effects if patients who are optimistic about the treatment respond better to the treatment. This paper proposes a simple test for placebo effects. Instead of comparing the treatment and control arms of a single trial, one should compare the treatment arms of two trials with different probabilities of assignment to treatment. If there are placebo effects, patients in the higher-probability trial will experience better outcomes simply because they believe that there is a greater chance of receiving treatment. This paper finds evidence of placebo effects in trials of antiulcer and cholesterol-lowering drugs.
13 May 2008
I know this isn't my normal day, but three points today:
I'm less worried about the turnout discrepancy; it happened because there had been no semi-open Democratic primary since Huckabee dropped out of the Republican contest. I was forced to use Pennsylvania (a closed primary) and Ohio (a semi-open primary, but with Huckabee still formally in) to predict turnout, which resulted in my underestimates. I'm more confident about my turnout projection in West Virginia, which is a semi-open primary, now that I have North Carolina to use as a predictor.
In predicting voter shares, my overall county-level correlations were .81 for Indiana and .88 for North Carolina -- on the whole pretty good, but with some problems. Below are spatial plots of residuals for North Carolina, and Indiana's appear above. Dark red corresponds to overestimation of Obama's support, and dark grey to underestimation of Obama's support.
The biggest mistake in my North Carolina predictions came with rural Blacks, who had not appeared significantly in my training data. The largest-magnitude residual was Greene County, a rural county that's 50% White and 40% Black (it's the small dark red). I projected a 70%-30% Obama victory, as is typical for counties with this racial split (note that among Democrats in such a county, Blacks will dominate). But somehow Clinton actually won this county 53% to 47%, putting me 23% off. In all of the neighboring rural black counties I had similarly overestimated Obama's support. This points to a possible interaction effect -- that rural blacks are more pro-Clinton than urban blacks.
Now to my top-line West Virginia prediction: Clinton 70.5%, Obama 29.5%, with a turnout of 300,000 votes. The map is below. I have Clinton taking every county in the state. Obama comes closest in Jefferson (a high-income, well-educated county next to Virginia) and Monongalia (a well-educated urban county that’s part of Pittsburgh tri-state).
With Clinton's impending departure, however, I plan to abandon these projections and move on to other fun. I really want to try a language model on Obama's and McCain's speeches.
I recently came across Datamob.org, a site featuring public datasets and interfaces that have been built to help the public explore them.
From datamob's about page:
Our listings emphasize the connection between data posted by governments and public institutions and the interfaces people are building to explore that data.
It's for anyone who's ever looked at a site like MAPLight.org and wondered, "Where did they get their data?" And for anyone who ever looked at THOMAS and thought, "There's got to be a better way to organize this!"
I continue to wonder how the types of interfaces featured on datamob will affect the dissemination of information in society. The dream of a lot of these interface builders is to disintermediate information provision -- ie, to make it possible for citizens to do their own research, produce their own insights, publish their findings on blogs and via data-laden widgets. (We welcomed Fernanda and Martin from Many Eyes, two prominent participants in this movement, earlier this year at our applied stats workshop.) At the same time, the new interfaces make it cheaper for professional analysts -- academics, journalists, consultants -- to access the data and, as they have always done, package it for public consumption. It makes me wonder to what extent the source of our data-backed insights will really change, ie, how much more common will "I was playing around with data on this website and found out that . . . " become relative to "I heard about this study where they found that . . ."?
My hunch is that, just as blogging and internet news has democratized political commentary, the new data resources will make it possible for a new group of relatively uncertified people to become intermediaries for data analysis. (I think FiveThirtyEight is a good example in political polling, although since the site's editor is anonymous I can't be sure.) People will overwhelmingly continue to get data insights as packaged by intermediaries rather than through new interfaces to raw data, but the intermediaries (who will use these new services) will be quicker to use data in making their points, will become much larger in number, and will on average become less credentialed.
9 May 2008
fabulous three part series on further adventures in identification on the Freakonomics blogs here, here, and here. The story features Kennedy School Professor Robert Jensen in his five year long quest of achieving rigorous identification for Giffen effects. After finding correlational evidence for Giffen goods in survey data he and his co-author actually followed up by running an experiment in China and guess what, they do find evidence for Giffen behavior. Impressive empirics and a funny read, enjoy!
8 May 2008
Last week we had an International Meeting on Methodology for Empirical Research on Social Interactions, Social Networks, and Health here at the IQ., thanks to the organization by Professor Charles Manski and Professor Nicholas Christakis. Some people told me that the second day of the meeting was much more "dynamic and interactive" than the first day and based on what I have seen, I believe it was true. I saw at least three cliques of speakers were automatically formed on site along the disciplinary lines: statisticians, economists, and sociologists and political scientists. There were even sub-cliques and backfires! Fortunately, nobody was severely wounded. But anyway, it was a great intellectual exchange between disciplines. Below are some brief notes I took at the second day of the meeting, particularly at the last 20 minutes of the meeting when speakers talked about the future directions of network analysis in social sciences. Sorry for that I forgot to jot down exactly who said what, and that I also squeezed into the notes some of my personal thoughts. I took full responsibility for all errors in the notes.
1. Need to combine game theory with social network analysis, particularly evolutionary game theory (and transaction costs theory).
2. Need to further develop social network analysis based on (random) graph theory, typology and random matrix theory.
3. Network studies tend to focus on network structure and typology as dependent variables while social sciences are more concerned with how network positions and features affect node level of problems. To put simply, network studies tend to start from nodes and end at network while social sciences are more like a top-down approach.
4. In either case, however, it is very crucial to understand the data/tie generating mechanism. Especially, think that the formation of ties can go two ways: influence and selection. For example, smokers can become friends either because a person is influenced by his/her smoking friend to start smoking or because they are both smokers and then become friends. For another example, a highly educated person is usually less likely to be nominated by others as the best friend. This could be either because the highly educated person is less trustworthy or incapable to maintain friend ties or because he/she is more independent and less wiling to associate with others.Longitudinal data may help solve the influence vs. selection issue.
5. Network analysis assumes that the probability of forming ties between nodes is the same between any pair of nodes. So start with a meaningful number of nodes to build network so that each node have roughly the same probability to form ties with one another.
6. How the sever of an existing tie and the formation of a new tie will affect the structure of social network? How ties can bring more ties and lead to polarized network? Nonlinear generating processes and dynamics in network can lead to dramatic difference in network structure for any tiny changes at the node level. How network size can affect network structure? (Think about the difference among monopolistic market, oligarchic market and perfect competitive market.)
7. How to define homophyly between friends? One dimension vs. multiple dimensions? Suppose it is one dimension, there are still two approaches: 1) do a mean test between the tie senders and the tie receivers. 2) Use the ratio of the number of ties whose connected nodes are in the same group (e.g., age +/- 5) that you defined to the total number of ties as an alternative measure. What else?
8. Need to think about how to incorporate network analysis into traditional regression framework. We can either include network properties into regression models to study how network affect personal/clique level of phenomena or use regressions to evaluate how network properties are determined by socioeconomic variables.
9. How to deal with the dependence structure among node level of variables since the errors are not iid.? Is it enough to just using correlation matrix to weight the standard errors and get robust SEs?
10. Need to combine network software with traditional statistical software. The stat-net is getting there. But for Stata users, canned programs are needed to generate network data inside of Stata.
Lastly, for those of you who are interested in causal analysis, read Patrick Doreian (2001), "Causality in Social Network Analysis" (Sociological Methods and Research 30: 81-114) and see if you can improve upon his study.
7 May 2008
Here's a link to a free, 18-hour mini-course on recent advances in econometrics and statistics from the National Bureau of Economic Research. It's co-taught by Guido Imbens and Jeffrey Wooldridge. The intended audience is obviously economists, but there are several topics (Bayesian inference, missing data, etc.) that are likely of interest to a wide range of social scientists. The course includes lecture videos, slides, as well as detailed notes on each topic.
One of the pesky things I've found in my (limited) experience with survival analysis is that it's almost impossible to plot several survival curves in the same space and include measures of uncertainty without the entire plot becoming incomprehensible. So, to build on the great R discussions Ellie and Andy have provided in recent blog posts, I'd like to offer an extension of my own. I've created a fairly flexible function that allows one to plot several survival curves along with estimation uncertainty from Zelig's Cox proportional hazards output (which was developed by Patrick Lam). Here are two examples of what my surv.plot() function can provide:
Hopefully this will be of some interest to a few readers. More details and example code below.
Here is the syntax for the command:
s.out: Simulated output from Zelig for each curve organized as a list()
duration: Surival time
censor: Censoring indicator
type: Display type for confidence bands. The default is "line" but "poly" is also supported (to create the shaded region in the right plot above).
plotcensor: Creates rug() plot indicating censoring times (Default is TRUE)
plottimes: plots a point for each survival time in the step function (Default is TRUE)
int: Desired uncertainty interval (Default is c(0.025,0.975) which corresponds to a 95% interval)
Here's the plot.surv() source code, and below I've copied the R code I used to create the plots above:
# Fit the Cox Model
z.out1 <- zelig(Surv(duration,ciep12)~invest+numst2+crisis,
# Set Low and High Quantities of Interest
low <- setx(z.out1,numst2=0)
high <- setx(z.out1,numst2=1)
# Simulate for Each
s.out1 <- sim(z.out1,x=low)
s.out2 <- sim(z.out1,x=high)
# Create list output that contains both simulations
out <- list(s.out1,s.out2)
# Plot the results
6 May 2008
I've been programming in R for four years now, and it seems that no how much I learn there are a million tiny ways that I could do it better. We all have our own programming styles and frequently used functions that may prove useful to others. I often find that a casual conversation with an office mate yields new approaches to a programming quandary. I'm speaking not of statistical insights, though those are important too, but rather the "simple" art of data manipulation and programming implementation--those essential tricks that help to improve coding efficiency. So, to that end I'm announcing the beginning of a bi-weekly "Tuesday Tips & Tricks" posting. These tips may include the description of a useful and perhaps obscure function, or the solutions to common coding problems. I'm selfishly hoping that if readers of this blog know of better or alternate approaches, they'll respond in the comment section. So I'm looking forward to reading your responses.
This week's tip: How to quickly summarize contents of an object.
Answer: summary(), str(), dput()
The primary option, of course, is the familiar summary() command. This command works well for viewing model output, but also to get a quick sense of data frame, matrices and factors. For example, summary of a data frame or matrix shows the following:
Hello test citynames
Min. :1.00 Min. :-3 Length:2
1st Qu.:1.25 1st Qu.:-2 Class :character
Median :1.50 Median :-1 Mode :character
Mean :1.50 Mean :-1
3rd Qu.:1.75 3rd Qu.: 0
Max. :2.00 Max. : 1
This is an incredibly useful function for numeric data, but is less useful for string data. For character vectors the summary function only reveals the length, class, and mode of the variable. In this case, to get a quick look at the data, one might want to use str(). Officially str() "compactly displays the structure of an arbitrary R object", and in practice this is incredibly useful. So using the same dataframe as an example:
'data.frame': 2 obs. of 3 variables:
$ Hello : num 1 2
$ test : num -3 1
$ citynames: chr "Cambridge" "Rochester"
In this case, this is just a 2 x 3 data frame, where the first variable is Hello, it's a numeric variable, and the values of the variable Hello are: 1, 2. In this case, the character vector for citynames is much more usefully displayed. While this is a small example, the function works just as well for much larger data frames and matrices where it only displays the first ten values of each variable.
For smaller objects, the function dput() might also prove useful. This function shows the ASCII text representation of the R object and it's characteristics. So for this same example:
structure(list(Hello = c(1, 2), test = c(-3, 1), citynames = c("Cambridge",
"Rochester")), .Names = c("Hello", "test", "citynames"), row.names = c(NA,
-2L), class = "data.frame")
4 May 2008
Since I have qualifying exams tomorrow, I'll keep this entry unimaginative. I've re-run my predictions for the Indiana and North Carolina primaries on Tuesday, adding a few new bells and whistles:
With the help of a turnout model, I can actually predict the election result by multiplying turnout by population and adding up votes for Clinton and Obama. When I do that, I get:
Indiana: Clinton 53.5%, Obama 46.5%; turnout 950,000
North Carolina: Obama 58%, Clinton 42%; turnout 1,200,000
Yowzers! We'll see how the real numbers pan out. Here are a few details on the two models:
This time I included even more covariates for both models. Next to the ones found to be important, I've placed their effect in parentheses.
How do my results stack up against the current polls? In Indiana, the RealClearPolitics average has Clinton +6%, only a point from my prediction. In North Carolina, the RCP average has Obama +8%, significantly below my predicted 16% victory. Two factors shed light on this discrepancy:
We'll see how it pans out on Tuesday. I'm more than willing to eat crow :)
1 May 2008
James Heckman has a new NBER working paper ``Econmetric Causality’’ which some of you might interesting. To give you a flavor, Heckman writes
``Unlike the Neyman–Rubin model, these [selection] models do not start with the experiment as an ideal but they start with well-posed, clearly articulated models for outcomes and treatment choice where the unobservables that underlie the selection and evaluation problem are made explicit. The hypothetical manipulations define the causal parameters of the model. Randomization is a metaphor and not an ideal or “gold standard".’’ (page 37)
Heckman, J (2008) ``Econometric Causality’’ NBER working paper #13934. http://papers.nber.org/papers/W13934
Abstract: This paper presents the econometric approach to causal modeling. It is motivated by policy problems. New causal parameters are defined and identified to address specific policy problems. Economists embrace a scientific approach to causality and model the preferences and choices of agents to infer subjective (agent) evaluations as well as objective outcomes. Anticipated and realized subjective and objective outcomes are distinguished. Models for simultaneous causality are developed. The paper contrasts the Neyman-Rubin model of causality with the econometric approach.