5 November 2012

App Stats: Bischof on "Summarizing Topical Content in Document Collections with Word Frequency and Exclusivity"

We hope you can join us this Wednesday, November 7, 2012, for the Applied Statistics Workshop. Jon Bischof, a Ph.D. candidate in the Department of Statistics at Harvard University, will give a presentation entitled "Summarizing Topical Content in Document Collections with Word Frequency and Exclusivity". A light lunch will be served at 12.00 pm and the talk will begin at 12.15 pm.

"Summarizing Topical Content in Document Collections with Word Frequency and Exclusivity"
Jon Bischof
Department of Statistics, Harvard University
CGIS K354 (1737 Cambridge St.)
Wednesday, November 7th, 2012 12.00 pm

Abstract:

An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively in terms of topics. However, the current practice of summarizing themes in terms of most frequent words limits interpretability by ignoring the differential use of words across topics. We argue that words that are both frequent and exclusive to a theme are more effective at characterizing topical content. We consider a setting where professional editors have annotated documents to a collection of topic categories, organized into a tree, in which leaf-nodes correspond to the most specific topics. Each document is annotated to multiple categories, at different levels of the tree. We introduce Hierarchical Poisson Convolution (HPC) as a model to analyze annotated documents in this setting. The model leverages the structure among categories defined by professional editors to infer a clear semantic description for each topic in terms of words that are both frequent and exclusive. We develop a parallelized Hamiltonian Monte Carlo sampler that allows the inference to scale to millions of documents.
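
To make the frequency-versus-exclusivity idea concrete, here is a minimal sketch in Python. It is not the HPC model from the talk and involves no hierarchy or sampler; it only ranks words from raw per-topic counts. The function names, the toy counts, and the choice to combine the two within-topic ranks with a weighted harmonic mean are all assumptions made for illustration.

```python
def ecdf(values):
    """Map each key to the fraction of entries whose value is <= its own."""
    n = len(values)
    return {
        key: sum(1 for other in values.values() if other <= v) / n
        for key, v in values.items()
    }


def frequency_exclusivity_scores(topic_counts, weight=0.5):
    """Score words per topic by combining frequency and exclusivity.

    topic_counts: dict mapping topic -> dict mapping word -> positive count.
    weight: share of the score given to exclusivity (illustrative parameter).
    """
    # Within-topic usage rate of every word.
    rates = {}
    for topic, counts in topic_counts.items():
        total = sum(counts.values())
        rates[topic] = {word: c / total for word, c in counts.items()}

    scores = {}
    for topic, topic_rates in rates.items():
        # Exclusivity: this topic's rate as a share of the word's rate
        # summed over all topics (1.0 means no other topic uses the word).
        exclusivity = {
            word: rate / sum(r.get(word, 0.0) for r in rates.values())
            for word, rate in topic_rates.items()
        }
        freq_rank = ecdf(topic_rates)
        excl_rank = ecdf(exclusivity)
        # A weighted harmonic mean is low if either rank is low, so a word
        # must do well on both criteria to score highly.
        scores[topic] = {
            word: 1.0 / (weight / excl_rank[word]
                         + (1.0 - weight) / freq_rank[word])
            for word in topic_rates
        }
    return scores


def top_words(word_scores, k=1):
    """Return the k highest-scoring words."""
    return sorted(word_scores, key=word_scores.get, reverse=True)[:k]


# Toy counts, invented for this example.
toy_counts = {
    "sports": {"the": 50, "game": 20, "team": 15, "score": 5},
    "finance": {"the": 60, "market": 25, "stock": 10, "score": 5},
}
toy_scores = frequency_exclusivity_scores(toy_counts)
most_frequent_sports = top_words(toy_counts["sports"])
best_scored_sports = top_words(toy_scores["sports"])
best_scored_finance = top_words(toy_scores["finance"])
```

In this toy data the most frequent word in both topics is "the", which says nothing about either topic. Once exclusivity is taken into account, the top-ranked words become "game" for sports and "market" for finance.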

Posted by Konstantin Kashin at November 5, 2012 11:29 AM