
18 April 2008

Linguistics of the Debate

In last week's debate in Philadelphia,

  • Clinton's favorite phrase was "You know," which she used 49 times to Obama's 18
  • Obama's favorite phrase was "American people," which he used 16 times to Clinton's 1
  • Obama was the only one to use the words "politics" (10 times), "economic" (9 times) and "election" (9 times).

Last week's debate provides a small but interesting corpus to analyze the candidates' favorite linguistic formulations. Overall,

  • 12,329 words were uttered by a candidate
  • Obama uttered 6,206 words (1,331 unique) in 40 chunks
  • Clinton uttered 6,123 words (1,250 unique) in 37 chunks

So all in all, the candidates spoke about the same number of words. But which words? We can test that using a basic corpus comparison method. In all, there were 1,971 unique words. For each of these, we test the hypothesis that the candidates spoke the word with equal probability, using a simple chi-squared test. Next we sort all words by their p-values so that the most differentially expressed words percolate to the top. Here are the top 20 words by p-value, along with their frequencies from Obama and Clinton.

Word        Obama  Clinton  p-value
will           18       56   0.0000
know           23       64   0.0000
that's         43       12   0.0001
she            16        0   0.0002
it             41       79   0.0005
how            36       12   0.0010
clinton        14        1   0.0021
he              5       21   0.0029
politics       10        0   0.0047
this           58       30   0.0047
american       20        5   0.0056
begin           0        9   0.0072
york            0        9   0.0072
decade          9        0   0.0081
economic        9        0   0.0081
election        9        0   0.0081
going          49       26   0.0128
give            1       10   0.0149
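The per-word test described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: it assumes a 2x2 table (this word vs. all other words, for each candidate) and a Yates continuity correction, and the function names `chi2_pvalue` and `rank_words` are made up for the example. The totals 6,206 and 6,123 are the word counts reported above.

```python
import math

def chi2_pvalue(count_a, count_b, total_a, total_b):
    """p-value for H0: both speakers use the word at the same rate.

    Assumption: a 2x2 chi-squared test (word vs. every other word)
    with Yates continuity correction, 1 degree of freedom.
    """
    if count_a + count_b == 0:
        return 1.0  # word never used by either speaker; nothing to test
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b, n - count_a - count_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            # Yates correction: shrink each deviation by 0.5, never below 0
            diff = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            stat += diff * diff / expected
    # Survival function of a chi-squared variable with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2.0))

def rank_words(counts_a, counts_b, total_a, total_b):
    """Return (word, count_a, count_b, p) tuples sorted by ascending p-value."""
    vocab = set(counts_a) | set(counts_b)
    rows = []
    for word in vocab:
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        rows.append((word, a, b, chi2_pvalue(a, b, total_a, total_b)))
    return sorted(rows, key=lambda row: (row[3], row[0]))

# A few counts taken from the table above, with the totals reported in the post
obama = {"will": 18, "politics": 10, "she": 16}
clinton = {"will": 56}
for word, a, b, p in rank_words(obama, clinton, 6206, 6123):
    print(f"{word:10s} {a:4d} {b:4d} {p:.4f}")
```

With roughly 2,000 words tested at once, some small p-values are expected by chance alone, so the ranking is best read as a way to surface candidates for a closer look rather than as a set of formal findings.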

Common function words ("I," "it," etc.) are sometimes excluded from this kind of analysis, but here I thought it would be fun to leave them in so we could see each candidate's preferred constructions. Besides the points listed above, here are a few interesting notes:

  • Clinton used the word "I" 205 times to Obama's 150
  • Obama loves to start sentences with "That's": "That's why I'm...", "That's what we're...", etc.
  • Obama loves the word "decade" -- evidently he used the phrase "decades after decades" several times

Of course, unigrams -- single words -- can only tell you so much. If we do the same analysis using bigrams, a few more bits of information drip out:

Bigram            Obama  Clinton  p-value
you know             18       49   0.0002
american people      16        1   0.0008
senator clinton      13        0   0.0009
the american         17        2   0.0014
and that's           13        1   0.0035
have a                5       20   0.0046
this country         10        0   0.0047
i will                7       23   0.0055
going to             46       22   0.0061
new york              0        9   0.0072

So Clinton habitually punctuates her thoughts with "you know," while Obama attributes his goals to the "American people."
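For anyone who wants to try this on a transcript, here is a rough sketch of how the unigram and bigram counts might be produced. The tokenizer is an assumption (lowercase, keep apostrophes so "that's" stays one token), as are the function names `tokenize` and `ngram_counts`; the post does not say how the text was actually split.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def ngram_counts(chunks, n=1):
    """Count n-grams over a list of speech chunks.

    N-grams are built within each chunk only, so a phrase never spans
    the end of one answer and the start of the next.
    """
    counts = Counter()
    for chunk in chunks:
        tokens = tokenize(chunk)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts

# Made-up chunks for illustration only, not lines from the debate transcript
chunks = ["You know, that's why I'm here.", "Well, you know how it is."]
print(ngram_counts(chunks, n=2).most_common(3))
```

The resulting counters for each candidate can then be fed to the same per-item chi-squared test used for single words.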

It will be interesting when McCain gets into the mix with one of these two. I think it would be fun to construct a language model -- a model for the probability that each candidate spoke a certain sentence. Given the differences above, I bet such a model could easily figure out whether Obama, Clinton or McCain said a given sentence!
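As a very rough sketch of what such a model could look like, here is a unigram language model per speaker with add-one smoothing, where a sentence is assigned to whichever speaker gives it the highest log-probability. Everything here is an assumption for illustration: the class and function names are invented, the speakers are treated as equally likely a priori, and the training text is made up rather than taken from the transcript.

```python
import math
from collections import Counter

class UnigramModel:
    """Unigram language model with additive (Laplace) smoothing."""

    def __init__(self, tokens, vocab_size, alpha=1.0):
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab_size = vocab_size
        self.alpha = alpha

    def log_prob(self, tokens):
        """Log-probability of a token sequence under this speaker's model."""
        denom = self.total + self.alpha * self.vocab_size
        return sum(
            math.log((self.counts[t] + self.alpha) / denom) for t in tokens
        )

def train(corpus):
    """corpus maps speaker name -> list of tokens; returns speaker -> model."""
    vocab = {t for tokens in corpus.values() for t in tokens}
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    return {s: UnigramModel(tokens, vocab_size) for s, tokens in corpus.items()}

def guess_speaker(models, tokens):
    """Pick the speaker whose model assigns the sentence the highest likelihood."""
    return max(models, key=lambda s: models[s].log_prob(tokens))

# Invented training text for two hypothetical speakers, A and B
corpus = {
    "A": "you know i will you know".split(),
    "B": "the american people that's why".split(),
}
models = train(corpus)
print(guess_speaker(models, "you know".split()))
```

A unigram model ignores word order entirely, so a bigram or trigram version would be the natural next step given how distinctive phrases like "you know" and "American people" turned out to be.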

Posted by Kevin Bartz at 12:44 PM