13 May 2009
The social sciences have long embraced the idea of text-as-data, but in recent years, increasing numbers of quantitative researchers are investigating how to have computers find answers to questions in texts. This task might appear easy at the outset (as it apparently did to early researchers in machine translation), but, as we know, natural languages are incredibly complicated. In most of the applications in social science, analysts end up making a "bag of words" assumption: the relevant parts of a document are the actual words, not their order (this is not an unreasonable assumption, especially given the questions being asked).
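To make the "bag of words" assumption concrete, here is a minimal Python sketch. The function name `bag_of_words` and the simple regular-expression tokenizer are my own illustrative choices, not a standard from any particular text-analysis package. It reduces a document to word counts, so two sentences with the same words in a different order become indistinguishable:

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a document to word counts, discarding word order.

    Hypothetical helper for illustration: lowercases the text and
    treats runs of letters and apostrophes as words.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


first = bag_of_words("The dog bit the man.")
second = bag_of_words("The man bit the dog.")

# Same words, different order: the two representations are identical.
print(first == second)
```

The two sentences mean very different things, yet they collapse to the same counts. That is the information the assumption throws away, and for questions about topics or word usage it is often an acceptable loss.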
When I see applications of natural language processing (NLP) in the social sciences, my thoughts turn quickly to its future. Computers are making strides at being able to understand, in some sense, what they are reading. Two recent articles, however, give a good overview of the challenges that NLP faces. First, John Seabrook of the New Yorker had an article last summer, "Hello, Hal," which states the problem clearly:
The first attempts at speech recognition were made in the nineteen-fifties and sixties, when the A.I. pioneers tried to simulate the way the human mind apprehends language. But where do you start? Even a simple concept like "yes" might be expressed in dozens of different ways--including "yes," "ya," "yup," "yeah," "yeayuh," "yeppers," "yessirree," "aye, aye," "mmmhmm," "uh-huh," "sure," "totally," "certainly," "indeed," "affirmative," "fine," "definitely," "you bet," "you betcha," "no problemo," and "okeydoke"--and what's the rule in that?
The article is mostly about speech recognition, but it definitely hits the main points about why human-generated language is so tricky. The second article, in the New York Times recently, is a short piece about Watson, the computer that IBM is creating to compete on Jeopardy! IBM is trying to push the field of Question Answering quite a bit forward with this challenge. The goal is to create a computer that you can ask a natural language question and get the correct answer. A quick story in the article indicates that they may have a bit to go:
In a demonstration match here at the I.B.M. laboratory against two researchers recently, Watson appeared to be both aggressive and competent, but also made the occasional puzzling blunder.
For example, given the statement, "Bordered by Syria and Israel, this small country is only 135 miles long and 35 miles wide," Watson beat its human competitors by quickly answering, "What is Lebanon?"
Moments later, however, the program stumbled when it decided it had high confidence that a "sheet" was a fruit.
This whole Watson enterprise makes me wonder if there are applications for this kind of technology within the social sciences. Would this only be useful as a research aid, or are there empirical discoveries to be made with this? I suppose it comes down to this: if a computer could answer your question, what would you ask?