27 September 2007
Not to take anything away from David Lazer's presentation today at the Applied Stats workshop, but the star of his talk was the data. The crowd favorite appeared to be a dataset of all cell phone transactions over a several-week period for 7,000,000 subscribers somewhere in Europe (he wouldn't say where). David and his colleagues have built a graph of interpersonal connections from the call data and are trying to answer questions like, "How many degrees of separation are there between two randomly selected people in the network?" (Answer: 13.) But to me an even more compelling question came up in the Q&A session: where do you get data like this?
David's answer was basically that you need to know the right people; it sounded as if he or one of his colleagues knew key executives at the phone company who were able to provide the call records. Lee Fleming offered that grad students might find their way to data like this by getting to know scholars like David who have access to it. (How many degrees of separation are there between you and your dream dataset?)
But the importance of knowing cell phone execs would be the wrong takeaway from David's talk, which after all was about how we are all awash in data these days. Yes, to get data on cell phone calls you may need to have friends at the phone company, and yes, to get information on where a group of MIT students spends every hour of the day over a few weeks you will have to launch your own experiment (as described in David's talk today). Still, for those of us with fewer connections and smaller research budgets there is an enormous amount of data out there to collect, much of it from the web. I've spent a fair amount of time in the past year learning how to collect data from the web, and I look forward to blogging here about web scraping and other data collection approaches in the next few months. But right now I'm going to go check whether David left any tracking devices in my bag.
Posted by Andy Eggers at September 27, 2007 12:44 AM