13 April 2014
Please join us for this workshop on Git and GitHub.
Tuesday, April 15
Center for Complex Network Research
Dana Building, 5th Floor
with Navid Dianati (LazerLab, Northeastern University)
Come review the basics of Git, the popular open-source revision control software, as well as GitHub, the free online hosting service. Through examples, see how Git can be used for source code documentation and revision control in single-author and collaborative projects.
1 April 2014
Another month, another set of predictions...
Costa Rica - incumbent party lose - 70.0%
Guinea-Bissau - incumbent party win - 79.1%
Macedonia - incumbent party win - 97.3%
Afghanistan - incumbent party win - 55.9%
Algeria - incumbent party win - 99.2%
South Africa - incumbent party win - 98.3%
Panama - incumbent party win - 98.9%
Lithuania - incumbent party win - 93.6%
Malawi - incumbent party lose - 55.5%
Colombia - incumbent party win - 97.5%
Ukraine - incumbent party lose - 70.3%
Indonesia - incumbent party lose - 77.9%
Turkey - incumbent party lose - 75.1%
Ref: C1, DM1, and DP10
24 March 2014
In my posts over the last few weeks, I have focused on the "massive passive" data collections that occur through the observation of behavior on platforms that happen to record those behaviors, such as Google, Twitter, etc. There is another thrust of "big data": the purposeful instrumentation of humans to collect detailed information about their behaviors. One example is a 2009 paper in PNAS, with Nathan Eagle and Sandy Pentland, which involved programming ~100 phones to collect ongoing communication, location, and proximity data. Such deep data collection has enormous scientific possibilities, but it also poses challenges to the standard pillars of human subjects protection. How do you manage consent when people don't know what data collection months into the future might reveal? What are the practical challenges in keeping the data secure as it makes its way from (say) a smartphone to some centralized data infrastructure? A number of us, led by Arek Stopczynski and Sune Lehmann, have gathered some thoughts on the issues and the practicalities, which we have posted on arXiv. This is very much a work in progress, and we would very much appreciate feedback.
Arkadiusz Stopczynski, Riccardo Pietri, Alex Pentland, David Lazer, Sune Lehmann
(Submitted on 20 Mar 2014)
In recent years, the amount of information collected about human beings has increased dramatically. This development has been partially driven by individuals posting and storing data about themselves and friends using online social networks or collecting their data for self-tracking purposes (quantified-self movement). Across the sciences, researchers conduct studies collecting data with an unprecedented resolution and scale. Using computational power combined with mathematical models, such rich datasets can be mined to infer underlying patterns, thereby providing insights into human nature. Much of the data collected is sensitive. It is private in the sense that most individuals would feel uncomfortable sharing their collected personal data publicly. For this reason, the need for solutions to ensure the privacy of the individuals generating data has grown alongside the data collection efforts. Out of all the massive data collection efforts, this paper focuses on efforts directly instrumenting human behavior, and notes that -- in many cases -- the privacy of participants is not sufficiently addressed. For example, study purposes are often not explicit, informed consent is ill-defined, and security and sharing protocols are only partially disclosed. This paper provides a survey of the work related to addressing privacy issues in research studies that collect detailed sensor data on human behavior. Reflections on the key problems and recommendations for future work are included. We hope the overview of the privacy-related practices in massive data collection studies can be used as a frame of reference for practitioners in the field. Although focused on data collection in an academic context, we believe that many of the challenges and solutions we identify are also relevant and useful for other domains where massive data collection takes place, including businesses and governments.
15 March 2014
One of the challenges in translating searches for health-related information into frequency information (as Google Flu Trends attempts to do) is that we don't know why a particular person is searching for that information. If I search for "flu cures", am I searching because I have the flu? Because someone in my family has the flu? Because I want to be prepared, or because I am just interested in the subject? And exactly which searches tend to be associated with which motivations? If we knew the signal relationship between searches and whether someone was sick (say, with the flu), and how that varied over time (say, with season) and with exogenous shocks (like news coverage), then it would be possible to build stronger legs underneath efforts such as GFT.
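To make this identification problem concrete, here is a minimal sketch (all rates and population figures are hypothetical illustration values, not estimates from any study): because aggregate search volume mixes the searches of the sick with background searches by the curious, two very different flu prevalences can produce nearly the same search count.

```python
# Sketch: observed searches for a term like "flu cures" as a mixture of
# motivations. All rates below are invented for illustration only.

def expected_searches(population, prevalence, p_search_if_sick,
                      p_search_if_curious):
    """Expected search count = searches by the sick + background searches."""
    sick = population * prevalence
    well = population - sick
    return sick * p_search_if_sick + well * p_search_if_curious

pop = 1_000_000
# A low-prevalence week with lots of background curiosity (e.g., news coverage)...
low = expected_searches(pop, prevalence=0.01, p_search_if_sick=0.30,
                        p_search_if_curious=0.009)
# ...and a high-prevalence week with no background searching at all...
high = expected_searches(pop, prevalence=0.04, p_search_if_sick=0.30,
                         p_search_if_curious=0.0)
# ...yield nearly the same observed volume (~12,000 searches each).
print(round(low), round(high))
```

Without outside information about the per-motivation search rates (which is exactly what a survey could supply), the search count alone cannot distinguish the two scenarios.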
Interestingly, Google has (or did have) such an effort to disentangle exactly these motivations when people search for health-related information, launched around 5 years ago. My reaction is two-fold: bravo, for complementing GFT with this survey-based approach to validation, and shame on them, for not somehow sharing the insights/data. Note: if I am wrong about the latter, please drop me a note/tweet and I will post a correction. In any case, it does reflect the kind of effort that should undergird big data efforts of this type; and it again highlights the need for transparency in this type of research.
Judging from the number of tweets with "big data hubris" (I'll admit the irony of my using this metric), our paper on Google Flu Trends has gotten a bit of a buzz. There is one small point I would like to elaborate on. A few people have suggested that perhaps GFT is right and the CDC data are wrong. In our analysis/discussion we are not assuming that the CDC data are "right" (indeed, in a trivial sense they must be wrong, and the statistical question is, generally, how wrong they are). However, GFT is built on top of CDC data--technically, it is not a predictive model of flu prevalence but a predictive model of future CDC reports about the present. If the CDC data have warts and GFT is working well, GFT will fit those warts. If the CDC data underestimate flu prevalence in certain regions, say, then GFT will as well.
As we note in the paper, the interpretation would be different for a methodology that directly aimed to measure flu prevalence. In that case, if there were a deviation, one would have to make an assessment as to which method was more likely to be accurate.
Of course, there are a few minor caveats to this. For example, if for some exogenous reason the CDC data were to dramatically drop in quality at a certain point in time (say, if funding for data collection were slashed) then there could be an argument that an approach such as GFT would, for a period at least, be more accurate than CDC data (since GFT would have been fit to the previously higher quality CDC data). But I don't think we have any reason to believe that this is the case currently.
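The point that a model fit to CDC reports will reproduce any systematic bias in those reports can be illustrated with a toy simulation (all numbers are invented; "searches" here is a stand-in for GFT's inputs, not real data): a model trained against a systematically low target learns that deflated scale and reproduces the underestimate.

```python
# Toy simulation: a model fit to a biased target inherits the bias.
# All series below are made up for illustration; no real GFT/CDC data.

true_prev  = [2.0, 3.0, 5.0, 8.0, 6.0, 4.0]    # "true" flu prevalence (%)
cdc_report = [0.8 * x for x in true_prev]      # CDC systematically reports 80% of truth
searches   = [10.0 * x for x in true_prev]     # search volume tracks true prevalence

# Least-squares fit of a single coefficient: cdc_report ≈ b * searches
b = sum(c * s for c, s in zip(cdc_report, searches)) / sum(s * s for s in searches)

fitted = [b * s for s in searches]
# b ≈ 0.08, so fitted ≈ 0.8 * true_prev: the model reproduces CDC's
# 20% underestimate exactly, even though its input tracked the truth.
print(b, fitted[0], true_prev[0])
```

The fit is "perfect" by its own criterion (it matches CDC reports), yet every prediction sits 20% below the true prevalence, which is the sense in which GFT fits the warts of its target.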
14 March 2014
As a fellow at Harvard's Berkman Center for Internet & Society, danah boyd has produced research on social media, social networks, and their users (e.g., teens) that has been referenced on this blog in the past. She recently published a book titled "It's Complicated: The Social Lives of Networked Teens", in which she explores a range of issues, research, and trends related to teenagers' (and their parents') use of new technology and social media. You can buy it, but since she is interested in sharing her research with as many people as possible, the complete book can also be downloaded as a PDF for free.
13 March 2014
As I've discussed before on this blog, big data have awesome potential for understanding human behavior at a societal scale. Indeed, I expect a "societal science" to emerge from the social and behavioral sciences over the next generation, driven by the research opportunities that big data offer.
However, big data had its "Dewey beats Truman" moment last winter, when the poster child of big data (at least for behavioral data), Google Flu Trends (GFT), went way off the rails in "nowcasting" the flu--overshooting the peak by 130% (and indeed, it has been systematically overshooting by wide margins for 3 years). Tomorrow we (Ryan Kennedy, Alessandro Vespignani, Gary King, and I) have a paper out in Science dissecting why GFT went off the rails, how that could have been prevented, and the broader lessons to be learned regarding big data.
[We have posted the pre-accepted version of the manuscript The Parable of Google Flu (WP-Final).pdf. We have also posted an SSRN paper evaluating GFT for 2013-14, since it was reworked in the Fall.]
Key lessons that I'd highlight:
1) Big data are typically not scientifically calibrated. This goes back to my post last month regarding measurement. This does not make them useless from a scientific point of view, but you do need to build into the analysis the fact that the "measures" of behavior are being affected by unseen factors. In this case, the likely culprit was the Google search algorithm, which was modified in various ways that we believe likely increased flu-related searches.
2) Big data + analytic code used in scientific venues with scientific claims need to be more transparent. This is a tricky issue, because there are both legitimate proprietary interests involved and privacy concerns, but much more can be done in this regard than has been done in the 3 GFT papers. [One of my aspirations over the next year is to work together with big data companies, researchers, and privacy advocates to figure out how this can be done.]
3) It's about the questions, not the size of the data. In this particular case, one could have done a better job estimating the likely flu prevalence today by ignoring GFT altogether and just projecting 3-week-old CDC data forward to today (better still would have been to combine the two). That is, a synthesis would have been more effective than a pure "big data" approach. I think this is likely the general pattern.
4) More generally, I'd note that there is much more that the academy needs to do. First, the academy needs to build the foundation for collaborations around big data (e.g., secure infrastructures, legal understandings around data sharing, etc.). Second, there needs to be MUCH more work done to build bridges between the computer scientists who work on big data and the social scientists who think about deriving insights about human behavior from data more generally. We have moved perhaps 5% of the way that we need to in this regard.
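Lesson 3 above can be made concrete with a toy simulation (synthetic series with invented numbers; the analysis in the paper is far more careful): against a smoothly evolving flu season, simply projecting the 3-week-old CDC value forward can beat a big-data signal that overshoots at the peak, and averaging the two beats either alone.

```python
# Toy illustration of lesson 3: synthetic data only, not real CDC/GFT series.
flu = [2.0, 2.2, 2.5, 2.9, 3.4, 4.0, 4.5, 4.8, 4.9, 4.7, 4.3, 3.8]  # "true" weekly %ILI

LAG = 3  # CDC reports arrive roughly three weeks late
peak = max(flu)
# "GFT-style" signal: tracks the truth but overshoots, worst at the peak
gft = [x * (1.0 + 0.5 * x / peak) for x in flu]

def mae(estimates, truth):
    """Mean absolute error, skipping weeks with no estimate yet."""
    pairs = [(e, t) for e, t in zip(estimates, truth) if e is not None]
    return sum(abs(e - t) for e, t in pairs) / len(pairs)

# Baseline: project the last available CDC report forward to "today"
lagged = [None] * LAG + flu[:-LAG]
# Synthesis: average the lagged CDC projection with the big-data signal
combo = [None if l is None else (l + g) / 2 for l, g in zip(lagged, gft)]

print("big data alone:", round(mae(gft[LAG:], flu[LAG:]), 2))
print("lagged CDC:    ", round(mae(lagged, flu), 2))
print("combination:   ", round(mae(combo, flu), 2))
```

In this toy setup the errors of the two sources partly cancel (the lagged baseline trails the truth while the big-data signal overshoots it), which is why the simple average wins; that is the sense in which synthesis beats a pure "big data" approach.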
Anyhow, there's more in the paper, so feel free to check it out.
I suspect we will be getting an upsurge in traffic over the next few days. I hope that while you are visiting you would be willing to volunteer for one of our citizen science projects:
1) to participate in some online experiments at Volunteer Science; or
2) if you have call/text log data stretching back to last April plus an Android phone, participate in our study on communication during the Boston Marathon (non-Bostonians are welcome to participate!)
8 March 2014
As we celebrate International Women's Day, my Twitter feed brims with links to reports on the state of women in the workplace. Several of them are fairly upbeat about the upward trends of the past decade. For example, a Financial Times Special Report on Women in Business in Emerging Markets shows that the proportion of women in senior management positions has increased globally from 19% in 2004 to 24% in 2014. (In case you wonder, the US ranks among the bottom 10 countries at 22%, ahead of Germany and Denmark, which at 14% are tied with India and the UAE for the bottom four; see this infographic for a summary.)
Another Women's Day tweet links to a blog post by CERN's director Rolf Heuer, who proudly reports a rise in female staff members from 17% to 20% in the last decade.
This is progress, but the numbers these reports show are still alarmingly low. As Rosabeth Moss Kanter wrote on occasion of Women's Day a few years back over at the HBR blog,
"By 2010, there is much progress to celebrate, and much left to do [...]. Increasing numbers of women have achieved powerful positions where they can lend a helping hand to women who are still victims of poverty, violence, health disparities, and limited education. Women leader networks that once functioned as career-builders and support systems for members, such as the International Women's Forum, now focus on what leaders can give to less-advantaged women. [...] So I'm cheering two-thirds of the way. I will release Cheer Three when gender gaps close, limitations fade, and there is attention to problems of opportunity and inclusion throughout the year. I'm not holding my breath. But I have a bottle of vitamin water ready for a toast to gender equity, just in case."
Many reasons for gender gaps are well known and widely discussed (Kanter cites a few above, and the World Economic Forum issues a comprehensive annual report on gender gaps). But even educated, middle-class women are held back in the workplace. Sheryl Sandberg's recent call for women to Lean In is based on the assumption that what holds us back is mostly within ourselves: if we stop making excuses and instead give it our all, speak up, and negotiate better, i.e., if we lean in to our careers, we can become leaders in our professions. However, for most women fully leaning in is prohibitively expensive (see "The True Cost of Leaning In", which cites a figure of around $96,000 per year). So while I agree with Sandberg that women need to believe in and stand up for themselves, I also believe that the individual view needs to be enriched by a relational view. Women need to lean in, but they also need to reach out.
I want to concentrate here on two relational issues that I believe hold women back in the workplace. One is a lack of female role models, and the other is a lack of female networking.
Let's start with female role models. Although the proportion of female executives is now at 24% worldwide, there is a comparative dearth of success stories featuring women leaders in business. Earlier this year, Nitin Nohria, the Dean of Harvard Business School, made news when he pledged to more than double the proportion of case studies with female protagonists, from 9% to 20%, over the next 5 years, following a major debate on gender equity at HBS.
This resonates with my personal experience. Up until very late in graduate school, I had not had a single female role model. Granted, I grew up outside the US, and held my first jobs/went to school in countries where (at least at the time) women were extremely rare among executives and senior academics. When I accepted my first academic job in the US, I was one of two women in my department (the other woman was junior, too, and resigned when she had a baby). I went on to become the first woman ever to be tenured in my department (another woman has been tenured since, and the proportion of women in the department is now close to 30%, so that is cause for celebration). I've had the good fortune of having several amazing male role models and mentors over the years, and I am really thankful for that. I wouldn't have gotten ahead without them. But early in my career I definitely felt the absence of senior female colleagues I could use as a frame of reference. Anne-Marie Slaughter, in her critique of Sheryl Sandberg's "Lean In", writes:
"Young women might be much more willing to lean in if they saw better models and possibilities of fitting work and life together: ways of slowing down for a while but still staying on a long-term promotion track; of getting work done on their own time rather than according to a fixed schedule; of being affirmed daily in their roles both as parents and as professionals."
The second issue is female networking. Few would argue against the idea that networking is key to professional advancement and successful careers. However, while there is a rich history of "old boys' networks", much less is known about women's informal networks. My colleague Marta Gutman, an architectural historian, uncovered a fascinating exemplar in her new book "A City for Children". She shows how women in the US maintained active professional networks for 100 years that were directed at providing care for needy children (spoiler alert: those networks worked really well). Gutman views the success of these networks as a function of the highly gendered field they were embedded in - the social welfare economy.
Outside of female-dominated fields, networking is harder for women. This is not merely a question of numbers, but also a question of time, rooted in the way many women perceive work/life balance. As Arlie Hochschild described in her seminal book "The Second Shift", women perceive their "double day" (work and home) as an individual problem rather than the social problem it actually is, and the "supermom strategy" is for the working mother to do it all. To me, an obvious consequence of this strategy is that women try to find ways to save time during the work day, and one of the first things to be cut is time spent in informal situations, such as hallway chats, long lunches, or receptions.
But that is a short-sighted strategy, and it might be the very reason we tread in place. We have to tune out the ticking of the babysitter clock and instead make conscious room for networking in our schedules. Networking is not a waste of time. It is a way out of the still gaping gender gap in the workplace. So here is my wish for International Women's Day 2014: Let's make a commitment to not only lean in, but also to reach out.
2 March 2014
We omitted one election from our set, Turkey's August presidential election:
Turkey - incumbent party lose - 75.1%