20 December 2014
I hope you will forgive a personal digression....
Yesterday a high school teacher and friend of mine, Melanie Krieger, died after an extended illness. I am very sad about this, but her life does offer a chance to reflect on how a life, well lived, continues to provide gifts far beyond its source. Melanie's life offers us much to think about: about how to live, about how our lives are tied invisibly together, and about how education should be done in the 21st century.
Melanie's claim to fame was her success at supervising independent research in high school. She began teaching "Westinghouse prep" shortly after I graduated from Ward Melville (a high school on Long Island). Since that time, Ward Melville has been a powerhouse in elite science competitions, outpacing far more famous high schools. Further, that success has proved contagious, as other high schools on Long Island have emulated Ward Melville, with a surge of Westinghouse/Intel prize winners since Melanie began teaching (see figure).
We often wring our hands about the teaching of science in this country--what can we learn from this success? As I see it, there are four lessons:
It takes a relationship between teachers and students. It sometimes seems that teachers are viewed as content delivery systems: assign students to teachers, run through the curriculum, apply tests, repeat. The creative process does not fit this model. Among other things, this paradigm ignores the importance of knowing the child, and of building a relationship between children and teachers. Melanie knew her students, their strengths and their weaknesses, and they knew and trusted her. The fact that I was friends with her 30 years after taking a single class with her is testimony to that.
It takes a bulldozer--someone who understands the system and can guide students through it. How does one submit a science project to a competition? What are the idiosyncrasies of form that judges for a particular competition might be looking for? Melanie was such a bulldozer, allowing the students to focus on the substance while guiding them through the form and the forms.
It takes a network. The success of the program at Ward Melville was grounded in the bridges Melanie built to faculty mentors at Stony Brook University and in the proximity of their labs. This does not just happen--I live in a town outside Boston with a fine high school and one of the highest densities of academics of any town in the world, and there is nothing like what existed at Ward Melville. The role she played was that of a catalyst, fostering a virtuous cycle of mentorship, of indirect reciprocity among many parents and children.
It takes values. Ours is not a society that glorifies intellectual achievement, and intellectuals are typically not at the top of the high school food chain. Competitions, both local and national (such as Intel and Siemens), offer opportunities to recognize scientific achievement by high school students; investments by schools in programs such as "Westinghouse Prep" in part nudge our culture toward valuing science and scientific discovery.
In the 9 years since Melanie retired, a conservative estimate is that there have been about 27 finalists in the Intel science competition from Long Island attributable to the Krieger effect. The question we can ask as a country is how we can spark such creativity in our high schoolers nationwide. The question we can ask ourselves is how we leave a legacy as rich as Melanie's.
1 December 2014
Nigeria - incumbent party win - 95.4%
Uzbekistan - incumbent party win - 91.6%
Togo - incumbent party win - 85.6%
Sudan - incumbent party win - 97.0%
Notes: In this particular cycle of elections, we are primarily dealing with countries that are not democratic. The probabilities for Uzbekistan and Togo are somewhat lower than we might expect because of the lack of credible polling in these areas. In Sudan, some major opposition parties have already announced that they will boycott the election.
Ref: C1, DM2, and DP18
2 November 2014
Sorry, belated by a few hours this month, but everything is reproducible for November 1.
Uruguay - incumbent party win - 83.6%
Namibia - incumbent party win - 54.2%
Romania - incumbent party lose - 75.2%
Tunisia - incumbent party lose - 100%
Nigeria - incumbent party win - 95.0%
We again place Tunisia at 100% confidence of the incumbent party losing because of the dissolution of the incumbent party. We have had difficulty finding polling data on the Namibian election, which is the reason for the low confidence in the expected result.
Our confidence in the Nigerian election has increased due to the current office holder's announcement that he is running and early polling in his favor. We have started tracking some additional elections that will occur in the next six months, but are holding off predictions until candidates are named.
Ref: C1, DM2, and DP17
31 October 2014
As we get to the last few hours of October, I am pleased to announce that we hit a new record for number of subjects in a month-- at 11,000+! I would wager that is more than all other behavioral research labs in Boston put together in October. Please do help us continue recruiting subjects, and in the not too distant future we will be recruiting more researchers to conduct experiments as well.
Google just released a new version of Google Flu Trends (GFT). GFT, as readers of this blog likely know, is an effort by Google--launched with a paper in Nature--to track the flu based on search terms. The idea is that when lots of people are sick with the flu, there are more searches for things like "cures for the flu." The project has come to represent the possibilities and foibles of "big data," and along with coauthors I critiqued GFT in a paper this past March. GFT had been missing by large margins for a number of years, and we identified a number of major statistical problems (most importantly, that GFT added only incrementally to lagged CDC data, and should have integrated those data into its projections), but the most critical issue was that the methodology and data are opaque.
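To make that statistical point concrete, here is a minimal sketch of the kind of hybrid model our critique argued for--regressing current flu prevalence on both a search signal and the lagged CDC series, rather than on search alone. The data are entirely synthetic, and the model form is my own illustration, not Google's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weekly data (illustrative only, not real CDC or Google numbers):
# true flu prevalence follows an AR(1) process, search volume is a noisy
# reading of current prevalence, and CDC reports arrive two weeks late.
n_weeks = 200
flu = np.zeros(n_weeks)
for t in range(1, n_weeks):
    flu[t] = 0.8 * flu[t - 1] + rng.normal()
search = flu + rng.normal(0, 1.0, n_weeks)
cdc_lagged = np.roll(flu, 2)  # what the CDC has published by week t

def ols_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

train, test = slice(2, 150), slice(150, n_weeks)

# Search-only model (the original GFT design)...
pred_search = ols_predict(search[train, None], flu[train], search[test, None])

# ...versus a hybrid model that also integrates the lagged CDC series.
X_hybrid = np.column_stack([search, cdc_lagged])
pred_hybrid = ols_predict(X_hybrid[train], flu[train], X_hybrid[test])

mse_search = float(np.mean((pred_search - flu[test]) ** 2))
mse_hybrid = float(np.mean((pred_hybrid - flu[test]) ** 2))
print(f"search-only MSE: {mse_search:.3f}, hybrid MSE: {mse_hybrid:.3f}")
```

The point of the sketch is structural, not the particular numbers: the lagged official data carry real signal, so a model that discards them is leaving information on the table.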
So: what about the new version of GFT? Well, there's good news, and there's bad news. The good news is that the new method claims to take "official CDC flu data into account". The (really) bad news is that the methodology is now much more opaque. As of today, there is no accounting whatsoever of how these numbers are generated, and because those numbers are now an unknown and perhaps dynamic mix of search and CDC data, third parties can no longer mash up the GFT data with other types of signals of flu prevalence.
Why not share the underlying data streams of the 50 or so GFT search terms? Surely the community of data scientists, researchers, and the like could do something valuable with these data. The answer from the project lead at Google, Christian Stefansen, as quoted in the Wall Street Journal: "We would love to, but if we were to do that, it would be easy for someone to game the model.... We're at this intersection between providing a service for free and making it researchable, so we're trying to strike the best of both worlds."
Here is my proposal to Google: don't give the research community the GFT search terms. But give us counts for the next 100, at the state level. Make it a contest to see which teams do the best, by some reasonable criterion. Then take some of the top methodologies used to aggregate those terms and apply them to the core GFT search terms. This harnesses the crowd while allowing the core set of terms to remain hidden. What is the downside? Because the data would be aggregated at the state level, no privacy issues would be implicated, and the incremental leakage of proprietary information about how the Google algorithm works would, arguably, be quite tiny. The upside is clear--hundreds of able minds competing to improve GFT. Further, releasing information at the state level would allow finer-grained projections of flu prevalence, which would be much more valuable for policy makers and for modeling efforts projecting into the future.
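As a sketch of what "some reasonable criterion" might look like, a contest could score each team's weekly state-level predictions against the CDC figures released after the fact, e.g. by mean absolute error. All team names, states, and numbers below are hypothetical:

```python
def score_team(pred_by_state, actual_by_state):
    """Mean absolute error across all states and weeks (lower is better)."""
    errors = []
    for state, actual in actual_by_state.items():
        pred = pred_by_state[state]
        errors.extend(abs(p - a) for p, a in zip(pred, actual))
    return sum(errors) / len(errors)

def rank_teams(predictions, actual):
    """Return team names ordered best (lowest error) first."""
    return sorted(predictions, key=lambda t: score_team(predictions[t], actual))

# Toy example: two states, three weeks, two competing teams.
actual = {"MA": [1.2, 1.5, 2.0], "NY": [0.9, 1.1, 1.6]}
predictions = {
    "team_a": {"MA": [1.0, 1.4, 2.1], "NY": [1.0, 1.0, 1.5]},
    "team_b": {"MA": [2.0, 2.5, 3.0], "NY": [0.1, 0.2, 0.3]},
}
print(rank_teams(predictions, actual))  # → ['team_a', 'team_b']
```

Any number of criteria (peak timing, correlation, calibration of uncertainty) could be substituted; the essential ingredients are public state-level inputs and a scoring rule fixed in advance.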
So, Google: how about it?
1 October 2014
Here are our October predictions:
Brazil - incumbent party win - 91.4%
Bosnia and Herzegovina - incumbent party lose - 59.2%
Bosniak Election - incumbent party lose - 53.0%
Croat Election - incumbent party win - 90.5%
Serbian Election - incumbent party win - 83.7%
Bolivia - incumbent party win - 93.5%
Mozambique - incumbent party win - 89.5%
Uruguay - incumbent party win - 82.3%
Namibia - incumbent party win - 53.9%
Romania - incumbent party lose - 75.3%
Tunisia - incumbent party lose - 100%
Nigeria - incumbent party win - 59.1%
Yemen - incumbent party win - 80.4%
Here are the notes for this month's predictions. First, we are still splitting up Bosnia and Herzegovina. While last month we were able to access some polling data that disaggregate the complex structure of the election, it should be noted that those data indicate very close races (all within the margin of error) and a high level of undecided voters. This means our predictions are probably overconfident.
Second, we again place Tunisia at 100% confidence of the incumbent party losing because of the dissolution of the incumbent party.
Finally, the Nigeria and Yemen election predictions should be considered very tentative, since they are being made so far out from the election.
Ref: C1, DM2, and DP16
1 September 2014
This month, we are only posting one set of predictions. The beta version 2.0 has been doing well enough that we think it is time we move over to it.
Brazil - incumbent party win - 51.8%
Bosnia and Herzegovina - incumbent party lose - 59.3%
Bosniak Election - incumbent party lose - 50.7%
Croat Election - incumbent party win - 90.7%
Serbian Election - incumbent party win - 84.0%
Bolivia - incumbent party win - 93.6%
Mozambique - incumbent party win - 89.8%
Uruguay - incumbent party win - 83.3%
Namibia - incumbent party win - 53.3%
Romania - incumbent party lose - 75.6%
Tunisia - incumbent party lose - 100%
Nigeria - incumbent party win - 57.1%
Yemen - incumbent party win - 78.8%
Note, we again place Tunisia at 100% confidence of the incumbent party losing because of the dissolution of the incumbent party.
Finally, the Nigeria and Yemen elections have some uncertainty in their coding because it is not yet clear who will run. In Nigeria, for example, the incumbent has stated that he will not run, but is facing political pressure to change his mind.
Ref: C1, DM2, and DP15
12 August 2014
Please come by Northeastern next week for this talk by Brian Granger, especially if you are interested in IPython/Jupyter.
Open, Reproducible and Exploratory Data Science
Professor Brian Granger
Physics, Cal Poly State University
Lead Developer and Co-Founder, IPython and Jupyter Projects
Center for Complex Network Research (CCNR), Northeastern University
3:30 - 5:00pm, Thursday, August 21
Data science involves the application of scientific methodologies to data-driven computations across a wide range of fields. As Drew Conway has clarified, it sits at the intersection of hacking/programming, math/statistics, and domain-specific expertise. Because data science is data- and computing-centric, it requires powerful software tools. In this talk I will describe open source software tools for data science that i) are built with open languages, architectures and standards, ii) promote reproducibility and iii) are optimized for exploratory data analysis and visualization.
In particular, I will describe the Jupyter Notebook (formerly named IPython), an open-source, web-based interactive computing environment for Python, R, Julia and other programming languages. The Notebook enables users to create documents that combine live code, narrative text, equations, images, video and other content. These notebook documents provide a complete and reproducible record of a computation, its results and accompanying material and can be shared over email, Dropbox, GitHub or converted to static PDF/LaTeX, HTML, Markdown, etc. Most importantly, the Jupyter Notebook is built on top of an open architecture for interactive computing that is completely language neutral, allowing it to serve as a foundation for other data science projects and products.
Throughout the talk, I will provide examples of how IPython is being used across a wide range of fields including science, engineering, social sciences, finance, computer science, industry, publishing and journalism. Jupyter/IPython is funded through the Alfred P. Sloan Foundation, the Simons Foundation, the National Science Foundation, Microsoft and Rackspace.
Brian Granger is an Associate Professor of Physics at Cal Poly State University in San Luis Obispo, CA. He has a background in theoretical atomic, molecular, and optical physics, with a Ph.D. from the University of Colorado. His current research interests include quantum computing, symbolic computer algebra, parallel and distributed computing, and interactive computing environments for scientific computing and data science. He is a lead developer on the IPython project, a co-founder of Project Jupyter, the creator of PyZMQ, and an active contributor to a number of other open source projects focused on scientific computing in Python. He is @ellisonbg on Twitter and GitHub.
1 August 2014
As with last month, there are two sets of predictions this month. The first uses our version 1.0 model--the same model we have used for previous predictions--and will serve as a reference point for version 2.0 (which also uses slightly updated data as input, hence DM1 vs. DM2):
Model Version 1.0 Predictions
Turkey - incumbent party win - 98.6%
Brazil - incumbent party win - 98.8%
Bosnia and Herzegovina - incumbent party win - 82.9%
Bosniak Election - incumbent party win - 96.8%
Croat Election - incumbent party win - 74.9%
Serbian Election - incumbent party win - 74.9%
Bolivia - incumbent party win - 99.7%
Mozambique - incumbent party win - 99.6%
Uruguay - incumbent party win - 98.7%
Namibia - incumbent party win - 56.0%
Romania - incumbent party lose - 80.9%
Tunisia - incumbent party lose - 100%
Ref: C1, DM1, and DP14
Model Version 2.0 Predictions
Turkey - incumbent party win - 83.4%
Brazil - incumbent party win - 91.5%
Bosnia and Herzegovina - incumbent party lose - 58.2%
Bosniak Election - incumbent party win - 66.6%
Croat Election - incumbent party lose - 60.8%
Serbian Election - incumbent party lose - 60.8%
Bolivia - incumbent party win - 93.0%
Mozambique - incumbent party win - 89.4%
Uruguay - incumbent party win - 81.3%
Namibia - incumbent party win - 53.7%
Romania - incumbent party lose - 75.8%
Tunisia - incumbent party lose - 100%
Ref: C1, DM2, and DP14
Here are the notes for this month's predictions. First, Tunisia is set to a 100% incumbent party loss probability because the incumbent's party was disbanded. Second, results for Bosnia and Herzegovina are difficult to interpret because three presidents are being elected. We have attempted to break the race down by individual presidential election, but all of these results should be taken as tentative until we have public opinion data for the elections. We expect to receive better data in the next few weeks, which should help both models discriminate among these elections. Third, the prediction for Turkey has shifted because of the release of public opinion polls showing the incumbent party's candidate with a strong lead.
2 July 2014
Many/most/all readers of this blog have now heard of the Facebook emotion contagion study published in PNAS last week. Briefly: Facebook researchers, in collaboration with scholars at Cornell and UCSF, experimentally manipulated the algorithm that determines the subset of posts you see on Facebook, such that some people saw more positive posts, and others more negative posts. Their finding was, roughly, that negativity begets negativity, and positivity begets positivity. This paper has gone through a remarkably fast cycle, from "isn't that interesting, but a bit creepy," to methodological critiques ("they're not really measuring emotion"), to a vast outcry of "Facebook is unethically manipulating our emotions!"
This post is not a commentary on the science, the ethics of this study, when consent is required, the structure of ethical self-regulation (via IRBs) in the US versus other countries (which usually have no equivalent of IRBs), or the generally important question of the implications of our increasingly algorithmically organized societies. These will be subjects for future posts, and of many future classroom discussions I will have with doctoral students about research ethics. Rather, my concern right now is that this event has the potential to damage our collective capacity to create knowledge about human society, because of the potential for public relations fiascos for companies. Of course, knowledge production will continue regardless, but perhaps it will all be safely proprietary, within the research departments of companies. Such an outcome would be terrible, not only for our collective understanding of human society, but also for these companies, because, paradoxically, participation in vigorous public intellectual debate is important for the capacity to develop proprietary knowledge. Knowledge does not grow in hermetically sealed silos, and it is no coincidence that our creative industries have grown up in close proximity to universities, which at their best are highly permeable intellectual hothouses.
I'd therefore like to make a modest proposal about academic-industry cooperation: companies like Facebook should create opt-in experimental panels, with an initial consent that is clear, short, transparent, and in-your-face, yet fairly general and flexible about the ways their sociotechnical environment would be (modestly) experimentally varied. (If certain experiments exceeded those parameters, an additional consent could be required for those specific experiments.) Upon completion of a study, participants would be informed of the study, with a plain-English explanation of the findings, as well as access to subsequent publications. Indeed, I'd note that my team has created a platform along this model, Volunteer Science, which is partially built on top of the Facebook API. Our challenge is building a user base. Facebook would not have a problem building a volunteer army to help out science--they could have a million recruits tomorrow.
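The logic of the proposal--a broad up-front consent, a per-study opt-in when an experiment exceeds its scope, and a post-study debrief--can be sketched in a few lines. All class and field names here are my own illustration, not an actual Volunteer Science or Facebook API:

```python
from dataclasses import dataclass, field

@dataclass
class Panelist:
    user_id: str
    general_consent: bool = False                      # broad, up-front opt-in
    extra_consents: set = field(default_factory=set)   # per-study consents
    debriefs: list = field(default_factory=list)       # post-study explanations

@dataclass
class Study:
    study_id: str
    within_general_scope: bool  # stays within the general consent's parameters?

def may_enroll(p: Panelist, s: Study) -> bool:
    """Enroll only if the study fits the general consent, or the panelist
    gave an additional study-specific consent."""
    if not p.general_consent:
        return False
    return s.within_general_scope or s.study_id in p.extra_consents

def debrief(p: Panelist, s: Study, plain_english_summary: str) -> None:
    """After a study completes, inform participants of the findings."""
    p.debriefs.append((s.study_id, plain_english_summary))

# A study exceeding the general consent requires an explicit extra opt-in.
alice = Panelist("alice", general_consent=True)
strong_manipulation = Study("emotion_v2", within_general_scope=False)
assert not may_enroll(alice, strong_manipulation)
alice.extra_consents.add("emotion_v2")
assert may_enroll(alice, strong_manipulation)
```

The mechanism is deliberately simple: the expensive parts of the proposal are institutional (recruiting the panel, writing honest consents and debriefs), not technical.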
I don't claim this is a cure all, but it would cure a lot--indeed, I think the entire current mess would have been avoided if the research had been done on such a volunteer base.
I'd note that Facebook and the like would (and will) continue to do A/B testing, and generally experimentally tweaking their algorithms in ways that (1) create variations in individual experience, and (2) have potentially important consequences, individually and collectively. This should be vigorously studied by scholars, and debated and scrutinized in the broader society. But the issue of whether and how a company like Facebook can participate in academic research, and in particular conduct field experiments, is actually solvable.