
    4 August 2008

    Save the data! the Dataverse Network initiative

    I recently made an effort to track down the data from Theodore Newcomb's classic work on The Acquaintance Process. These were network and attitudinal data he collected from incoming students at the University of Michigan in the 1950s. The data were unique and costly to collect. The network data were preserved by a student in a dissertation, and are now immortalized (?) in the data included with the UCINET software. However, to someone like me interested in how networks and people co-evolve (see my 2001 JMS paper on "The coevolution of individual and network"), having the attitudinal data over time is what made the data valuable.

    After having retrieved the same dissertation, and spoken to Ken Frank, who had had similar thoughts some years ago and had actually dug through archives at Michigan, my sad conclusion is that these rich, unique, expensive, irreplaceable data are simply gone. (I am happy to be disabused of this notion if someone can prove I am wrong.)

    This was terribly unfortunate, from my immediate perspective, because I needed to replicate some findings from longitudinal data I had recently collected on networks and political attitudes, and this was one of the few data sets that fit the bill (in fact, I have not found _any_ other good longitudinal data sets on whole networks and political attitudes; again, I am happy to find out that I am wrong about that).

    The broader lesson I want to convey is that it is important to make your data publicly available. And I think this is especially important for (generally) micro, whole network data sets, because there are necessarily concerns about replication and external validity. There are concerns about replication because of the possibility, especially with limited N's, that results were over-fitted to the particular data set. There are concerns about external validity because, even if the results are a fair representation of the data, there may be some quirk of the particular setting driving the results.

    So: it would be enormously valuable to be able to reach into a vast archive of all social network data sets ever gathered and, given appropriate variables, etc., replicate a set of findings (or not) across multiple data sets. (In fact, I think that such built-in replication should be de rigueur in published studies of micro settings. But that's a story for another day.)

    That is simply not possible now. And data are dying every day, as files get thrown out, people's memories fade, etc. Say, if one went to social network research published in ASQ from the 1990s, how many of those data sets are publicly available? I would bet close to zero. How many could be retrieved in some fashion and made available? Not many, I think. And this would be far worse as one looks further back in time.

    There are, of course, some centralized depositories of social science data (most notably, ICPSR). But IQSS has recently developed a new model for preserving data, which centers on creating an infrastructure (the "Dataverse") that allows decentralized mechanisms for data sharing, and one that I think is "incentive compatible" in that it gives people some credit for creating a public good. (In fact, more should be done to recognize good data, and people who collect good data; again, a story for another day.)

    [An addendum to the original post: the Network Workbench initiative offers another bottom-up model for saving network data, via a wiki.]

    The remainder of this post is excerpted from an e-mail from Gary King, describing the project:

    For those who have collected research data and made it available to others, it's nice when people thank you. But it would be nicer to receive formal scholarly citation credit and web visibility for your hard work. The Dataverse Network project is designed to get you that credit and visibility.

    The idea is to give you a free "dataverse" (your view of the universe of data) -- which is a virtual archive where you can store, permanently preserve, and distribute your data (or list data from other dataverses) with everyone or only those you approve. Your dataverse is branded as yours, with the look and feel of your web site and on your web site, but since it is served out by an installation of the Dataverse Network at Harvard you needn't install any software or hardware. Some other features include:

    * Safe and permanent data storage in preservation format branded as yours.
    * No need to translate data when statistical software formats change.
    * Can be easily re-branded if you move institutions, but either way will never be lost.
    * Formal citation credit for your data, including a globally unique identifier and universal numeric fingerprint.
    * Establish an unbreakable link between your data and related published work.
    * Easy ways for others to find your data and associated scholarship.
    * Share your data with everyone, or those who sign your licensing agreement, or only individuals or groups you approve.
    * Allow users to subset, recode, and download your data in any format.
    * Run many advanced statistical methods via a GUI on-line.

    An interesting but under-appreciated fact is that if you are at an institution that receives federal funding, and you share research data or put it on your web site without prior IRB approval, you are violating federal regulations. (This includes any research data, even that compiled from information in the public domain, from IRB-approved research protocols, or from any other source.) To avoid this problem, the Dataverse Network has automated the IRB data approval process, and so, if you have a dataverse, in most cases going to the IRB is unnecessary.

    For an example, go to my homepage at http://gking.harvard.edu and click on dataverse. To get your own dataverse, go to the IQSS Dataverse Network, http://dvn.iq.harvard.edu. For more information on our open source Dataverse Network project, see http://TheData.org.
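
    As an aside on the "universal numeric fingerprint" in the feature list above: the appeal of such a fingerprint is that it is tied to the data values themselves rather than to whatever file format they happen to be stored in, so a citation can still be checked after the data have been converted. The sketch below is not the Dataverse Network's actual algorithm. It is a toy illustration that assumes each value is rounded to a fixed number of significant digits and then hashed; the name `toy_fingerprint`, the choice of SHA-256, and the default of seven digits are all assumptions made here for illustration.

```python
import hashlib


def toy_fingerprint(values, digits=7):
    """Toy, format-independent fingerprint (NOT the real UNF algorithm).

    Each number is written in a canonical exponential form with `digits`
    significant digits, and the canonical strings are hashed together.
    """
    h = hashlib.sha256()
    for v in values:
        # ".{digits-1}e" gives one digit before the point plus digits-1 after
        canonical = format(float(v), f".{digits - 1}e")
        h.update(canonical.encode("ascii"))
        h.update(b"\n")  # separator so adjacent values cannot run together
    return h.hexdigest()


# The same three values, as they might be read from two hypothetical formats.
from_text_file = [float(s) for s in "1.5,2.25,3".split(",")]
from_binary_file = [1.5, 2.25, 3]

same = toy_fingerprint(from_text_file) == toy_fingerprint(from_binary_file)
changed = toy_fingerprint([1.5, 2.25, 3.1]) != toy_fingerprint(from_binary_file)
```

    Here `same` is true because both inputs carry the same values, and `changed` is true because one value was altered. That is the property that would let a reader confirm that a data set in hand is the one a paper cites.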

    Posted by David Lazer at August 4, 2008 10:09 PM