Where are the data? Can I have them? What can I do with them?
There's a new subversive proposal in town. The original was Stevan Harnad's landmark call for self-archiving of the scientific ("esoteric") literature (see here for a ten-year update, and here for context). Now, 12 years later, Open Access is gathering momentum and forward-looking advocates of knowledge as a public good are thinking about Open Data (some extra background here). Peter Murray-Rust recently stepped up with a subversive proposal of his own:

The simplest thing that researchers can do [to promote Open Data] is to add a Creative Commons license to their data. It costs nothing, is a simple cut-and-paste, and could be trivially made a template in any data production tool. [...]

I think Peter's proposal is a good one, similar in form and effect to the SPARC author addendum. Importantly, Science Commons also offers author addenda, and will soon offer them in the machine-, human- and lawyer-readable versions that come with all Creative Commons licenses; as Peter notes, the machine-readable version is crucial to full Open Data utility. Use of the proposed Open Data addendum (in combination, where necessary, with an Open Access addendum) would clarify the legal status of an author's data, provided we get the wording right. Herewith some thoughts on how to do that, based on the questions in the title.

First, note that papers do not usually contain raw (useful, usable) data. They contain, say, graphs made from such data, or bitmapped images of it -- as Peter says, the paper offers hamburger when what we want is the original cow. Chris Surridge of PLoS puts it this way:

So if authors want to make their data openly and usefully available, they will need to host it themselves or find someone to host it for them. Many journals will host supplementary information, and many institutional repositories will take datasets as well as manuscripts.

I have been saying for some time that it should by now be de rigueur to make one's raw data available with each publication.
This is very rarely done -- even supplementary information, when I have come across it, tends to be of the hamburger-rather-than-cow variety and so not very useful. (The situation speaks sad volumes about the emphasis on competition over cooperation within the scientific community and, perhaps in many cases, about the quality of the raw data in question, if only one were ever able to see it; but I digress.) Thus an effective Open Data addendum will first have to answer the question: where are the data?

Second, there is the issue of licensing ("Can I have them? What can I do with them?"). In comments on Peter's proposal, Jonathan Eisen observes that publishing in Open Access journals should provide open access to data as well. Peter replies that this is not always the case and points to Molbank as a problematic example, because they require a copyright transfer and it is simply not clear what rights they claim over raw data.

In fact, the situation is even worse. In the same entry, Peter points approvingly to the BioMed Central OA charter, which is based on the Bethesda Statement:

Every peer-reviewed research article appearing in any journal published by BioMed Central is 'open access', meaning that:

But what does that mean for Open Data? Take any paper in any BMC journal: where are the data? Can I have them? What can I do with them? It's true, but simply not enough, that authors who have published in BMC are probably amenable to giving me the data and allowing me to do with them as I please. I need unfettered access to the data at the same time as I access the paper. Even as a human I don't have time to chase down permission for every dataset I want to re-use, and if I'm data-mining by web crawler I need machine-readable licenses that tell my robot what it can have.
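To make the robot's side of that concrete, here is a minimal sketch of what a crawler might do with a machine-readable license. It assumes the page marks its license with a rel="license" link, which is how the HTML snippets that come with Creative Commons licenses do it. The names `REUSABLE_PREFIXES`, `find_licenses` and `may_reuse` are hypothetical, and the allow-list is only an illustration, not anyone's actual policy.

```python
from html.parser import HTMLParser

# Hypothetical allow-list: license URL prefixes this crawler treats as
# permitting re-use. A real crawler would need a considered policy here.
REUSABLE_PREFIXES = (
    "http://creativecommons.org/licenses/by/",
    "https://creativecommons.org/licenses/by/",
)


class LicenseFinder(HTMLParser):
    """Collect the href of every <a> or <link> tag marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "license" in rel_tokens and attr.get("href"):
            self.licenses.append(attr["href"])


def find_licenses(html):
    """Return all license URLs declared in the given HTML text."""
    finder = LicenseFinder()
    finder.feed(html)
    return finder.licenses


def may_reuse(html):
    """True if any declared license matches the (hypothetical) allow-list."""
    return any(url.startswith(REUSABLE_PREFIXES) for url in find_licenses(html))
```

Note what the sketch cannot do: a license link on the article page says nothing about whether it also covers the raw data, or where those data live. The robot can only act on what the page actually states, which is exactly why the wording of an addendum matters.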
Policies regarding data and materials are journal-specific within the BMC group, but I browsed a few and it seems they all use a standard template, which includes the following:

Submission of a manuscript to [BMC Journal in question] implies that readily reproducible materials described in the manuscript, including all relevant raw data, will be freely available to any scientist wishing to use them for non-commercial purposes. Nucleic acid sequences, protein sequences, and atomic coordinates should be deposited in an appropriate database in time for the accession number to be included in the published article. In computational studies where the sequence information is unacceptable for inclusion in databases because of lack of experimental validation, the sequences must be published as an additional file with the article. [There follows a list of databases that can be used to deposit nucleotide and protein sequences and structures, chemical structures and assays, microarray data, computer models and plasmids.]

Note though that these policies are not strict demands, and I'll bet they are not policed in any way. I think most journals include similar language in their instructions to authors, and have done for some time, but we still do not have widespread Open Data.

Further, the actual BMC license (which BMC says is identical to the Creative Commons Attribution License) refers only to "the work" which it defines as "the copyrightable work of authorship offered under the terms of this License". That seems to me to allow an interpretation that excludes data, which sit in the grey zone between creative works that can be copyrighted and, er, things (like gene sequences and chemical structures of drugs) that can be patented.

So how about Public Library of Science and Hindawi, the other major OA publishers? Well, Hindawi seems to say nothing about data whatsoever, only that authors retain copyright and articles are published under a CC Attribution license.
PLoS also publishes everything under a CC Attribution license, which says nothing about data, but if you dig a bit you find encouraging things in the editorial/publishing policies:

Publication is conditional upon the agreement of authors to make freely available any materials and information associated with their publication that are reasonably requested by others for the purpose of academic, noncommercial research.

That's better, stronger language -- but why is there no mention of data in the actual license, and why is there a need for warnings about restrictions that "might be judged to diminish the significance, etc" if publication is truly conditional on open access to data? I suspect another toothless tiger. It's not that I want the tiger to have teeth, that is, for journals to actively police data availability, but that I wonder why I have to go digging around the website just to find this wishy-washy nod in the general direction of Open Data. To illustrate my point here, suppose I read a paper in PLoS Biology, and I want to get my hands on some raw data from that paper: where are they? Can I have them? What can I do with them? All of these things are, basically, left up to the authors.

Now remember that these highly unsatisfactory examples are drawn from the most prominent Open Access publishing houses, which might be expected to be much more supportive of Open Data than

To be effective, then, an Open Data addendum must at least answer my opening questions: it must point to the online, freely accessible location of the raw, un-hamburgered data; it should make clear that yes, you can have them; and it should state clearly what you can do with them. The last question probably requires the creation of multiple addenda, since some people (like Jonathan Eisen) will want to effectively copyleft their data, whereas others will prefer less restrictive licenses.
My preferred answer is "anything you want, so long as you do not remove information or materials from the scientific commons". So, finally, let me take a stab at a draft Open Data addendum. This is

AUTHOR'S ADDENDUM TO PUBLICATION AGREEMENT

That's not perfect, not by a long shot -- most especially not for automated data mining, which requires machine-readable metadata and data. It should, however, do what Peter suggests: provide some relief from endless rounds of find-the-permissions, and get a much-needed conversation underway.

Comments

I really think this is a great idea and is truly in the spirit of Open Science. There have been a variety of attempts to do this for some types of data that still need more work. For example, for DNA sequencing, Genbank is considered by many to be this raw data database for sequences. While Genbank is a phenomenal resource, it does not actually contain the raw data on sequences. The raw data come in the form of the results of the sequencing experiments themselves (e.g., electropherograms). The data in Genbank are a model/interpretation of the raw data. The distinction here is important, as not all the bases in a DNA sequence in Genbank are of equal quality. And when you are using sequencing to identify subtle differences within or between species, the quality is really important. There is a place for the raw data for DNA sequencing - it is called the Trace Archive and it is something NCBI has set up. But not everyone deposits their data there. But at least there is a place for it (and the journals never ask for this to be put in Supplemental Data). But they ask for all sorts of things to be put there, and I did not realize until your blog how this is in a way an insidious idea of the journals. Not only do they get the rights to the paper, which they do not deserve, they sometimes get control over the raw data. Keep pushing this - I will try to help.

Interesting discussion of Open Data.
I like CC licenses, but they may not work well for "data". Many scientific projects collect pretty "objective" factual data with little expressive content. Copyright only protects expression, and since Creative Commons licenses only apply to copyright-protected works, these licenses won't work with many scientific datasets. A CC license won't give contributors a legal assurance that they will be cited and attributed for data. Even if a dataset is pretty "expressive" (copyrightable and CC licensed), simple attribution may not be enough. Papers are the currency of achievement, not raw data. Making raw data available may put you at a competitive disadvantage, especially if you are less certain about being attributed for it.

Nice article!
Great discussion. With the current journal publishing system, at least in organic chemistry, the raw data are not actually available through the publisher and thus are not part of that copyright. Supplementary data rarely contain actual raw data. The last time I looked up Supplementary Info in JACS, all I got was a table of yields. What I wanted was NMR and chromatography data. For it to be really useful you have to be able to interact with the raw data in the same way that the researcher did to arrive at their conclusions. For example:
http://usefulchem.blogspot.com/2006/11/jspecview-demo.html
The way I look at it, I may filter, summarize and integrate the raw data when I publish in a journal but I never give it away. I don't hand over the copyright to my lab notebooks. But I can reference a lab notebook page to support a statement in my article, including the experimental section. Using a wiki makes it easy to reference a specific page version. At least that is what I will try. I'll let you know if the editors find that acceptable. If not then I think we've come to a fork in the road of scientific communication.