Opening Up Data Access, Not Just Articles
If you’ve been paying attention, you’ll have noticed that we just published an interesting Perspective in PLOS Biology from Dominique Roche and colleagues that provides some practical hints on how to improve public data archiving for scientific research.
And if you’ve been even more on the ball, you’ll also have seen the recent announcement of PLOS’ new Data Policy and subsequent Update on the PLOS website.
The new Data Policy will be implemented for manuscripts submitted on, or after, March 1st. The main change is that all PLOS journals will require that all manuscripts have an accompanying data availability statement for the data used in that piece of research. We’re well aware that this may prove to be a challenge, but we think that this thorny issue needs to be tackled head-on. Ultimately, an Open Access paper for which the underlying data are not available doesn’t make a whole lot of sense.
Roche and colleagues raise some important and interesting points in their Perspective and do a fine job of detailing the benefits to the scientific community of making data available. But the eagle-eyed will note an incongruity: they suggest that a longer embargo period might be necessary before data need to be made available for some subjects, whereas the PLOS policy won’t make that distinction.
We don’t all have to agree here, and for the short term this may mean that some choose to send their research somewhere that permits them to keep their data under wraps. But funding agencies are also moving more towards our viewpoint, implementing requirements that data be made available. Whether researchers like it or not, this is something that needs to be addressed; it’s time to start ensuring there are better lab, university and institution practices for the storage and archiving of pertinent data.
If what we really want to see is optimal advancement of science, then open access to research means open access to as much as possible associated with the paper and not just the paper itself. What should such openness include? Well – probably everything – from methods to code to materials to equipment. But without a doubt a key component of openness is access to the data behind a study. Access to data facilitates reproducibility and testing of a paper’s conclusions and methods, and also enables new discoveries to be made without the expense of redoing the experiments. We believe that the more open we all are about open data, the more we discuss the benefits and challenges, and the more we shift the bar towards openness, the better off all of science will be.
Roche DG, Lanfear R, Binning SA, Haff TM, Schwanz LE, et al. (2014). Troubleshooting Public Data Archiving: Suggestions to Increase Participation. PLoS Biology, 12 (1): e1001779. DOI: 10.1371/journal.pbio.1001779
More posts on PLOS Biologue about data:
“Dude, where’s my data?” by Roli Roberts
“Improving data access at PLOS” by John Chodacki
“Dealing with data” by Theo Bloom
It’s the first time I’ve come to this website, and I find this idea really interesting. I’m not in biology; I’m a sociologist, and I wonder how we can apply this in social studies. We have statistical data, but we also have textual data (e.g. from interviews) which might be delicate to publish (because of anonymity). So is it possible to push this idea further?
Interesting blog post by PLoS Biol editors – thank you for commenting on our Perspective piece.
We fully agree with PLoS’s Data Policy that papers, data and code should ultimately all be open-access. However, we think the blog misses a key point in our paper. Many scientists remain reluctant to share their data (see details in the Perspective), for a variety of real and perceived reasons. In addition, journals and funding agencies lack sufficient resources or motivation to check the quality of archived data. As a result, the response to inflexible policies on archiving may not be that authors go to different journals, but that archived data are frequently incomplete (intentionally or unintentionally), and/or in a format that makes them difficult, if not impossible, to re-use (for a good example, see the link at the bottom of this post). This is a burden rather than a benefit to the system. We strongly believe that bigger ‘carrots’, not just ‘sticks’, are needed to encourage wider acceptance of data sharing.
Offering the possibility of embargoes is one way to encourage greater, and higher quality, participation. By not considering requests for embargoes, PLoS runs the risk of encouraging poor archiving practices, which is ultimately the worst case scenario. At the end of the day, we all benefit when data are archived and open. But, as we state in our paper, high-quality archived data that eventually become open are better than immediately-available, poorly-archived data, or no data at all. The possibility of embargoes also in no way affects the aim of storage and archiving of valuable data-sets.
Imagine the extreme scenario if, prior to 2004, all authors had archived high-quality data with a 10-year embargo. Now, in 2014, all of these data would be stored and accessible by everyone. This scenario is much better than the current state of affairs. What percentage of data associated with publications in 2013 is currently available for re-use, even in journals with existing archiving policies? Recent papers have put this figure as low as 20% (Drew et al. 2013, Lost Branches of the Tree of Life, PLoS Biol 11(9): e1001636). Clearly the current state of affairs is unsatisfactory, but we are not convinced that the direction proposed here is the best solution.
The bottom line is that some studies require more flexibility than others when it comes to data re-use (again, see the Perspective), and this needs to be acknowledged if scientists are going to willingly archive re-usable data. PLoS has been a great venue for publishing research in ecology and evolution. It would be a shame if it could no longer accommodate scientists who want to release their hard-earned data, but who have genuine reasons for requesting priority access to those data for a limited period.
Dom Roche, Sandra Binning, Loeske Kruuk
For a famous example of poorly archived data (code embedded as figures in a Word doc), see:
I found this article by UK Data Archive quite useful:
And this one from Open Science Collaboration:
Thanks for the comments. I would like to see some evidence that having a less restrictive policy would in fact lead to better archiving. I think instead it would be more likely to lead to some people saying they will release data at publication but with incomplete results (some good archiving, much bad) and some people saying they want to embargo the data and those people not doing any better a job than the ones releasing the data now. This would lead to a worse situation because we would add a delay on top of a problem in quality of release.
As someone who has been involved in the genome sequencing community for 15+ years, I can say that strict policies by journals and funders almost certainly led to more widespread and timely data release than would have happened if embargoes were allowed. Sure, some of the data released has had issues. But overall, the benefit to the global scientific community of strict data release requirements has been enormous. And I note – many in the genome sequencing community fought tooth and nail against broad, timely data release requirements. People said things like “we will just delay our publications” or “we will find journals that do not require certain types of data release” or “we will not take funds from agencies that require rapid data release” and yet, in the end, that was all talk. The genome sequencers mostly followed the funder and journal and community data release standards. And that was good. I think the ecology community, and really any community, could do the same. Letting people make excuses for why they do not want to release the data associated with a publication is short-sighted in my opinion.
[…] Ganley and Jonathan Eisen wrote about “Opening Up Data Access, Not Just Articles” for […]
[…] March 1, all PLOS journals will require that all manuscripts have an accompanying data availability statement for the data used in that piece of […]
Crystallography is also a field where it is simply expected that data are archived before publication, and the data are even examined by reviewers. The more movement in this direction, the better…
Thanks for your comment.
It’ll be interesting to see how issues around making data more accessible work out in other disciplines too.
There are access issues for some clinical data too, where confidentiality is necessary; some potential ways around this are to make sure all information is made anonymous, or to extract only the relevant responses from interviews (or pertinent clinical numbers etc.). These are some of the datasets that may count as special cases, where access would need to be overseen by some sort of ethics committee, as discussed in the full Data Policy. Hopefully as we move forward with this, workarounds for tricky situations will become more apparent.
Thanks for your feedback.
As mentioned in the post, we don’t all have to agree here, and in consulting around our new Data Policy, we were very aware that we wouldn’t all agree. The important point is to move forward in trying to make sure that Open Access papers have Openly Accessible data too.
We want to see better archiving practices associated with the papers that we publish. Precisely how the archiving is done is up to the scientists; hopefully you all won’t do a shoddy job. And as noted in the Data Policy:
‘If restrictions on access to data come to light after publication, we reserve the right to post a correction, to contact the authors’ institutions and funders, or in extreme cases to retract the publication.’
Imagine if in 2004 all datasets relating to scientific papers had been made available straight away: how much extra scientific research and discovery could have taken place in the last decade?
This is an important issue for those whose research is based on long-term monitoring of marked individuals. These studies have made important contributions to ecology and evolution (Clutton-Brock and Sheldon 2010). Most have shared their data with anyone who asked and explained why they wanted the data. The policy adopted by PLOS and many other journals is a threat to this type of research. The reasonable suggestions made by Roche et al. go a long way to remove that threat. Here is the problem: suppose my student X completes a PhD on how individual characteristics (genetic, life-history and phenotypic) affect the onset of senescence, using my 40-year database on marked individuals to which she contributed 3 years of fieldwork. Student X publishes her results and provides all the individual-based data. The same year, student Z starts his PhD on paternal effects on age-specific survival according to sex and environmental conditions. Most of the data that will be used by student Z were made public by student X. Professor Y downloads those data, does not contact me and produces a paper on paternal effects, killing student Z’s PhD. My long-standing collaborators Q, P and H, meanwhile, threaten to stop collaborating with my lab (or decide to sue me) because they had provided parts of the dataset used by student X before journals adopted this inflexible policy, and their own research plans are affected by Professor Y’s publication. Had the data been subject to a 5-year embargo, or had Professor Y been obliged to contact me first, the problem would not have emerged and a productive collaboration might have ensued. Roche et al. make several very useful and constructive suggestions. Casting them as being against public archiving is counterproductive. Instead, it is essential that journals consider their suggestions to avoid harming long-term research programs.
Clutton-Brock, T.H., and Sheldon, B.C. 2010. Individuals and populations: the role of long-term, individual-based studies of animals in ecology and evolutionary biology. Trends in Ecology and Evolution, 25: 562–573.
The last two points really get to the heart of it for me. I’d love to see a situation in which the rewards of open data and data archiving are so high, and the ethics of reusing open-access data so good, that even those with long-term datasets are keen to archive them and make them open access immediately. I see no cogent arguments that this wouldn’t be the best thing for science overall. However, as the last commenter points out, it’s not always the best for the scientists in the current system (though I don’t agree with all the issues raised).
To be 100% clear, I think that the vast majority of studies should be accompanied by immediately open access data, and that there are very few cases in which long embargoes (or any embargo at all) are justified. But I DO think that long-term datasets (and possibly some other cases) are one case in which long embargoes might be justified. So the rest of this post refers only to the very small proportion of studies with a genuine (in my opinion) reason to request an embargo.
The situation we are in right now is that those with long-term datasets often do not archive them at all. This is a huge loss to science – not only are some of the most valuable datasets (in scientific and monetary value) eventually lost, but they are never opened up to the huge number of extremely talented researchers who may be able to re-use and re-mix them to make new inferences. With our suggestions, we hope both to increase archiving of long-term datasets and to move them from “never opened up to others” to “eventually opened up to others”. While I completely agree with Jonathan Eisen that this is not the optimal solution for science, I think it’s a step in the right direction.
If we can keep making these kinds of small steps, then we are making positive progress. At the same time, I would advocate that we all try and change the working culture so that we can try and make what is good for science (all data archived and open immediately) the same as what is good for scientists. If we can do that, then we won’t need these discussions any more.
Excellent points. I completely agree that what we need to do is develop systems and policies where what is good for science is also good for the scientists. The same is true for any scholarly area. This, I note, is why I just organized a conference on “Publish or perish – the future of scholarly communication and careers”. I am still writing up summaries of the meeting, but if interested see these blog posts:
Day 1. http://icis.ucdavis.edu/?p=182
Day 2. http://icis.ucdavis.edu/?p=185
The issue of Data Archiving and its complexities came up at the meeting. And I think the general agreement was similar to what you wrote – we need to develop policies and practices to line up individual good with societal good.
[…] al and Dominique Roche et al all in PLOS Biology with associated blog posts from Roli Roberts and Emma Ganley (editors on PLOS Biology). And then read Cameron Neylon’s post ‘Open is a state of […]
[…] new policy — which was actually first announced on January 23, as we noted here — had led to criticism at the DrugMonkey blog, and a February 26 clarification […]
[…] is a big debate going on now regarding what, where and how much data should be shared in association with […]
[…] submitted to any PLOS journal will need to have a ‘data availability statement’ for the data. The release said: “The new Data Policy will be implemented for manuscripts submitted on, or after, March 1st. […]