Nine simple ways to make it easier to (re)use your data

doi:10.7287/peerj.preprints.7v2

Ethan P. White¹, Elita Baldridge¹, Zachary T. Brym¹, Kenneth J. Locey², Daniel J. McGlinn¹, Sarah R. Supp¹

¹ Department of Biology and the Ecology Center, Utah State University, Logan, UT, United States

² Department of Biology, Utah State University, Logan, UT, United States

Published: 2013-07-05
Accepted: 2013-07-05

Subject Areas: Computational Biology, Ecology, Evolutionary Studies, Computational Science
Keywords: data sharing, data reuse, repository, license, data format

Copyright: © 2013 White et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Cite this article: White EP, Baldridge E, Brym ZT, Locey KJ, McGlinn DJ, Supp SR. 2013. Nine simple ways to make it easier to (re)use your data. PeerJ PrePrints 1:e7v2 https://doi.org/10.7287/peerj.preprints.7v2

Abstract

Sharing data is increasingly considered to be an important part of the scientific process. Making your data publicly available allows original results to be reproduced and new analyses to be conducted. While sharing your data is the first step in allowing reuse, it is also important that the data be easy to understand and use. We describe nine simple ways to make it easy to reuse the data that you share and also make it easier to work with it yourself. Our recommendations focus on making your data understandable, easy to analyze, and readily available to the wider community of scientists.

Feedback on this revision

4039 days ago - Charles Warden

Interesting editorial!

It looks like this article doesn't currently mention standard high-throughput databases. Although most scientists should be aware of these, data deposition rates are still not as high as they should be and it may be worth emphasizing the value of these databases:

GEO: http://www.ncbi.nlm.nih.gov/geo/

Array Express: http://www.ebi.ac.uk/arrayexpress/

SRA: http://www.ncbi.nlm.nih.gov/sra

PRIDE: http://www.ebi.ac.uk/pride/

4031 days ago - Ethan White

Thanks for the feedback, Charles!

Our goal was never to highlight all of the major classes of databases, since there are so many across ecology and evolution, but if we decide to do another revision we'll definitely consider adding one of them as an example somewhere in the paper in the hopes of encouraging more deposition.

4026 days ago - Ross Mounce

First of all: congratulations. This is a really good paper and I really hope everyone in ecology & evolution (& beyond!) reads it.

That said, there are a couple of minor things I'd like to mention now, at the pre-print stage, that might help polish this even further:

I'm a little bit confused by Table 2:

Ecological Archives & Knowledge Network for Biocomplexity are both listed as License: No.

No? Hmmm... I did a little digging...

Ecological Archives:

What about this awful non-standard bespoke permission statement that Ecological Archives appear to apply: http://esapubs.org/archive/copyright.htm

It's a flawed permission statement that doesn't adequately allow for data re-use e.g. only 'copies' (exact?) are allowed. No explicit allowance for format-shifting is given AFAIK (which is copyright infringement if done without permission in many legal jurisdictions), or remixing with new data - so as I read it - it only gives permission to copy & repost the whole data file(s) as a blob, no changes allowed. And the "initial screen of a display" attribution clause is just totally weird.

[part 1 / 2 of this comment, not enough characters to post it all as one contiguous comment!]

3989 days ago - Ethan White

Thanks for the feedback Ross and my apologies for taking so long to get back to you. The paper had been accepted by the time you commented and I was in the middle of getting ready to move for sabbatical and leaving for ESA.

"No" was definitely vague and I just added a sentence to the legend during proofs to clarify this: "No License indicates optional and non-standard licenses...". It's not great, but it's something. The point is that there isn't a good standard license required.

3989 days ago - Ethan White

Also (and sorry for the short comments; my responses seem to be limited to 500 characters at the moment), everything else I've heard or seen about Ecological Archives, including direct public statements from the ESA president, is that it's attribution only (https://twitter.com/ESA_Prez2013/status/312212684600905731). They definitely need to be pushed to move to standard and clearly open licenses. It's on my list, but if you have the time you should go for it.

4026 days ago - Ross Mounce

[comment 2/2] Knowledge Network for Biocomplexity:

It would appear to me that there are licences on some of the data there e.g. all the LTER datasets

http://knb.ecoinformatics.org/knb/metacat?action=read&qformat=knb&sessionid=0&docid=knb-lter-vcr.87.10

That info appears to be in the "Data Set Usage Rights" bit for each dataset, but it certainly appears like there's a lot of heterogeneous, bespoke stuff in there. Rather questionable whether the tiny statement "no restrictions"

http://knb.ecoinformatics.org/knb/metacat?action=read&qformat=knb&sessionid=0&docid=chadden.61.3

is actually meaningful, or whether (default) full copyright restrictions would still apply to this dataset (?). I wish they used CC0!

[Related anecdote: Morphobank (http://www.morphobank.org/) initially also provided data without any licensing information on site until I had a frank discussion about the importance of licensing with Maureen O'Leary (P.I. of the Morphobank project) and the team at the Society of Vertebrate Paleontology meeting in 2011 (Las Vegas), and some follow-up emails afterwards (2011-11-10). So I take great pride when I see CC licensing being used there. I advocated open licences: CC0 & CC-BY, but museum people seem to have a strong attachment to NC & ND, so unfortunately Morphobank also allows these modules too.]

Takeaway point: someone really needs to prod these archives/databases about licensing! They may well be amenable to change once they realise the error of their ways.

3989 days ago - Ethan White

Yep, we were using "No" to indicate a bad lack of a license that was general across datasets, not a good lack of one. There is definitely more education necessary here.

4026 days ago - Ross Mounce

A shorter comment this time...

With Table 2, the meaning of the Access column and its scoring isn't clear to me.

A) Figshare allows data to be uploaded and held privately indefinitely (up to 1 GB) - making the data open/public is optional [1]. They also have a third type of access - private sharing between collaborators. Not sure if this is 'live' yet, but I saw Mark Hahnel demo it a couple of days ago at a talk at Imperial College London. Thus I'd definitely score it as 'variable' in terms of Access.

B) What about 'embargoes' WRT access?

On the whole Dryad is brilliant. But my minor peeve with it is that they've caved in to pressure & allowed authors to optionally 'embargo' access to their data for up to a year after the associated paper is published, for some journals [2]. Many authors have chosen to make use of this 'feature', so whilst the data is always eventually open, it's either immediately open or delayed open. Better than not open at all I suppose...

C) What exactly is meant/implied by 'variable' access?

I may be pedantic here but I'm not entirely sure if it's clear what variable access entails.

I presume you mean not all data is necessarily freely accessible to all. Perhaps it might also be worth saying 'free' or 'public' access rather than 'open' because open also implies liberal re-use rights.

1. http://www.digital-science.com/blog/posts/figshare-links-your-desktop-to-the-cloud

2. http://datadryad.org/pages/faq

3989 days ago - Ethan White

Again, this was definitely vague and I appreciate you pointing it out. I've added a sentence to the legend in proofs to indicate that "Variable Access indicates that only some data is openly available" and changed figshare's status.

4020 days ago - Daniel Hocking

Not sure if this has already been submitted anywhere, but I was thinking about using data from multiple sources and the increased chance that the same data can end up in more than one repository. For example, if you use historic data from VertNet and combine it with modern resurvey data, then deposit your new data in Dryad, the historic data could end up in two places. Either that should be avoided and only the new data should be deposited (and someone can recompile the appropriate VertNet data again later) or, at a minimum, the museum and unique specimen identifier need to be maintained in the new dataset. That way someone combining data later doesn't artificially double the sample size by using the same data twice. There are obviously data processing steps that should prevent that, but only if there is sufficient information in both datasets to identify duplicates or adequate metadata to reveal this.

Maybe that doesn't need to be mentioned since it should be covered by the suggestions to provide good metadata and upload raw data. Just a thought.

Good luck with the paper. A colleague already found this useful for writing her NSF data management plan.

-Dan

3989 days ago - Ethan White

Apologies for taking so long to get back to you, between moving for sabbatical and getting ready for ESA I've been way behind on everything.

This is definitely something to be concerned about and isn't something with an easy solution for compilations of small data. That said, in addition to not having any great answers I think this topic is too advanced for this paper, which we intentionally tried to keep at a very introductory level.

3988 days ago - Daniel Graziotin

This is a nicely written article. It is an introduction to the realization/implementation of Open Data in the research context. I think it achieves the aim.

The language is simple and straightforward, and the message is delivered.

I have got only one suggestion for the actual content. You may want to add http://www.zenodo.org/ to the comparison in Table 2.

As the article is aimed at beginners, I tried to read it from that viewpoint.

I realized that the manuscript may benefit from spending some words on the concept of Open Data and the related licenses.

Therefore, I have three suggestions for expanding the article.

1) The paper is an introduction to realizing Open Data. However, the term Open Data appears only once (line 399, in the references). I think that putting the term Open Data in the Abstract may improve the visibility of the manuscript.

2) As the article is for people with very little understanding of Open Science, a tiny sub-section (1.1?) on Open Data would make it attractive for beginners. In my field (Computer Science), even not-so-senior researchers fail to understand the difference between Open Access and Open Data. They do not embrace openness because they do not understand it in the first place. So, they simply avoid it. I guess this happens in other fields, too. Your article has the potential to address this issue.

3) Section 9 could be improved with an introduction to CC licenses. Again, the world is still full of researchers who do not understand them. Thus, they avoid them. For example, I would be happy to see this manuscript explaining/answering (with few lines):

3.1) "Why should I choose a license? I could simply upload the data to my personal Web page. Everybody would be free to download it." This concern is less trivial than it looks.

3.2) The difference between public domain and CC0 waiver. For example, public domain varies per jurisdiction, and it is not universally recognized. CC0 is designed to address these issues.

I wish you good luck with your endeavor.

3988 days ago - Ethan White

Thanks for the feedback. I've just submitted proofs for this paper so we can't change anything at this point, but your overall points are well taken. Fortunately this is being published as part of a special section in Ideas in Ecology and Evolution that will also include at least one and probably several papers justifying the need for open data (http://figshare.com/articles/Movingtowardasustainableecologicalsciencedontletdatagotowaste_/693745), which is why we focused on reuse.

3106 days ago - Breno Barros

Congratulations on the brilliant and encouraging review!

It has definitely helped us with how to organize our own data, and also with how to better share raw data!

I honestly think the future of scientific publishing depends on a transparent process regarding authors, editors, reviewers and publishers; and data sharing is a remarkable step for increasing this transparency.

Along with open reviews, which are long expected from most journals, I believe such policies can only benefit the scientific community.

3106 days ago - Ethan White

Thanks Breno. I'm very glad to hear that you enjoyed the paper and certainly agree that increasing the openness of science more generally will be beneficial.
