A peer-reviewed article of this Preprint also exists.
It looks like this article doesn't currently mention standard high-throughput databases. Although most scientists should be aware of these, data deposition rates are still not as high as they should be and it may be worth emphasizing the value of these databases:
ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
First of all: congratulations. This is a really good paper and I really hope everyone in ecology & evolution (& beyond!) reads it.
That said, there are a couple of minor things I'd like to mention now, at the pre-print stage, that might help polish this even further:
I'm a little bit confused with Table 2:
Ecological Archives & Knowledge Network for Biocomplexity are both listed as License: No.
No? Hmmm... I did a little digging...
What about this awful non-standard bespoke permission statement that Ecological Archives appear to apply: http://esapubs.org/archive/copyright.htm
It's a flawed permission statement that doesn't adequately allow for data re-use e.g. only 'copies' (exact?) are allowed. No explicit allowance for format-shifting is given AFAIK (which is copyright infringement if done without permission in many legal jurisdictions), or remixing with new data - so as I read it - it only gives permission to copy & repost the whole data file(s) as a blob, no changes allowed. And the "initial screen of a display" attribution clause is just totally weird.
[part 1 / 2 of this comment, not enough characters to post it all as one contiguous comment!]
[comment 2/2] Knowledge Network for Biocomplexity:
It would appear to me that there are licences on some of the data there e.g. all the LTER datasets
that info appears to be in the "Data Set Usage Rights" bit for each dataset, but it certainly appears like there's a lot of heterogeneous, bespoke stuff in there. Rather questionable whether the tiny statement "no restrictions"
is actually meaningful, or whether (default) full copyright restrictions would still apply to this dataset (?). I wish they used CC0!
[Related anecdote: Morphobank (http://www.morphobank.org/) initially also provided data without any licencing information on site until I had a frank discussion about the importance of licencing with Maureen O'Leary (P.I. of the Morphobank project) and the team at the Society of Vertebrate Paleontology meeting in 2011 (Las Vegas), and some follow-up emails afterwards (2011-11-10). So I take great pride when I see CC-licencing being used there. I advocated open licences: CC0 & CC-BY but museum people seem to have a strong attachment to NC & ND so unfortunately Morphobank also allows these modules too.]
Takeaway point: someone really needs to prod these archives/databases about licencing! They may well be amenable to change once they realise the error of their ways.
A shorter comment this time...
With Table 2, the meaning of the Access column and its scoring isn't clear to me.
A) Figshare allows data to be uploaded and held privately indefinitely (up to 1GB); making the data open/public is optional. They also have a third type of access: private sharing between collaborators. Not sure if this is 'live' yet but I saw Mark Hahnel demo it a couple of days ago at a talk at Imperial College London. Thus I'd definitely score it as 'variable' in terms of Access.
B) What about 'embargoes' WRT access?
On the whole Dryad is brilliant. But my minor peeve with it is that, for some journals, they've caved in to pressure & allowed authors to optionally 'embargo' access to their data for up to a year after the associated paper is published. Many authors have chosen to make use of this 'feature', so whilst the data is always eventually open, it's either immediately-open or delayed-open. Better than not open at all I suppose...
C) What exactly is meant/implied by 'variable' access?
I may be pedantic here but I'm not entirely sure if it's clear what variable access entails.
I presume you mean not all data is necessarily freely accessible to all. Perhaps it might also be worth saying 'free' or 'public' access rather than 'open' because open also implies liberal re-use rights.
Not sure if this has already been submitted anywhere, but I was thinking about using data from multiple sources and the increased chance that the same data can end up in more than one repository. For example, if you use historic data from VertNet and combine it with modern resurvey data, then deposit your new data in Dryad, the historic data could end up in two places. Either that should be avoided and only the new data should be deposited (and someone can recompile the appropriate VertNet data again later) or, at a minimum, the museum and unique specimen identifier need to be maintained in the new dataset. That way someone combining data later doesn't artificially double the sample size by using the same data twice. There are obviously data processing steps that should prevent that, but only if there is sufficient information in both datasets to identify duplicates or adequate metadata to reveal this.
Maybe that doesn't need to be mentioned since it should be covered by the suggestions to provide good metadata and upload raw data. Just a thought.
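To make the duplicate check above concrete, here is a minimal Python sketch of how someone recombining data might use the museum and specimen identifier to avoid counting a record twice. The field names (`institution_code`, `catalog_number`), the function names and the records are all made up for illustration; they are not taken from VertNet, Dryad or the paper.

```python
# Hypothetical key fields: which museum holds the specimen, and its identifier there.
KEY_FIELDS = ("institution_code", "catalog_number")


def specimen_key(record):
    """Build a normalized (museum, specimen id) key for one record."""
    return tuple(str(record[field]).strip().upper() for field in KEY_FIELDS)


def merge_without_duplicates(*datasets):
    """Concatenate datasets, keeping only the first record seen per specimen key."""
    seen = set()
    merged = []
    for dataset in datasets:
        for record in dataset:
            key = specimen_key(record)
            if key in seen:
                continue  # same specimen already included from an earlier source
            seen.add(key)
            merged.append(record)
    return merged


# Made-up example: one historic record was copied into the newer deposit.
historic = [
    {"institution_code": "XYZ", "catalog_number": "1001", "year": 1915},
    {"institution_code": "XYZ", "catalog_number": "1002", "year": 1916},
]
new_deposit = [
    {"institution_code": "xyz", "catalog_number": "1001 ", "year": 1915},
    {"institution_code": "XYZ", "catalog_number": "2001", "year": 2012},
]

combined = merge_without_duplicates(historic, new_deposit)
print(len(combined))  # 3 records rather than 4
```

This only works if both datasets actually carry the identifiers, which is the point of keeping them in the deposited file.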
Good luck with the paper. A colleague already found this useful for writing her NSF data management plan.
This is a nicely written article. It is an introduction to the realization/implementation of Open Data in the research context. I think it achieves the aim.
The language is simple and straightforward, and the message is delivered.
I have got only one suggestion for the actual content. You may want to add http://www.zenodo.org/ to the comparison in Table 2.
As the article is aimed at beginners, I tried to read it from that viewpoint.
I realized that the manuscript may benefit from spending some words on the concept of Open Data and the related licenses.
Therefore, I have three suggestions for expanding the article.
1) The paper is an introduction to realizing Open Data. However, the term Open Data appears only once (line 399, in the references). I think that putting the term Open Data in the Abstract may improve the visibility of the manuscript.
2) As the article is for people with very little understanding of Open Science, a tiny sub-section (1.1?) on Open Data would make it attractive for beginners. In my field (Computer Science), even not-so-senior researchers fail to understand the difference between Open Access and Open Data. They do not embrace openness because they do not understand it in the first place. So, they simply avoid it. I guess this happens in other fields, too. Your article has the potential to address this issue.
3) Section 9 could be improved with an introduction to CC licenses. Again, the world is still full of researchers who do not understand them. Thus, they avoid them. For example, I would be happy to see this manuscript explaining/answering (in a few lines):
3.1) "Why should I choose a license? I could simply upload the data to my personal Web page. Everybody would be free to download it." This concern is less trivial than it looks.
3.2) The difference between the public domain and the CC0 waiver. For example, public domain varies per jurisdiction, and it is not universally recognized. CC0 is designed to address these issues.
I wish you good luck with your endeavor.
Congratulations on the brilliant and encouraging review!
It has definitely helped us with how to organize our own data, and also with how to better share raw data!
I honestly think the future of scientific publishing depends on a transparent process regarding authors, editors, reviewers and publishers; and data sharing is a remarkable step towards increasing this transparency.
Along with open reviews, which are long awaited from most journals, I believe such policies can only benefit the scientific community.