Versioned data: why it is needed and how it can be achieved (easily and cheaply)

Daniel Falster; Richard G FitzJohn; Matthew W. Pennell; William K. Cornwell

doi:10.7287/peerj.preprints.3401v1

Daniel Falster¹, Richard G FitzJohn², Matthew W. Pennell³, William K. Cornwell¹

1 Evolution and Ecology Research Centre, University of New South Wales, Sydney NSW 2052, Australia

2 School of Public Health, Imperial College London, London SW7 2AZ, United Kingdom

3 Department of Zoology and Biodiversity Research Centre, University of British Columbia, Vancouver B.C. V6T 1Z4, Canada

Published: 2017-11-10
Accepted: 2017-11-10

Subject Areas: Computational Biology, Ecology, Computational Science, Data Science
Keywords: Version control, Data sharing, Semantic versioning, Meta-analysis

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Falster D, FitzJohn RG, Pennell MW, Cornwell WK. 2017. Versioned data: why it is needed and how it can be achieved (easily and cheaply). PeerJ Preprints 5:e3401v1 https://doi.org/10.7287/peerj.preprints.3401v1

Abstract

The sharing and re-use of data has become a cornerstone of modern science. Multiple platforms now allow quick and easy data sharing. So far, however, data publishing models have not accommodated ongoing scientific improvements in data. For many problems, datasets continue to grow with time: more records are added, errors are fixed, and new data structures are created. In other words, datasets, like scientific knowledge, advance with time. We therefore suggest that many datasets would be usefully published as a series of versions, with a simple naming system that lets users see the type of change between versions. In this article, we argue for adopting the paradigm and processes of versioned data, analogous to software versioning. We also introduce a system called Versioned Data Delivery and present tools for creating, archiving, and distributing versioned data easily, quickly, and cheaply. These new tools allow individual research groups to shift from a static model of data curation to a dynamic and versioned model that more naturally matches the scientific process.
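
As an illustration of how a version name can signal the type of change, the following minimal Python sketch compares two version strings under a semantic-versioning-style scheme. The three-part MAJOR.MINOR.PATCH format and the mapping of each part to a kind of data change (new data structures, added records, fixed errors) are assumptions made for this example, and the function names are hypothetical. It is not the Versioned Data Delivery tooling itself.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = (int(part) for part in parts)
    return major, minor, patch


def change_type(old: str, new: str) -> str:
    """Classify the change between two dataset versions.

    Assumed mapping (illustrative only):
      major -> data structure changed (may break existing analyses)
      minor -> records added, structure unchanged
      patch -> errors fixed in existing records
    """
    old_parts = parse_version(old)
    new_parts = parse_version(new)
    if new_parts == old_parts:
        return "none"
    if new_parts < old_parts:
        raise ValueError(f"{new} is older than {old}")
    if new_parts[0] != old_parts[0]:
        return "major"
    if new_parts[1] != old_parts[1]:
        return "minor"
    return "patch"


if __name__ == "__main__":
    print(change_type("1.0.0", "1.0.1"))  # patch
    print(change_type("1.0.1", "1.1.0"))  # minor
    print(change_type("1.1.0", "2.0.0"))  # major
```

Under a scheme like this, a user holding version 1.0.0 can tell from the name alone whether moving to a newer release is likely to require changes to their analysis code.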

Author Comment

This is a preprint submission to PeerJ Preprints.