The health care and life sciences community profile for dataset descriptions

Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, United States
Computer Science, Heriot-Watt University, Edinburgh, United Kingdom
Department of Radiation Oncology (MAASTRO), GROW - School for Oncology and Developmental Biology, MAASTRO Clinic, Maastricht, The Netherlands
Ontotext Corporation, Sofia, Bulgaria
CSIRO, Canberra, Australia
The Donnelly Centre, University of Toronto, Toronto, Canada
Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Geneve, Switzerland
Carleton University, Ottawa, Canada
SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
IO Informatics, Berkeley, CA, United States of America
Oxford e-Research Centre, University of Oxford, Oxford, Oxfordshire, United Kingdom
Elsevier Labs, Amsterdam, Netherlands
Department of Medical Informatics and Epidemiology, Oregon Health Sciences University, Portland, Oregon, United States
NIBIO, Osaka, Japan
EMBL, European Bioinformatics Institute, Saffron Walden, United Kingdom
Database Center for Life Sciences, Sendai, Japan
RIKEN, Wako, Japan
Cerenode Inc., Troy, United States of America
Babraham Institute, Cambridge, United Kingdom
Nationwide Children's Hospital, Columbus, Ohio, United States of America
Institute for Systems Biology, Seattle, Washington, United States of America
Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
Department of Exact Sciences, VU University Amsterdam, Amsterdam, Netherlands
Research Organization of Information and Systems, Database Center for Life Sciences, Kashiwa, Japan
DOI
10.7287/peerj.preprints.1982v2
Subject Areas
Bioinformatics, Taxonomy, Computational Science
Keywords
data profiling, dataset descriptions, metadata, provenance, FAIR data
Copyright
© 2017 Dumontier et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Dumontier M, Gray AJG, Marshall MS, Alexiev V, Ansell P, Bader GD, Baran J, Bolleman JT, Callahan A, Cruz-Toledo J, Gaudet P, Gombocz EA, Gonzalez Beltran AN, Groth P, Haendel M, Ito M, Jupp S, Juty N, Katayama T, Kobayashi N, Krishnaswami K, Laibe C, Le Novère N, Lin S, Malone J, Miller M, Mungall C, Rietveld L, Wimalaratne SM, Yamaguchi A. 2017. The health care and life sciences community profile for dataset descriptions. PeerJ Preprints 5:e1982v2

Abstract

Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. To provide a practical guide for producing high-quality descriptions of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. It reuses existing vocabularies and is intended to meet key functional requirements, including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine-readable descriptions of versioned datasets.

Author Comment

Revisions to address reviewers' comments, chiefly adding a summary of use cases, a complete example of a summary-level description, and more discussion of related work.