This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Discoveries in modern science can take years and involve large amounts of data, many people, and various tools. Although good scientific practice dictates that findings should be reproducible, very few automated tools actually support traceability of the scientific method employed, in particular when different experimental environments are involved at different research phases. Data provenance tracking approaches can play a major role in addressing many of these challenges. These approaches propose ways to capture, manage, and use provenance information to support the traceability of scientific methods in heterogeneous environments. PROV is a W3C standard that provides a comprehensive model for data and semantics representation, with common vocabularies and rich concepts to describe provenance. Nevertheless, it is difficult for domain scientists to understand and adopt all the richness provided by PROV. In this paper we describe the design and implementation of the provenance manager PROV-man, a PROV-compliant framework that helps scientists integrate provenance capabilities into their data analysis tools. PROV-man provides functionality to create and manipulate provenance data in a consistent manner and ensures its permanent storage. It also provides a set of interfaces to serialize and export provenance data into various data formats, which supports interoperability. The open architecture of PROV-man, consisting of an API and a configurable database, allows for easy deployment within existing and newly developed software tools. The paper presents two examples illustrating the usage of PROV-man. The first example illustrates how to create and manipulate provenance data of an online newspaper article using PROV-man.
The second example demonstrates and evaluates the PROV-man implementation in a more complex case: collecting provenance data about biomedical data analysis activities that are carried out on a distributed computing infrastructure.
This is a submission to PeerJ Computer Science for review.