Software Heritage is a useful repository for scientists to discover, preserve and recognize the source code powering science.

by | Nov 27, 2017 | Community, Guest Post

Software is an essential component of 21st-century science workflows, yet it often receives little attention in formal scientific publication. Software citation is one way to encourage wider recognition of software’s role in scientific analysis. In 2016, we published Software citation principles, by Arfon Smith, Daniel Katz, Kyle Niemeyer, and the FORCE11 Software Citation Working Group, in PeerJ Computer Science. The article established “a consolidated set of citation principles that may encourage broad adoption of a consistent policy for software citation across disciplines and venues”.

Software citation for today’s scientific research requires discoverability and accessibility of software, which may still be in the iterative development stage – no small feat. Here, Daniel Katz, author and Academic Editor for PeerJ Computer Science, follows up on the discussions that have taken place since these principles were first published and provides further direction for ways the scientific community can implement them.


Software Heritage

Software Heritage (as described in a recent white paper) is a really interesting and very ambitious project that started (or at least was announced) last summer. Its mission is to collect, preserve, and share all software that is publicly available in source code form.

And the project has been doing this. For example, it currently stores all public repositories in GitHub, plus a bunch of other things, and is working with Bitbucket. It currently holds almost 4 billion source code files, almost a billion commits, and about 65 million projects.

This means that all public code will be stored, indexed, and made available (currently via an API and eventually simply to browse). The Software Heritage project describes three use cases for this archive:

  • Heritage: Software is an important part of human production. It is also a key enabler for salvaging our entire digital heritage. We collect, preserve, and make accessible source code for the benefits of present and future generations.
  • Science: Science relies more and more on software. To guarantee scientific reproducibility we need to preserve it. Amassing source code at this scale will be challenging, but will also enable the next generation of software studies.
  • Industry: Software is present in all industrial processes and products. The universal source code archive we are building will help industry with provenance tracking, long-term archival, and software bill of materials.

This includes reproducibility but, surprisingly to me, doesn’t include citation and credit. Perhaps this is because the archive by itself is not sufficient to allow credit, though it could be with just a bit of effort by the community.

Citation: citable and cited

As I’ve written before, the citation process involves two elements: making something citable and then actually citing it. These two elements involve three steps, for example for a paper:

  1. The creator of a paper (aka, an author) submits the paper to a publisher.
  2. After some number of steps, the publisher publishes the paper and assigns it an identifier, most likely a DOI.
  3. Someone who wants to refer to the paper within another work cites the metadata of the paper, likely including the identifier.

The first two steps make the paper citable, and the third step cites the paper.

The second step is the key to making the paper citable, by making it recoverable. The APA Publication Manual distinguishes between recoverable and unrecoverable data: recoverable data is that which can be accessed by the reader via the citation information, while unrecoverable data is that which cannot be accessed via the citation information. The APA Manual goes on to recommend that recoverable data should be cited as a formal citation, and unrecoverable data should be referred to within the text as “(author, personal communication, date)”.

Software citation

But for software, this distinction between recoverable (published) and unrecoverable (not available) doesn’t work. All versions of software on GitHub, even if never published, are recoverable by default (more or less, barring the project being deleted from GitHub, though even then they could be recovered from a local repository).

The software citation principles try to force this issue by recommending the insertion of Step 2 in the process. Today, when a creator develops software on GitHub, the software is never really complete, though it may be released at different stages (versions) during its development. The principles say that the creator should also publish each software release (for example, through Zenodo or figshare). This finishes the process of making the software citable, and allows someone else then to cite it.

This is a reasonable solution in many cases, because it allows the reader of the paper to recover (access) the software that was cited. But in some cases, it will not work, because it adds a step to the software developers’ workflow that they may not care about enough to implement. Even if we do get to a future time in which developers routinely publish their software releases, what happens until then, or for existing software?

Enter Software Heritage

For software that it archives, Software Heritage mostly removes the need for Step 2. If I, as a user of software, want to cite the software I’ve used, I just need to:

  1. Find it on Software Heritage
  2. Cite it

Of course, this isn’t quite as simple as it sounds, but it could be in the future.

Three gaps that Software Heritage can fill

Here are three of the things that I think are missing, for which Software Heritage provides the basic answers, but some additional work is still needed:

  1. To cite software, how do I find it on Software Heritage? Many people today use software from GitHub, and they would like to cite it by pointing to the GitHub repository and commit hash. However, GitHub is not an archive while Software Heritage is. (While much software development is done on GitHub today, at some future point, this will likely no longer be true. Think about Google Code. And while Google has created a Google Code archive, it’s unlikely a smaller company would support creating and maintaining an archive of a dead project.) I find it easy to imagine a set of tools that could link from a GitHub commit hash to a location in Software Heritage’s internal Merkle tree.
  2. What is a Software Heritage ID? Software Heritage uses a (very long) hash to represent a file (a node in the Merkle tree). Exactly how this hash should be translated to a PID is not clear. Perhaps something of the form https://softwareheritage.org/ID/hash? Or, given that most Software Heritage hashes won’t be cited, perhaps a smaller hash space could be used for those that are, leading to PIDs that are easier to document as text?
  3. How do I access cited software on Software Heritage? Of course, an extra function is also needed to make the recoverable part of the citation work: going from a Software Heritage ID to actually obtaining the software that was cited. This needs to obtain the full package, and it also ideally should link back to where the software is being developed. Using Software Heritage would be enough to see if there had been further developments of the software, such as bug fixes, that could be important depending on whether your need is simply to access the software to repeat the exact experiment in a work that cited the software, or to reproduce the work at a higher level. And, if you want to contribute to the software, or create an issue about it, you also need a link to the software repository where the active development is occurring.
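The first two gaps can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not Software Heritage’s actual scheme: it assumes a file node is identified the way Git identifies a blob (SHA-1 over a short header plus the content), and it reuses the hypothetical URL form suggested in item 2. The names `PID_BASE`, `git_blob_hash`, and `make_pid` are made up for illustration.

```python
import hashlib

# Hypothetical base URL, following the form suggested in the text;
# it is not a real resolver endpoint.
PID_BASE = "https://softwareheritage.org/ID/"


def git_blob_hash(content: bytes) -> str:
    """Hash file content the way Git hashes a blob.

    Assumption: the archive identifies a file node like Git does, i.e.
    SHA-1 over the header b"blob <size>" plus a NUL byte, then the content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def make_pid(node_hash: str) -> str:
    """Turn a node hash into a citable identifier (hypothetical format)."""
    return PID_BASE + node_hash


if __name__ == "__main__":
    # The same bytes always give the same identifier, wherever they are stored.
    print(make_pid(git_blob_hash(b"print('hello')\n")))
```

Under this assumption the identifier is derived from the content itself, so a tool holding a Git object hash from GitHub could in principle look up the same node in the archive without any extra registration step.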

The metadata/credit gap

The remaining gap is not one that Software Heritage can solve alone. It can be asked as two related questions:

  1. How do I give credit to the developers of the software?
  2. How do I find the appropriate metadata for the citation?

The question of what is appropriate metadata for software (for the purpose of citation) was partially addressed in the Software Citation Principles paper, specifically in Table 2, though this hasn’t really been tested in practice. DataCite has drafted a new version (4.1) of their schema that adds and updates elements to reflect these principles, though this isn’t public yet.

It’s also important to note that, as the Software Citation Principles paper says:

Similarly, the software metadata recorded as part of data provenance will overlap the metadata recorded as part of software citation for the software that was used in the work. The data recorded for reproducibility should also overlap the metadata recorded as part of software citation. In general, we intend the software citation principles to cover the minimum of what is necessary for software citation for the purpose of software identification. Some use cases related to citation (e.g., provenance, reproducibility) might have additional requirements beyond the basic metadata needed for citation, as Table 2 shows.

One way to think about this is that there is some metadata that describe properties of the software itself as source code, such as: authors, language, license, version number, location, etc. Let’s call this software creation metadata. And there are also metadata that describe how the code is being used, possibly including how it is built, such as: compiler version, operating system, parallel computing platform, command-line options, etc. Let’s call this software usage metadata. The software citation principles say that the software creation metadata are needed for citation, while the software usage metadata are needed for provenance and reproducibility.
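One way to make this split concrete is to model the two kinds of metadata as separate record types. The sketch below is illustrative only: the class and field names are my own, chosen to mirror the examples in the paragraph above, and are not taken from the principles paper or from any schema.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SoftwareCreationMetadata:
    """Properties of the software itself as source code; needed for citation."""
    authors: List[str]
    language: str
    license: str
    version: str
    location: str


@dataclass
class SoftwareUsageMetadata:
    """How the code was built and run; needed for provenance and reproducibility."""
    compiler_version: str
    operating_system: str
    parallel_platform: str
    command_line_options: List[str] = field(default_factory=list)


def citation_fields(creation: SoftwareCreationMetadata) -> dict:
    """A citation draws only on the creation metadata."""
    return asdict(creation)


if __name__ == "__main__":
    # All values below are invented for the example.
    creation = SoftwareCreationMetadata(
        authors=["A. Example"],
        language="Python",
        license="MIT",
        version="1.0",
        location="https://example.org/repo",
    )
    print(citation_fields(creation))
```

Keeping the two records apart reflects who can fill them in: a user of the software can write down a `SoftwareUsageMetadata` record, while only the creators can supply a `SoftwareCreationMetadata` one.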

While a person who uses some software in their research can determine the software usage metadata, this person cannot determine the software creation metadata. This can only be done by the software creators. So, Software Heritage cannot provide the metadata needed for software citation – this is why there is a gap.

Filling most of the metadata gap

But the authors of the software who want to be cited can fill this gap, and could do so relatively easily. They just would need to create a single metadata file in the root of their repository, with an agreed upon name.

The first time I heard this, it was suggested by Martin Fenner, based on work done in the CodeMeta project, which has the goal of creating a minimal metadata schema for science software and code, in JSON and XML. Martin provided an example of how this could be done: the codemeta.json file in the repository https://github.com/datacite/maremma. According to Martin, the process by which DataCite today could generate a DOI and a citation from this is semi-manual and involves using https://github.com/datacite/bolognese for DataCite XML generation.

If code developers created a codemeta.json file in their repository when they started working on their project, they would then just need to keep it up to date, much like they do their README (description of their project) or CONTRIBUTORS (who has contributed to the project) files, and they might not need to create a CITATION (how the project should be cited) file. Or, the CONTRIBUTORS and CITATION files could be generated from the codemeta.json as part of continuous integration, or as part of releasing or packaging.
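As a rough sketch of that last idea, the Python below turns the contents of a codemeta.json file into a one-line citation string that a release script could write to a CITATION file. The example metadata is invented, the citation layout is my own choice rather than any standard, and the field names (`name`, `version`, `codeRepository`, `author`, `givenName`, `familyName`) are assumed to follow the schema.org-style terms CodeMeta uses; a real file may be organized differently.

```python
import json

# Invented example of what a codemeta.json file might contain.
EXAMPLE_CODEMETA = """
{
  "@type": "SoftwareSourceCode",
  "name": "example-tool",
  "version": "1.2.0",
  "codeRepository": "https://example.org/example-tool",
  "author": [
    {"givenName": "Ada", "familyName": "Example"},
    {"givenName": "Ben", "familyName": "Sample"}
  ]
}
"""


def format_citation(codemeta: dict) -> str:
    """Build a plain-text citation from parsed codemeta.json content."""
    # "Family, G." for each author, separated by semicolons.
    authors = "; ".join(
        f"{a['familyName']}, {a['givenName'][0]}." for a in codemeta["author"]
    )
    return (
        f"{authors} {codemeta['name']} (version {codemeta['version']}) "
        f"[Software]. {codemeta['codeRepository']}"
    )


if __name__ == "__main__":
    print(format_citation(json.loads(EXAMPLE_CODEMETA)))
```

Run as part of continuous integration or packaging, a step like this would keep the CITATION file in sync with the single source of metadata instead of being edited by hand.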

Since Software Heritage would keep all versions of codemeta.json with the corresponding versions of the software code, it would be relatively easy to retrospectively build the proper citation metadata for any version of the software.

The rest of the metadata gap

This still leaves a portion of the gap: How do we build a citation to code when the authors don’t care about credit and have not provided a codemeta.json file? This also needs to be answered to cite almost all software that has been built to date.

Most of the metadata needed for this (see Table 2) can be extracted from the repository directly. The one thing that cannot is the authors. While some would argue that the authors are the same as the contributors to the repository, I don’t agree. There are contributors to a software project who may not contribute to the repository (e.g., a person who gets the funding and other resources to enable the project, a person who provides training on the software) and there are contributors to the repository who are not authors of the software (e.g., an administrator who adds license information to source code files). Therefore, I think the best thing to do is simply to identify the project as the authors (e.g., authors = “CodeMeta Project”), and if the authors feel this is incorrect, they can create a codemeta.json file to provide the correct information.
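This fallback rule is simple enough to sketch in Python. The function name `citation_authors` and the way author entries are read (`name`, or `givenName` plus `familyName`) are assumptions for illustration, not part of any existing tool.

```python
from typing import List, Optional


def citation_authors(codemeta: Optional[dict], project_name: str) -> List[str]:
    """Pick the authors to cite.

    Use the authors listed in codemeta.json when the file exists and names
    them; otherwise fall back to citing the project itself as the author.
    `codemeta` is the parsed file content, or None if the repository has no
    codemeta.json.
    """
    names: List[str] = []
    for entry in (codemeta or {}).get("author", []):
        if not isinstance(entry, dict):
            continue
        # Assumed layout: either a single "name" or given/family name parts.
        name = entry.get("name") or " ".join(
            part
            for part in (entry.get("givenName"), entry.get("familyName"))
            if part
        )
        if name:
            names.append(name)
    return names or [project_name]


if __name__ == "__main__":
    # No codemeta.json in the repository: the project is cited as the author.
    print(citation_authors(None, "CodeMeta Project"))
```

The design choice here is that the project name is only ever a default: as soon as the authors provide their own list, it takes precedence.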

Summary

To put this all together, I am really excited to see Software Heritage emerge, as this archive of all source code will enable a lot of understanding of software, as the project claims. And I’m also excited that Software Heritage will enable software citation better than anything we have today. Finally, this also opens a path by which software authors can create and maintain a single file in each repository to provide the metadata needed to make software citation almost automatic.

The views & opinions expressed in PeerJ Community guest posts featured on the PeerJ Blog are those of the author and do not necessarily reflect the opinions & views of PeerJ. This post is cross-posted at Daniel Katz’s blog. This was the third post in a series related to talks and discussions at the 10th RDA Plenary. The first and second posts can be found here.
