An empirical analysis of the Docker container ecosystem on GitHub

software evolution and architecture lab, University of Zurich, Zurich, Switzerland
IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
DOI
10.7287/peerj.preprints.2905v1
Subject Areas
Data Science, Software Engineering
Keywords
empirical software engineering, Docker, GitHub, mining software repositories, container, Infrastructure as Code
Copyright
© 2017 Cito et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Cito J, Schermann G, Wittern E, Leitner P, Zumberi S, Gall HC. 2017. An empirical analysis of the Docker container ecosystem on GitHub. PeerJ Preprints 5:e2905v1

Abstract

Docker allows packaging an application with its dependencies into a standardized, self-contained unit (a so-called container), which can be used for software development and to run the application on any system. Dockerfiles are declarative definitions of an environment that aim to enable reproducible builds of the container. They can often be found in source code repositories and enable the hosted software to come to life in its execution environment. We conduct an exploratory empirical study with the goal of characterizing the Docker ecosystem, prevalent quality issues, and the evolution of Dockerfiles. We base our study on a data set of over 70,000 Dockerfiles, and contrast this general population with samples that contain the Top-100 and Top-1000 most popular Docker-using projects. We find that the largest share of quality issues (28.6%) arises from missing version pinning (i.e., specifying a concrete version for dependencies). Further, we were not able to build 34% of Dockerfiles from a representative sample of 560 projects. Integrating quality checks, e.g., to issue version pinning warnings, into the container build process could result in more reproducible builds. The most popular projects change more often than the rest of the Docker population, with 5.81 revisions per year and 5 lines of code changed on average. Most changes deal with dependencies, which are currently stored in a rather unstructured manner. We propose to introduce an abstraction that, for instance, could deal with the intricacies of different package managers and could improve migration to more lightweight images.
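
To illustrate what a version pinning warning might look like, the following is a minimal sketch of a check that flags unpinned base images and unpinned apt-get packages in a Dockerfile. It is not the tooling used in the study. The function name find_unpinned and the two rules are assumptions made for illustration, and the sketch ignores line continuations and package managers other than apt-get.

```python
import re


def find_unpinned(dockerfile_text):
    """Return (line_number, message) warnings for unpinned dependencies.

    Hypothetical helper for illustration only; it covers two simple cases.
    """
    warnings = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        stripped = line.strip()

        # Rule 1: a base image should carry a tag other than "latest",
        # or a digest (image@sha256:...).
        match = re.match(r"FROM\s+(\S+)", stripped, re.IGNORECASE)
        if match:
            image = match.group(1)
            if image.lower() == "scratch":
                continue  # the empty base image has no versions
            # Inspect only the part after the last "/" so that a registry
            # port such as "localhost:5000/app" is not read as a tag.
            name = image.rsplit("/", 1)[-1]
            if "@" not in name and (":" not in name or name.endswith(":latest")):
                warnings.append((number, f"base image '{image}' has no fixed tag"))
            continue

        # Rule 2: apt-get packages should use the pkg=version form.
        match = re.search(r"apt-get\s+install\s+(.*)", stripped)
        if match:
            for token in match.group(1).split():
                if token in ("&&", ";", "\\"):
                    break  # end of the install command on this line
                if token.startswith("-"):
                    continue  # option such as -y
                if "=" not in token:
                    warnings.append((number, f"package '{token}' has no pinned version"))
    return warnings


sample = (
    "FROM ubuntu\n"
    "RUN apt-get install -y curl git=1:2.7.4-0ubuntu1 && rm -rf /var/lib/apt/lists/*\n"
    "FROM python:3.6\n"
)
for line_number, message in find_unpinned(sample):
    print(f"line {line_number}: {message}")
```

In this sample, the first FROM line and the curl package are reported, while git (pinned with pkg=version) and python:3.6 (tagged) pass. A check of this kind could run before the image build and either warn or fail, depending on how strict a project wants to be.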

Author Comment

This is a preprint submission to PeerJ Preprints.