Anatomy of a crash repository

Joshua C Campbell; Eddie Antonio Santos; Abram Hindle

doi:10.7287/peerj.preprints.2601v1

Computing Science, University of Alberta, Edmonton, Alberta, Canada

Published: 2016-11-19
Accepted: 2016-11-19

Subject Areas: Software Engineering
Keywords: Duplicate Bug Reports, Free/Open Source Software, Information Retrieval, Deduplication, Automatic Crash Reporting, Call Stack Trace, Contextual Information, Duplicate Crash Report, Software Engineering

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Campbell JC, Santos EA, Hindle A. 2016. Anatomy of a crash repository. PeerJ Preprints 4:e2601v1 https://doi.org/10.7287/peerj.preprints.2601v1

Abstract

This work investigates the properties of crash reports collected from Ubuntu Linux users. Understanding crash reports is important in order to better store, categorize, prioritize, parse, and triage them, assign bugs to them, and potentially synthesize them. Understanding what is in a crash report, and how the metadata and stack traces in crash reports vary, will help solve, debug, and prevent the causes of crashes. We analyzed and plotted ten different aspects of 40,592 crash reports about 1,921 pieces of software, submitted by users and developers to the Ubuntu project, and fitted statistical distributions to some of them. Many properties of crashes appear to follow distributions similar to standard statistical distributions, but with even longer tails than expected. These aspects of crash reports have not been analyzed statistically before. We found that many applications had only a single crash, while a few applications had a large number of crashes reported. The size of crash buckets (clusters of similar crashes) also followed a Zipf-like distribution. The lifespan of buckets ranged from less than an hour to over four years. Some stack traces were short, and some were so long that they were truncated by the tool that produced them. Many crash reports had no recursion, some contained recursion, and some displayed evidence of unbounded recursion. Linguistics literature hinted that sentence length follows a gamma distribution; this is not the case for function name length. Additionally, only two hardware architectures and a few signals are reported for almost all of the crashes in the Ubuntu dataset. Many crashes were similar, but there were also many unique crashes. This study of crashes from 1,921 projects will be valuable for anyone who wishes to cluster or deduplicate crash reports, synthesize or simulate crash reports, store or triage crash reports, or data-mine crash reports.
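
As a rough illustration of two of the measurements mentioned in the abstract, the Python sketch below tallies bucket sizes and looks for repeated functions in a stack trace. It is not the authors' tooling. The function names, the list-of-strings input format, and the choice to treat a repeated function name as a proxy for recursion are all assumptions made for this example.

```python
from collections import Counter


def bucket_size_ranking(bucket_ids):
    """Return bucket sizes sorted from largest to smallest.

    `bucket_ids` is a hypothetical input: one bucket identifier per crash
    report. The sorted sizes are the rank-size data one would inspect
    (for example on a log-log plot) for a Zipf-like shape.
    """
    return sorted(Counter(bucket_ids).values(), reverse=True)


def max_function_repeats(frames):
    """Return how many times the most frequent function appears in a trace.

    `frames` is a hypothetical list of function names, one per stack frame.
    Assumption: a value above 1 is taken as a simple hint of recursion;
    the paper may define recursion differently.
    """
    return max(Counter(frames).values(), default=0)
```

Under these assumptions, a trace in which one function fills most of a truncated stack would be a candidate for unbounded recursion, while a return value of 1 indicates no repeated function at all.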

Author Comment

This work was submitted to the journal Empirical Software Engineering for peer review. That process was still in progress at the time this preprint was submitted. This version has not passed peer review.