Anatomy of a crash repository

Joshua C Campbell; Eddie Antonio Santos; Abram Hindle

doi:10.7287/peerj.preprints.2601v1

NOT PEER-REVIEWED

"PeerJ Preprints" is a venue for early communication or feedback before peer review. Data may be preliminary.

Computing Science, University of Alberta, Edmonton, Alberta, Canada

Published: 2016-11-19
Accepted: 2016-11-19

Subject Areas: Software Engineering
Keywords: Duplicate Bug Reports, Free/Open Source Software, Information Retrieval, Deduplication, Automatic Crash Reporting, Call Stack Trace, Contextual Information, Duplicate Crash Report, Software Engineering

Copyright: © 2016 Campbell et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Campbell JC, Santos EA, Hindle A. 2016. Anatomy of a crash repository. PeerJ Preprints 4:e2601v1 https://doi.org/10.7287/peerj.preprints.2601v1

Abstract

This work investigates the properties of crash reports collected from Ubuntu Linux users. Understanding crash reports is important to better store, categorize, prioritize, parse, triage, assign bugs to, and potentially synthesize them. Understanding what is in a crash report, and how the metadata and stack traces in crash reports vary, will help solve, debug, and prevent the causes of crashes. Ten different aspects of 40,592 crash reports about 1,921 pieces of software, submitted by users and developers to the Ubuntu project, were analyzed and plotted, and statistical distributions were fitted to some of them. We investigated the structure and properties of crash reports. Crashes have many properties that seem to have distributions similar to standard statistical distributions, but with even longer tails than expected. These aspects of crash reports have not been analyzed statistically before. We found that many applications only had a single crash, while a few applications had a large number of crashes reported. Crash bucket size (the size of clusters of similar crashes) also followed a Zipf-like distribution. The lifespan of buckets ranged from less than an hour to over four years. Some stack traces were short, and some were so long they were truncated by the tool that produced them. Many crash reports had no recursion, some contained recursion, and some displayed evidence of unbounded recursion. Linguistics literature hinted that sentence length follows a gamma distribution; this is not the case for function name length. Additionally, only two hardware architectures and a few signals are reported for almost all of the crashes in the Ubuntu dataset. Many crashes were similar, but there were also many unique crashes. This study of crashes from 1,921 projects will be valuable for anyone who wishes to: cluster or deduplicate crash reports, synthesize or simulate crash reports, store or triage crash reports, or data-mine crash reports.
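As a rough illustration of two of the measurements mentioned in the abstract, the following Python sketch flags recursion in a stack trace and builds a rank-size list of crash buckets. It rests on assumptions that are not taken from the paper: a stack trace is represented as a plain list of function names, any repeated name is treated as evidence of recursion, and buckets are identified by arbitrary string labels. The function names `has_recursion` and `rank_sizes` are hypothetical.

```python
from collections import Counter


def has_recursion(frames):
    """Return True if any function name occurs more than once in the trace.

    Assumption: a repeated name is taken as evidence of recursion. The
    paper may use a stricter or different definition.
    """
    return len(set(frames)) < len(frames)


def rank_sizes(bucket_ids):
    """Return bucket sizes sorted from largest to smallest.

    Plotting these sizes against their rank on log-log axes is a common
    way to eyeball whether a distribution looks Zipf-like.
    """
    return sorted(Counter(bucket_ids).values(), reverse=True)


if __name__ == "__main__":
    # Hypothetical data, invented purely for illustration.
    trace = ["main", "parse", "parse_expr", "parse_expr", "abort"]
    print(has_recursion(trace))

    buckets = ["b1", "b2", "b1", "b3", "b1", "b2"]
    print(rank_sizes(buckets))
```

A simple name-repetition check like this cannot tell bounded from unbounded recursion. Doing so would need further information, for example whether the trace was truncated by the tool that produced it.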

Author Comment

This article was submitted to the journal Empirical Software Engineering for peer review. That process is still in progress at the time of submission. This version has not passed peer review.

3311 days ago - Daniel Bilar

Much enjoyed this empirical work and will ponder what (if anything) can be inferred from the statistical distributions.

I have two comments:

1) Crash reports have largely unexplored potential, for instance as evidence of (buggy) 0day exploits in computer security. See the analysis of Windows Error Reporting data (kept under tight wraps by Microsoft) in the Websense 2014 report "USING ANOMALIES IN CRASH REPORTS TO DETECT UNKNOWN THREATS" http://www.websense.com/assets/reports/websense-crash-report-en.pdf

They field two examples: discovering a new APT attack on a global telecommunication company and a government entity, and a previously unreported campaign against point-of-sale (POS) systems.

2) On the function name lengths, deduplication goals and tokens, see also Hatton "Power-laws and the Conservation of Information in discrete token systems: Part 1 General Theory"

"Using variational principles suggested in [10], [11] and using the principle of the Conservation of Information, it is predicted that the probability $p_i$ of a component appearing with $t_i$ tokens in any software system, whatever its implementation details, obeys the following distribution with respect to the size of its unique alphabet of tokens $a_i$,

$p_i \sim (a_i)^{-\beta}$ (26)

Overwhelming evidence for this behaviour has been presented derived from some 55.5 million lines of code in six languages with an associated p-value of $< 2.2 \times 10^{-16}$.

The behaviour exemplified by (26) has been demonstrated to be persistent through the life of single software systems as exemplified by three very disparate systems."

Best regards, will study your paper more in depth over the holidays

Daniel Bilar
