How are functionally similar code clones syntactically different? An empirical study and a benchmark

Stefan Wagner; Asim Abdulkhaleq; Ivan Bogicevic; Jan-Peter Ostberg; Jasmin Ramadani

doi:10.7287/peerj.preprints.1516v2

NOT PEER-REVIEWED

"PeerJ Preprints" is a venue for early communication or feedback before peer review. Data may be preliminary.

A peer-reviewed article of this Preprint also exists.

Universität Stuttgart, Stuttgart, Germany

DOI: 10.7287/peerj.preprints.1516v2

Published: 2016-02-09
Accepted: 2016-02-09

Subject Areas: Programming Languages, Software Engineering
Keywords: Code Clone, Functionally Similar Clone, Empirical Study, Benchmark

Copyright: © 2016 Wagner et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Wagner S, Abdulkhaleq A, Bogicevic I, Ostberg J, Ramadani J. 2016. How are functionally similar code clones syntactically different? An empirical study and a benchmark. PeerJ PrePrints 4:e1516v2 https://doi.org/10.7287/peerj.preprints.1516v2

Abstract

Background. Redundancy in source code caused by copy&paste, so-called "clones", can today be found reliably with clone detection tools. Redundancy can also arise independently, however, without any copy&paste. At present, it is not clear how clones that are only functionally similar (functionally similar clones, FSCs) differ from clones created by copy&paste. Our aim is to understand and categorise the syntactic differences in FSCs that distinguish them from copy&paste clones in a way that helps clone detection research.

Methods. We conducted an experiment using known functionally similar programs in Java and C from coding contests. We analysed syntactic similarity with traditional detection tools and explored whether concolic clone detection can go beyond syntax. We ran all tools on 2,800 programs and manually categorised the differences in a random sample of 70 program pairs.

Results. We found no FSCs in which complete files were syntactically similar. We could detect syntactic similarity in part of the files in fewer than 16% of the program pairs. Concolic detection found one of the FSCs. The differences between program pairs fell into the categories algorithm, data structure, OO design, I/O and libraries. We selected 58 pairs for an openly accessible benchmark representing these categories.

Discussion. The majority of differences between functionally similar clones are beyond the capabilities of current clone detection approaches. Yet, our benchmark can help to drive further clone detection research.
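To make the notion of a functionally similar clone concrete, the following is a minimal illustrative sketch. It is not taken from the study's data set, which consists of Java and C programs from coding contests; the function names and the digit-sum task are assumptions chosen only for illustration. The two hypothetical functions return the same result for every integer input, yet they share almost no syntax: one uses integer arithmetic in a loop, the other converts the number to text and relies on built-in library functions.

```python
def digit_sum_arithmetic(n: int) -> int:
    """Sum the decimal digits of n using integer division and modulo."""
    n = abs(n)
    total = 0
    while n > 0:
        total += n % 10   # take the last decimal digit
        n //= 10          # drop the last decimal digit
    return total


def digit_sum_textual(n: int) -> int:
    """Sum the decimal digits of n by converting it to a string."""
    # Different representation (text instead of integers) and different
    # building blocks (str, int, sum), but the same observable behaviour.
    return sum(int(ch) for ch in str(abs(n)))


if __name__ == "__main__":
    # Both implementations agree on every tested input.
    for value in (0, 7, 1234, -908, 10**12):
        assert digit_sum_arithmetic(value) == digit_sum_textual(value)
    print(digit_sum_arithmetic(1234), digit_sum_textual(1234))
```

A detector that compares token sequences would find little in common between these two function bodies, although the functions are interchangeable for a caller. In terms of the categories named in the abstract, this toy pair differs mainly in algorithm and in its use of libraries.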

Author Comment

This version of the preprint is the revision we have sent to PeerJ Computer Science. It contains many small typo corrections and improved explanations. The largest difference is a new attempt to define functionally similar code clones.
