Duplicate Question Detection in Stack Overflow: A Reproducibility Study

Rodrigo F G Silva; Klerisson V Paixao; Marcelo de A. Maia

doi:10.7287/peerj.preprints.26555v1

Faculdade de Ciência da Computação, Universidade Federal de Uberlândia, Uberlândia, Minas Gerais, Brazil

Published: 2018-02-21
Accepted: 2018-02-21

Subject Areas: Data Mining and Machine Learning, Data Science, Social Computing, Software Engineering
Keywords: Question quality, Stack Overflow, Duplicate questions, Classification

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Silva RFG, Paixao KV, Maia MdA. 2018. Duplicate Question Detection in Stack Overflow: A Reproducibility Study. PeerJ Preprints 6:e26555v1 https://doi.org/10.7287/peerj.preprints.26555v1

Abstract

Stack Overflow has become a fundamental element of the developer toolset. This growing influence has been accompanied by an effort from the Stack Overflow community to maintain the quality of its content. One problem that jeopardizes that quality is the continuous growth in the number of duplicated questions. To address this problem, prior work has focused on automatically detecting duplicated questions. Two important solutions are DupPredictor and Dupe. Despite reporting significant results, neither work makes its implementation publicly available, which hinders subsequent work in the scientific literature that relies on them. We conducted an empirical study as a reproduction of DupPredictor and Dupe. Our results, which were not robust across different sets of tools and data sets, show that the barriers to reproducing these approaches are high. Furthermore, when the reproductions are applied to more recent data, we observe that the recall-rate of both decays over time as the number of questions increases. Our findings suggest that subsequent work on detecting duplicated questions in Question and Answer communities requires further investigation to substantiate its findings.

Author Comment

This paper has been accepted for publication in the proceedings of the 25th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2018).