Duplicate Question Detection in Stack Overflow: A Reproducibility Study

Faculdade de Ciência da Computação, Universidade Federal de Uberlândia, Uberlândia, Minas Gerais, Brazil
DOI
10.7287/peerj.preprints.26555v1
Subject Areas
Data Mining and Machine Learning, Data Science, Social Computing, Software Engineering
Keywords
Question quality, Stack Overflow, Duplicate questions, Classification
Copyright
© 2018 Silva et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Silva RFG, Paixao KV, Maia MdA. 2018. Duplicate Question Detection in Stack Overflow: A Reproducibility Study. PeerJ Preprints 6:e26555v1

Abstract

Stack Overflow has become a fundamental element of the developer toolset. This growing influence has been accompanied by an effort from the Stack Overflow community to maintain the quality of its content. One problem that jeopardizes that quality is the continuous growth of duplicated questions. To address this problem, prior work has focused on automatically detecting duplicated questions. Two important solutions are DupPredictor and Dupe. Despite reporting significant results, neither work makes its implementation publicly available, which hinders subsequent work in the scientific literature that relies on them. We conducted an empirical study to reproduce DupPredictor and Dupe. Our results, which were not robust across different sets of tools and data sets, show that the barriers to reproducing these approaches are high. Furthermore, when the two reproductions are applied to more recent data, we observe that the performance of both decays in terms of recall-rate over time, as the number of questions increases. Our findings suggest that subsequent work on detecting duplicated questions in Question and Answer communities requires more investigation to support its findings.

Author Comment

This paper has been accepted for publication in the proceedings of the 25th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2018).