Looking for related posts on GitHub discussions

PeerJ Computer Science

Introduction

  • The RD-Detector, an approach based on deep machine learning models to detect related posts on GitHub Discussions.

  • Empirical evidence on using a general-purpose machine learning model to detect related posts created by software communities.

  • Empirical evidence regarding the RD-Detector practical applications in communities from three OSS maintainers’ perspectives.

Background

Communication within GitHub

Related work

The RD-Detector approach

  • Duplicate posts are those with the same content. The duplicate post items (title, description, and author) can be exact or close copies. Posts’ authors can rewrite some items by adding or deleting information. Detecting duplicate posts is essential to mitigate the duplication problem in collaborative discussion forums of developers.

  • Near-duplicate posts are those posts with similar topics. Different users with similar interests, questions, or ideas create and comment on them. Items of near-duplicate posts (title, description, and author) are not the same but share information related to each other’s content. Detecting near-duplicate posts is essential to disseminate the project knowledge.

Preprocessing

Relatedness checker

Similarity checker

Selection of related post candidates

  • 1. Defining the K value: The K value delimits the search bounds for related post candidates. The RD-Detector uses the K value to select the similarity values of the top-K most similar posts to each discussion post in the input dataset. The greater the value of K, the greater the number of similarity values selected. Setting K=3, the approach uses the similarity values of the top-3 most similar discussion posts to every post in the dataset. Setting K=20, the RD-Detector selects the similarity values of the top-20 most similar discussion posts. The K value can range from 1 to n − 1, where n is the number of discussion posts in the dataset. K is an input value (Algorithm 1).

  • 2. Creating the distribution S: The S distribution is a collection of similarity values. S contains the similarity values of the K most similar target discussion posts for each discussion post in the dataset (Algorithm 1, lines 14–19).

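    Let n be the number of discussion posts in the dataset, K the number of the most similar target posts for every discussion post in the dataset, and $\mathit{value}_{i\_j}$ the similarity value of a given $\mathit{master}_i$ and $\mathit{target}_j$ post pair. The distribution S is then:

    $$S = \langle \mathit{value}_{1\_1}, \mathit{value}_{1\_2}, \ldots, \mathit{value}_{1\_K}, \mathit{value}_{2\_1}, \mathit{value}_{2\_2}, \ldots, \mathit{value}_{2\_K}, \ldots, \mathit{value}_{n\_1}, \mathit{value}_{n\_2}, \ldots, \mathit{value}_{n\_K} \rangle$$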

  • 3. Determining the descriptive statistics of S: We use descriptive measures of variability to understand how dispersed the distribution S is. To this end, we calculate the interquartile range (IQR), along with the 25th percentile ($Q_1$), the 50th percentile ($Q_2$), and the 75th percentile ($Q_3$) (Algorithm 1, lines 20–23). Next, we find the upper inner fence value (Eq. (1)), which identifies the outliers in S (Tukey, 1977):

    $$\text{Upper inner fence} = Q_3 + (1.5 \times IQR) \tag{1}$$

  • 4. Setting the local threshold ($T_{related}$): Because we assume that the greater the semantic similarity value of a pair of posts, the greater the chance that they are related posts, we set the local threshold to the upper inner fence value (Algorithm 1, line 24). Therefore, we have that:

    $$T_{related} = \text{Upper inner fence}$$

    The K value defines the size of the S distribution. Consequently, it changes the values of $Q_1$, $Q_2$, $Q_3$, and IQR that summarize S. As a result, it also changes the local threshold value, $T_{related}$. Since $T_{related}$ is directly influenced by S, we call $T_{related}$ the ‘local threshold’.

After setting the local threshold, the RD-Detector detects the pairs of candidates for related posts. Related posts are those pairs with similarity values equal to or greater than the local threshold. RD-Detector outputs R, the set of related post candidates (Algorithm 1, lines 26-30). We consider related posts those pairs identified as outliers in the S distribution. Calefato et al. (2021) also use descriptive statistics to identify core OSS developers’ inactivity periods.
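
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Algorithm 1: the function names `percentile` and `related_candidates`, the use of a precomputed n×n similarity matrix as input, and the linear-interpolation rule for the quartiles are all assumptions made for this example (the text does not say how the percentiles are computed). Pairs are reported as ordered (master, target) index tuples.

```python
from typing import List, Tuple


def percentile(sorted_values: List[float], q: float) -> float:
    """Percentile of an ascending list, q in [0, 1].

    Assumption: linear interpolation between the two closest ranks.
    """
    if not sorted_values:
        raise ValueError("empty distribution")
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def related_candidates(
    sim: List[List[float]], k: int
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (T_related, R) for a hypothetical n x n similarity matrix `sim`."""
    n = len(sim)
    # Step 1: K is an input and must lie between 1 and n - 1.
    if not 1 <= k <= n - 1:
        raise ValueError("K must be between 1 and n - 1")

    # Step 2: build S from the top-K similarity values of every post.
    top_k = []  # (master, target, similarity) triples
    for i in range(n):
        others = sorted(
            ((sim[i][j], j) for j in range(n) if j != i), reverse=True
        )
        top_k.extend((i, j, value) for value, j in others[:k])
    s = sorted(value for _, _, value in top_k)

    # Step 3: quartiles, IQR, and the upper inner fence (Eq. (1)).
    q1 = percentile(s, 0.25)
    q3 = percentile(s, 0.75)
    upper_inner_fence = q3 + 1.5 * (q3 - q1)

    # Step 4: the local threshold equals the upper inner fence.
    t_related = upper_inner_fence

    # Related post candidates: pairs at or above the local threshold.
    related = [(i, j) for i, j, value in top_k if value >= t_related]
    return t_related, related


if __name__ == "__main__":
    # Toy similarity matrix (made-up values): posts 0 and 1 are very similar.
    sim = [
        [1.00, 0.95, 0.10, 0.12, 0.11],
        [0.95, 1.00, 0.13, 0.10, 0.12],
        [0.10, 0.13, 1.00, 0.14, 0.11],
        [0.12, 0.10, 0.14, 1.00, 0.13],
        [0.11, 0.12, 0.11, 0.13, 1.00],
    ]
    threshold, pairs = related_candidates(sim, k=2)
    print(threshold, pairs)
```

In this toy input, only the pair of posts 0 and 1 lies far above the bulk of S, so it is the only pair that clears the fence. Because the threshold is derived from S, changing K changes S and therefore the threshold, which is why it is local to a given run.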

Assessing RD-Detector over GitHub Discussions forums

Data collection—GitHub Discussions

Dataset characterization

Preprocessing phase applied to discussions dataset

Relatedness checker applied to GitHub Discussions

The RD-Detector evaluation

Results

Discussions

The impacts of changing the K value

False-positive RD-Detector predictions

RD-Detector practical applications

  • 1. to combine the posts’ content merging the related posts—“These (post) could have been combined into one discussion and it would have made sense…” (M_Homebrew), “…if it were up to me, they should have gone together in the same discussion.” (M_Homebrew).

  • 2. to move a post content to another location reorganizing the discussion threads as comments to each other—“…the new issue should probably have been posted as a comment in the master discussion” (M_Homebrew), “…(the posts) would’ve received better traction as a comment on one another.” (M_Next.js), “This discussion could’ve sufficed as a comment on 21633.” (M_Next.js).

  • 3. to recruit collaborators for specific tasks—“…could be useful though for people looking for other guides to contribute to (in this instance).” (M_Gatsby).

Contrasting the RD-Detector measurements

The implications of this research

Limitations

Conclusion

Additional Information and Declarations

Competing Interests

Evangeline Liu and Grace Vorreuter participated in the GitHub Discussions engineering team. Denae Ford is employed by Microsoft Research. The other authors have no competing interests to declare that are relevant to the content of this article.

Author Contributions

Marcia Lima conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Igor Steinmacher conceived and designed the experiments, performed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Denae Ford performed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Evangeline Liu performed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Grace Vorreuter performed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Tayana Conte conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Bruno Gadelha conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The data is available at GitHub and Zenodo:

- https://github.com/marcia-lima/RD-Detector/tree/v.1-beta.

- marcia-lima. (2023). marcia-lima/RD-Detector: v.1 (v.1-beta). Zenodo. https://doi.org/10.5281/zenodo.7982576.

Funding

This work was supported by CNPq through processes number 314174/2020-6 and 313067/2020-1, CAPES financial code 001, FAPESP under grant 2020/05191-2, FAPEAM through process number 062.00150/2020. This research was carried out within the scope of the Samsung-UFAM Project for Education and Research (SUPER), according to Article 48 of Decree number 6.008/2006 (SUFRAMA). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
