Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on March 19th, 2025 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on July 1st, 2025.
  • The first revision was submitted on August 20th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • The article was Accepted by the Academic Editor on September 20th, 2025.

Version 0.2 (accepted)

· Sep 20, 2025 · Academic Editor

Accept

The paper may be accepted.

[# PeerJ Staff Note - this decision was reviewed and approved by Xiangjie Kong, a PeerJ Section Editor covering this Section #]

Reviewer 2 ·

Basic reporting

-

Experimental design

-

Validity of the findings

-

·

Basic reporting

The article is written in technical English, suitable for a scientific journal in the fields of software engineering and artificial intelligence. It is clear and unambiguous, utilizing specialized vocabulary and correct terminology specific to deep learning and natural language processing (NLP). No major spelling or grammatical errors were detected.

The article provides a moderate review of the existing literature, with a particular focus on enhancement report classification approaches, bug reports, sentiment analysis in software engineering, and prioritization systems. The need for this work is highlighted based on an identified gap: the combined use of non-textual features with deep learning-based models. The context is adequately supported.

The article follows a standard format, comprising an introduction, related works, methodological proposal, evaluation, discussion, and conclusions. It is self-contained, clearly stating the problem, the hypothesis (improving the prediction of feasible ERs), and the proposed method (using BERT + non-textual features) while presenting a robust set of results that support the conclusions.

Experimental design

The study represents primary and original research that aligns with the scope of journals specializing in software engineering, artificial intelligence (AI), and natural language processing (NLP). The objective is clearly defined: to improve the automatic identification of feasible ERs. The article justifies its proposal by comparing it to previous approaches. It highlights a specific contribution to the field: the combined use of textual and non-textual features with a BERT-based model, which fills a gap in the existing literature. The experimentation is carefully designed. Real data (40,000 ERs from Bugzilla) are utilized, and various models (SVM, LSTM, CNN, BERT) are compared across multiple metrics. The ethics of the field are respected by using public datasets.

Validity of the findings

The paper not only proposes a new model but also makes direct comparisons with previous work (replicating and improving the results of Nizamani, Umer, and Cheng). This adds value to the literature. Furthermore, it performs sensitivity analysis (by disabling features) and statistical tests (ANOVA and Wilcoxon) to demonstrate significant differences between the approaches. All of this strengthens the internal validity of the experiment. The data used comes from a widely cited public dataset. The cross-validation tests are well-controlled, and standard metrics commonly used in the community are employed.

Additional comments

The article addresses a relevant problem in software engineering and applies modern NLP techniques. It is exhaustive in its comparison with previous work, solid in its experimental design, and meticulous in the presentation of results. Its greatest strength is the combination of non-textual features with deep learning, something little explored in previous work.

Version 0.1 (original submission)

· Jul 1, 2025 · Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**Language Note:** The review process has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff

Reviewer 1 ·

Basic reporting

This paper proposes a novel BERT-based approach for the automatic identification of feasible enhancement reports (ERs). The objective is to combine BERT's natural language processing capabilities with non-textual information (reporter- and module-level report statistics, and the sentiment of the report).

Experimental design

The experimental design is robust and demonstrates originality by combining non-textual features specific to ERs with the BERT model to answer relevant research questions. This approach fills a gap by exploring previously unexploited features and techniques (BERT).

Validity of the findings

The results demonstrate substantial performance improvements, supported by appropriate statistical tests (ANOVA and Wilcoxon for RQ1) and by well-conducted ablation (RQ2) and classifier-comparison (RQ3) studies; impact and novelty are not formally assessed, in line with the journal's guidelines. The use of a public dataset is a strength, and while enriching it with non-textual features is a key contribution, documenting this augmentation would help validate the results. The conclusions drawn are well supported by the results and effectively address the research questions, and the main claim of improvement (e.g., accuracy from 82.38% to 94.02%) is reasonably grounded in the average performance across the different S-APPs.

Additional comments

1. Detail how the numerical sentiment score is formatted and combined with other features for BERT input (an illustrative sketch follows this list).

2. Refine Section 2.5: Make the bibliometric analysis more concise, or link it more clearly to the paper's specific methods.

3. Discussion: Briefly suggest future work on model explainability (e.g., LIME/SHAP) in the conclusion.

4. Specify Average Performance: Explicitly state in the text that key performance improvements (from Table 2) are averages across S-APPs.

5. Detail Non-Textual Feature Engineering: Provide a thorough, step-by-step explanation of how non-textual features (nR, rR, nM, rM) were calculated, integrated with other inputs, and how data leakage was prevented.
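
For concreteness, the sketch below shows one common way to combine a numeric sentiment score and reporter/module statistics with BERT's [CLS] representation, touching on points 1 and 5 above. It is purely illustrative and not the authors' implementation: it assumes PyTorch and the HuggingFace transformers library, and the class name, classification head, and feature values are hypothetical.

```python
# Purely illustrative sketch -- not the authors' implementation.
# Assumes torch + transformers; the feature vector (sentiment, nR, rR, nM, rM) is hypothetical.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ERClassifier(nn.Module):
    """BERT text encoder concatenated with scaled non-textual features before a small head."""
    def __init__(self, num_extra_features: int = 5, num_classes: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size            # 768 for bert-base
        self.head = nn.Sequential(
            nn.Linear(hidden + num_extra_features, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, input_ids, attention_mask, extra_features):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :]              # [CLS] token representation
        return self.head(torch.cat([cls, extra_features], dim=-1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Please add dark mode to the settings page", truncation=True,
                max_length=256, padding="max_length", return_tensors="pt")
extra = torch.tensor([[0.8, 12.0, 0.35, 40.0, 0.12]])    # sentiment, nR, rR, nM, rM (made-up values)
logits = ERClassifier()(enc["input_ids"], enc["attention_mask"], extra)
```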

Reviewer 2 ·

Basic reporting

The abstract and introduction contain overstatements about novelty, claiming to be the first to combine non-textual features and DL methods without sufficiently acknowledging related prior work.
Feature engineering is poorly described.

Experimental design

How were the statistics per reporter and module extracted from the raw Bugzilla data? (One possible extraction is sketched below, for illustration.)
The resampling techniques are vaguely referenced, without a clear specification of how oversampling or undersampling was implemented.
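
To illustrate the first question only: per-reporter and per-module statistics could, for example, be derived with pandas as sketched below. The column names, the toy data, and the reading of nR/rR/nM/rM as counts and approval ratios are assumptions for illustration, not the authors' actual schema or definitions.

```python
# Hypothetical illustration only: deriving per-reporter / per-module statistics
# from a raw Bugzilla export. Column names and feature definitions are assumptions.
import pandas as pd

ers = pd.DataFrame({
    "reporter": ["alice", "alice", "bob", "carol"],
    "module":   ["ui",    "core",  "ui",  "core"],
    "approved": [1,       0,       1,     0],
})

# nR: number of ERs filed by each reporter; rR: that reporter's approval ratio.
# In practice these should be computed on the training data only to avoid leakage.
reporter_stats = ers.groupby("reporter")["approved"].agg(nR="count", rR="mean")
# nM / rM: the same statistics per module.
module_stats = ers.groupby("module")["approved"].agg(nM="count", rM="mean")

features = (ers.join(reporter_stats, on="reporter")
               .join(module_stats, on="module"))
print(features)
```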

Validity of the findings

There is no mention of validation strategies, such as k-fold cross-validation or temporal validation. An 80/20 split is stated, but it’s unclear if data leakage is avoided.

Input truncation at 256 tokens is not motivated or evaluated.

The comparative models (SVM, LSTM, CNN) appear under-tuned or used with minimal configuration, which raises fairness concerns in comparison with the BERT model.

The study claims statistically significant improvements but does not provide p-values for all metrics, only for the F1 score.

·

Basic reporting

The article is written in technical English, making it suitable for a scientific journal in the fields of software engineering and artificial intelligence. The writing is clear and unambiguous, utilizing specialized vocabulary and correct terminology specific to deep learning and natural language processing (NLP). No major spelling or grammatical errors were detected. Some passages could benefit from minor stylistic revision to eliminate redundancies and improve flow, but overall, it meets an acceptable professional standard.

The article provides a moderate review of the existing literature, with a particular focus on enhancement report classification approaches, bug reports, sentiment analysis in software engineering, and prioritization systems. Relevant works such as those by Nizamani et al. (2017), Umer et al. (2019), and Cheng et al. (2021) are cited, as well as classical approaches such as MNB, SVM, RNN, and more recent ones, including BERT. The need for this work is highlighted based on an identified gap: the combined use of non-textual features with deep learning-based models. The context is adequately supported.

The article follows a standard format, comprising an introduction, related works, methodological proposal, evaluation, discussion, and conclusions. The figures (e.g., architecture diagrams, Word Cloud, co-occurrence network) are relevant, well-labeled, and contribute to the understanding of the text. The tables (such as those with comparative results) are clear, well-organized, and present standard metrics (accuracy, precision, recall, F1). Although data from a public dataset is used, not all the scripts or notebooks necessary to reproduce the experiment are linked; providing them would be ideal under open data policies.

The article is self-contained, clearly stating the problem, the hypothesis (improving the prediction of feasible ERs), and the proposed method (using BERT + non-textual features), while presenting a robust set of results that support the conclusions. It is not artificially fragmented to inflate publications. The unnecessary use of jargon or technical terms is avoided, and sufficient information is provided to enable understanding and contextualization of the findings within the field. The hypothesis is adequately developed and tested with different experiments.

Experimental design

The study represents primary and original research that aligns with the scope of journals specializing in software engineering, artificial intelligence (AI), and natural language processing (NLP). The objective is clearly defined: to improve the automatic identification of feasible ERs. The article justifies its proposal by comparing it to previous approaches. It highlights a specific contribution to the field: the combined use of textual and non-textual features with a BERT-based model, which fills a gap in the existing literature.

The experimentation is carefully designed. Real data (40,000 ERs from Bugzilla) are utilized, and various models (SVM, LSTM, CNN, BERT) are compared across multiple metrics. The ethics of the field are respected by using public datasets, and the limitations of the data source (e.g., changes in ER status over time) are acknowledged. Furthermore, threats to internal and external validity are reported, demonstrating methodological awareness.

The article clearly describes text preprocessing, the use of the Senti4SD tool for sentiment analysis, tokenization with BERT, padding and truncation steps, and the model architecture. It also details the training configuration (hyperparameters, batch size, number of epochs, computing environment). However, a direct link to a code repository is needed to facilitate full replicability of the study, which would be an essential complement in an open science environment.
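
As a reference point for readers, a minimal fine-tuning configuration of the kind described might look like the sketch below, assuming the HuggingFace transformers library. All hyperparameter values shown are common defaults for illustration and are not necessarily the exact settings reported in the manuscript.

```python
# Illustrative fine-tuning setup only; the authors' exact settings may differ.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def encode(batch):
    # Padding/truncation to a fixed length, as described in the manuscript (256 tokens).
    return tokenizer(batch["text"], truncation=True, max_length=256, padding="max_length")

args = TrainingArguments(
    output_dir="er-bert",
    num_train_epochs=4,               # illustrative
    per_device_train_batch_size=16,   # illustrative
    learning_rate=2e-5,               # common BERT fine-tuning default
    weight_decay=0.01,
)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```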

Validity of the findings

The paper not only proposes a new model but also makes direct comparisons with previous work (replicating and improving the results of Nizamani, Umer, and Cheng). This adds value to the literature. Furthermore, it performs sensitivity analysis (by disabling features) and statistical tests (ANOVA and Wilcoxon) to demonstrate significant differences between the approaches. All of this strengthens the internal validity of the experiment.

The data used comes from a widely cited public dataset. Although the paper manually extracts some non-textual features from Bugzilla (because the original dataset does not contain them), a supporting repository with these enriched data is not provided, which reduces transparency. The cross-validation tests are well-controlled, and standard metrics commonly used in the community are employed. However, more details could be added on how the imbalance in the training set is managed.

The conclusions are well aligned with the results obtained. Exaggerated claims are avoided. Percentage improvements over the state of the art are quantified and discussed. Limitations (such as the "black box" nature of the model or the need to explore additional features in future work) are acknowledged, and future lines of work are proposed. Causality is never established without adequate experimental evidence.

Additional comments

The article addresses a relevant problem in software engineering and applies modern NLP techniques. It is exhaustive in its comparison with previous work, solid in its experimental design, and meticulous in the presentation of results. Its greatest strength is the combination of non-textual features with deep learning, something little explored in previous work.

List of Minor Recommendations
Publish the source code of the proposed model: provide a repository (e.g., on GitHub) with the code used to train and evaluate the BERT model, as well as the preprocessing scripts and statistical analysis.

Share enriched data: Although the base dataset (Bugzilla) is publicly available, the authors have enriched the data with non-textual variables. It is recommended that this modified dataset be made public to facilitate replication.

English stylistic revision in some sections: Although the article is well-written, there are sentences with redundant or clumsy constructions that could benefit from professional editing to improve clarity.

Greater detail on class imbalance handling: Although the use of resampling techniques is mentioned, it would be helpful to describe more precisely which method was used (e.g., SMOTE, random oversampling) and how it was implemented (a combined illustrative sketch follows this list).

Clarify the selection and justification of hyperparameters: Although it is mentioned that the default parameters of the BERT model were used (e.g., learning rate, batch size, epochs), it would be valuable to explain whether other values were tested or any fine-tuning was performed, and under what criteria.

Specify the statistical tests used to validate the results: ANOVA and Wilcoxon are mentioned, but it is not explained whether assumptions such as normality or homogeneity of variance were verified. It is recommended to briefly justify the choice of these tests and their appropriateness to the context.

Better explain the dataset segmentation (train/test): Although an 80/20 split is mentioned, it would be advisable to indicate whether cross-validation (e.g., k-fold) was used, whether stratification was used, or whether the process was repeated multiple times to reduce bias.

Include a brief discussion on the approach's scalability: The article focuses on improving metrics, but it would be helpful to include a reflection on computational costs and the viability of the model in production or resource-limited environments.
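
To make the resampling, statistical-testing, and validation recommendations above concrete, the following is a single illustrative sketch, assuming scikit-learn and SciPy with stand-in data and stand-in classifiers: random oversampling applied only to the training folds (avoiding leakage into the test fold), repeated stratified k-fold evaluation, and a normality check on the paired differences before choosing between parametric and non-parametric tests. It is not the authors' implementation, and the helper function is hypothetical.

```python
# Purely illustrative: repeated stratified k-fold, oversampling on training folds only,
# and assumption checks before the paired statistical tests. Not the authors' code.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)  # stand-in data

def oversample(X_tr, y_tr, seed=42):
    """Random oversampling of the minority class, fitted on the training fold only."""
    minority = np.flatnonzero(y_tr == np.argmin(np.bincount(y_tr)))
    extra = resample(minority, replace=True,
                     n_samples=np.bincount(y_tr).max() - minority.size, random_state=seed)
    idx = np.concatenate([np.arange(len(y_tr)), extra])
    return X_tr[idx], y_tr[idx]

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
scores_a, scores_b = [], []
for tr, te in cv.split(X, y):
    X_bal, y_bal = oversample(X[tr], y[tr])              # resample inside the fold only
    scores_a.append(f1_score(y[te], LogisticRegression(max_iter=1000).fit(X_bal, y_bal).predict(X[te])))
    scores_b.append(f1_score(y[te], LinearSVC().fit(X_bal, y_bal).predict(X[te])))

# Check normality of the paired differences before choosing the test
diffs = np.array(scores_a) - np.array(scores_b)
print(stats.shapiro(diffs))                 # Shapiro-Wilk normality check
print(stats.wilcoxon(scores_a, scores_b))   # non-parametric paired test (Wilcoxon signed-rank)
print(stats.f_oneway(scores_a, scores_b))   # one-way ANOVA (more useful with >2 classifiers)
```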

Future work could include:
Comparison with more current deep learning models: While the paper compares against LSTM, CNN, and SVM, adding comparisons with models such as RoBERTa, DistilBERT, or lightweight Transformer variants would better contextualize the proposed advancement.
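
As a brief illustration of how low the barrier to such a comparison is, with the HuggingFace transformers API these models are near drop-in replacements for BERT; the checkpoint identifiers below are the public model names, and the snippet is a sketch rather than a recommended protocol.

```python
# Illustrative only: the suggested comparison models are near drop-in replacements --
# only the checkpoint name changes, the rest of the fine-tuning pipeline stays the same.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

for checkpoint in ["roberta-base", "distilbert-base-uncased"]:  # lighter/alternative encoders
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    # ...fine-tune and evaluate with the same training setup as for bert-base-uncased
```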

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.