Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on January 30th, 2025 and was peer-reviewed by 2 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on May 13th, 2025.
  • The first revision was submitted on June 6th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • A further revision was submitted on June 30th, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • A further revision was submitted on August 4th, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • A further revision was submitted on August 25th, 2025 and was reviewed by the Academic Editor.
  • The article was Accepted by the Academic Editor on August 27th, 2025.

Version 0.5 (accepted)

· Aug 27, 2025 · Academic Editor

Accept

Thank you for your new version. As all comments have now been addressed, I'm happy to accept your paper.

[# PeerJ Staff Note - this decision was reviewed and approved by Shawn Gomez, a PeerJ Section Editor covering this Section #]

Version 0.4

· Aug 20, 2025 · Academic Editor

Minor Revisions

As you can see, the reviewer would still like a few additional changes. As we have already been through several iterations with this paper, I hope you can make these final changes now.

·

Basic reporting

Concerning remark 1.3: the bibliography now appears almost complete. Some minor typos and omissions remain, but I expect these can be handled during the production process.

Concerning remark 1.4: the abstract is still ambiguous. A final short statement was added, but the misleading statement itself was not touched at all:
"Empirical results show that compared with the best performing machine learning and deep learning baselines in our experiments, the fine-tuning GPT-3.5-turbo model improves ..."

A possible formulation might be, for instance:
"Empirical results show that, compared with a set of traditional machine learning and deep learning techniques, the fine-tuning GPT-3.5-turbo model improves ..."

Experimental design

nothing to add

Validity of the findings

nothing to add

Additional comments

nothing to add

Version 0.3

· Aug 4, 2025 · Academic Editor

Minor Revisions

As you can see, the reviewer still has a few additional remarks. Please make sure to address all of them before resubmitting.

·

Basic reporting

Most comments have been addressed, with the exception of two, which I mention here (following the IDs used in the response letter). The first of them is minor; the second one is important to me.

Comment 1.3 is partially addressed:
the bibliography still includes details that I cannot reliably parse. In particular, the meaning of various annotations ([C], [J], (red)), which appear repeatedly, is not clear to me.

Comment 1.4 is not properly addressed.
The correction made to the abstract does not resolve the problem I raised.
The proposed solution was not shown to be better than the best machine learning baselines. Rather, it was shown to perform better than a specific set of techniques, which does not include LLM-based approaches. The formulation provided in the revision is still ambiguous. This point should be made clear, not only in the abstract but also throughout the paper, and notably in the discussion.

Experimental design

nothing to add

Validity of the findings

nothing to add (apart from the way the results are presented, in relation to the above-mentioned comment 1.4)

Additional comments

nothing to add

Version 0.2

· Jun 23, 2025 · Academic Editor

Minor Revisions

As you can see, the reviewers are happy with how you improved the paper but still have several points of feedback. Please make sure to address those in the new version of your paper (in particular the major issues mentioned by reviewer 1).

·

Basic reporting

The presentation still needs improvement.
Sections and subsections are still difficult to identify and to refer to: giving each section a number would help the reader.
Here and there, the English still needs some polishing, and some repetitions could be removed.
The BibTeX has many flaws.

Experimental design

The work was improved, addressing most of the comments.

I still have some remarks.

[Major] The authors claim to compare with “the best machine learning and deep learning”. The comparison is indeed extensive, but I still do not see arguments sufficient to say that what they compare against was actually the best possible benchmark. Intuitively, a comparison with some other LLM-based solution might work better than several of the addressed approaches (e.g., what about the tool …?). At this point of the editing process, this issue cannot be addressed by extending the evaluation, but a discussion of this point should be added.
In particular, what about the approach of reference [12]?
Its title and abstract appear to be strongly related to this work. However, it is mentioned in the intro only as an example of “studies using deep learning models such as convolutional neural networks (CNN) and long short-term memory networks (LSTM) to capture more complex semantic features [12]”, without mentioning that [12] also addresses the usage of LLMs. The related works section does not add anything on this point.
Besides, Senti4SD [14] is ruled out of the treatment by merely mentioning “limitations in handling negative sentiments with technology-specific terms. For example, Senti4SD failed to correctly identify negative sentiments containing terms like ‘FIXME’, which caused the accuracy of sentiment recognition to suffer.” However, reference [14] is just a one-page extended abstract. So where is the evidence for the mentioned limitations provided?
Finally, how can we say that the traditional machine learning and deep learning methods included in the benchmark are a good realization of the results those methods can achieve? Were these methods implemented ad hoc for this study by the authors themselves? If so, this is a relevant threat to validity, which can be accepted but needs to be explicitly mentioned and discussed.

[-] I still do not see an explicit discussion of threats to validity.

[Minor] The usage of GPT-3.5 instead of GPT-4 is not properly argued. To be clear: I see no problem in reporting a study that uses GPT-3.5 even though it is now superseded by GPT-4 in various respects; the concept of the study is still valid. However, I would not like to publish a simplistic statement claiming that GPT-3.5 is better than GPT-4. The additional references added in the revision are far from convincing. Overall, I would rather simply state that this study was performed with GPT-3.5, without an unnecessary statement about why GPT-4 was not used.

[Minor] In the results, the proposed fine-tuned GPT-3.5 approach shows improved performance over the compared methods. The interpretation “This consistent top ranking can be attributed to its complex Transformer architecture, extensive pre-training, and task specific fine-tuning, which enables it to capture the complex semantic nuances that are critical to this task” is reasonable and offers a hint for interpretation, but it is not supported by concrete evidence (e.g., by an ablation experiment). The statement should therefore be reformulated more prudently.

Validity of the findings

(included in the previous box)

Reviewer 2 ·

Basic reporting

This review relates to the revised version of the paper. Having checked the author responses and the corrections/extensions included in the submitted revision, I can state that the authors have improved the text in line with my comments, especially those requesting more precise specification or explanation, so the revised text is more reader-friendly.
Minor remarks: the list of examples in lines 292-308 could be extended for better illustration. There are also some extensions related to the requests of the first reviewer; they explain some issues, but this should be evaluated by the first reviewer.
I suggest that the authors comment more substantively on how the presented methodology and results relate to widely discussed problems in the literature on issue-handling schemes, text mining of software repository contents, classification of issue types, and comments and their sequences across issue handling within the project development period (some references could be added here). This aspect could be used to underline the practical usefulness of the study for software development and assessment.

Experimental design

No corrections are needed.

Validity of the findings

Compare point 1 of my review and my review of the previous paper version.

Additional comments

In point 1 I have proposed some suggestions which are not obligatory but could make the paper more attractive to readers.


Version 0.1 (original submission)

· May 13, 2025 · Academic Editor

Major Revisions

Please address the comments by the reviewers in your new version, in particular the major points mentioned by reviewer 1.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

·

Basic reporting

The manuscript is easy to read, and the English is appropriate, with some minor flaws; some of these are mentioned below.

At the end of the Intro, the organization of the work is described in terms of Sect. 1, 2, 3 …, but the subsequent text does not number its sections, and the formatting is poor, so it is not immediately possible to distinguish what is a section and what is a subsection, nor to check whether all the promised contents appear. In particular, Materials and Methods seems to have a larger font than Experimental setup ... and a different use of capitalization.

In the Abstract, the concept of “complex sentiments” remains unclear.
I am not able to interpret with confidence the meaning of “using GPT for prompt engineering has achieved significant performance improvements in sentiment analysis tasks [15]”. Is this about using GPT to create a prompt, or about engineering the prompt for GPT?

In the initial statement of contribution #2, it is not sufficiently clear whether the GPT model used in the comparison is GPT itself or its fine-tuned version. The correct interpretation emerges only later in the paper.

Lines 206-207: a ‘.’ is missing.
Reanalysis –> re-analysis
Line 367: “performance.. .”

Experimental design

[Major] I understand that the work builds on a dataset where SATD is annotated with labels capturing dimensions relevant from a software engineering perspective (e.g. functional …); however, the experiments and evaluation developed here operate on a flattened classification of sentiments (Non-Negative, Negative, Neutral, Mixed, and NoAggression …, some of which are then removed).
I miss a sufficient motivation for this reduction, and a sufficient argument for how the result remains somehow “actionable” in a software engineering process.

[Major] Among the baseline approaches against which the proposed technique is compared, I miss something that uses similar ingredients: the usage of LLMs for the classification of SATD is addressed in a significant body of literature.
I am not specifically focused on this subject, but even after a limited search, my impression is that the discussion of related works could/should address various other works that are closer to this contribution than the currently mentioned references.

GPT-3.5-turbo is chosen instead of GPT-4, with the motivation that “GPT-3.5-turbo exhibits greater stability and reliability in multiple scenarios [25]”. My understanding of [25] is that the choice between them is not clearly settled. Working with GPT-3.5-turbo may well be the best choice, but more arguments should be given for this step.

I am not convinced that Q1 and Q2 are actually different questions; in the end, they are addressed by the same methodology. (Yet, this is not a crucial point.)

In the description of Materials and methods, what is the hardware platform relevant to? The work does not report complexity or performance results, so what is the utility of this information?

Validity of the findings

The discussion of results is not clear and, more importantly, it lacks real observations providing at least an intuition about how to interpret the results.

I miss a clear definition of the intended meaning of the colours that the Scott-Knott test associates with the different techniques. What order of quality is associated with the colours? (Looking at the plots and reading along many lines, this eventually emerges, but, for ease of reading, it should be made explicit from the beginning.)
I miss some discussion of why this particular clustering emerged. What weight is given to outliers when identifying clusters and ordering them? What is a practical consequence of this clustering (i.e., is there a sharp conclusion that can be drawn from the fact that some technique is in the red or the green cluster)?
Some techniques (but only a few) have different colours for different comparison measures (e.g., KNN is pale blue in one plot and blue in another). Does this mean that the clustering was repeated for each quality measure? Is there an aggregated plot that provides a final order according to some combination of measures?
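To make the question about per-measure clustering concrete: the following is a rough, hypothetical sketch (not the authors' code; a Welch t-test stands in for the likelihood-ratio statistic of the original Scott-Knott procedure) of how such a ranking is typically computed once per quality measure. Running it separately for each measure would naturally explain why a technique can receive different colours in different plots.

```python
# Hypothetical illustration of a Scott-Knott-style ranking, computed once per metric.
# Not the authors' code; the split-significance check uses a Welch t-test as a
# simplified stand-in for the likelihood-ratio statistic of the original test.
import numpy as np
from scipy import stats

def scott_knott_like(scores: dict, alpha: float = 0.05) -> list:
    """Group techniques (name -> array of per-run scores) into ordered clusters."""
    ordered = sorted(scores, key=lambda name: scores[name].mean(), reverse=True)

    def split(names):
        if len(names) < 2:
            return [names]
        means = np.array([scores[n].mean() for n in names])
        grand = means.mean()
        # choose the cut that maximises the between-group sum of squares
        best_cut, best_bss = 1, -1.0
        for cut in range(1, len(names)):
            left, right = means[:cut], means[cut:]
            bss = len(left) * (left.mean() - grand) ** 2 + len(right) * (right.mean() - grand) ** 2
            if bss > best_bss:
                best_cut, best_bss = cut, bss
        left_obs = np.concatenate([scores[n] for n in names[:best_cut]])
        right_obs = np.concatenate([scores[n] for n in names[best_cut:]])
        _, p_value = stats.ttest_ind(left_obs, right_obs, equal_var=False)
        if p_value >= alpha:  # split not significant: keep a single cluster
            return [names]
        return split(names[:best_cut]) + split(names[best_cut:])

    return split(ordered)

# One ranking per measure, e.g.:
# clusters_f1 = scott_knott_like({"KNN": np.array(knn_f1_runs), "GPT-3.5-ft": np.array(gpt_f1_runs)})
```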

[Major] The discussion of results is just a verbalization of the box plots in Figs. 2-4 and Tables 2-4 (and then Figs. 5-7 and Tables 5-7). Could any interpretation, explanation, or comment be added? Without this, I do not see how the textual discussion adds value beyond the plots.

[Major] I miss a real discussion of threats to validity. The final Conclusions section briefly mentions two possible threats, but this is very limited, both in the depth of discussion about consequences and mitigations and in the range of threats considered.

Reviewer 2 ·

Basic reporting

The paper presents an overview of publications related to Self-Admitted Technical Debt (SATD) problems and points out some of their drawbacks. The main goal was to propose a generative-model framework for SATD sentiment prediction and an experimental analysis of the performance difference between the GPT model and other machine learning and deep learning methods. This is a well-organized paper with a clearly presented experiment setup and data-processing issues. Nevertheless, there are some drawbacks that need mitigation.
Some comments on the text:
  • Line 68 – to which categories does this sentence refer?
  • Lines 160+ – the notions “no aggression” and “no agreement” are not clear and need explanation.
  • Lines 165-166 – the user and assistant roles need explanation (a sketch of what is presumably meant follows below).
  • Line 207 – missing dot after [13]; line 208 – missing dot and space after “al”.
  • Line 213 – describe these 6 categories and 32 subcategories explicitly. Lines 213+ – give some illustrative examples from the dataset related to the negative and mixed categories. Similarly, Tab. 1 needs better comments and examples on the excluded and no-agreement categories.
  • Line 368 – explain your statement “significant higher” (provide more justification for this).
  • More substantive comments on the included figures are needed (the descriptions need more detail).
  • What framework did you use to derive the results, what was the time overhead of the calculations, etc.?
  • Line 509 – outline the contents of the materials at this link.
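For context on the remark about user and assistant roles: in chat-format fine-tuning (as used for gpt-3.5-turbo), each training example is a short conversation in which the “user” message carries the input and the “assistant” message carries the expected answer. A minimal sketch follows, with an invented example comment and label that are not taken from the paper's dataset:

```python
# Hypothetical illustration of one chat-format fine-tuning record (JSONL line);
# the comment text and the label name are invented for illustration only.
import json

def to_finetune_record(satd_comment: str, sentiment_label: str) -> str:
    """Build one JSONL line in the chat format used to fine-tune chat models."""
    record = {
        "messages": [
            # the system message states the task
            {"role": "system", "content": "Classify the sentiment of the SATD comment."},
            # the user message carries the input comment
            {"role": "user", "content": satd_comment},
            # the assistant message carries the expected (gold) label
            {"role": "assistant", "content": sentiment_label},
        ]
    }
    return json.dumps(record)

print(to_finetune_record("TODO: this workaround is ugly, FIXME later", "Negative"))
```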
The practical significance and usage of your approach were not sufficiently explained; maybe some illustrative examples could be added. A better connection with software engineering aspects could also be made, e.g. the impact of the classified sentiment categories on project development. Moreover, it is worth commenting on your analysis in relation to issue-tracking repository contents and issue-handling processes; you can refer to relevant publications (which also discuss technical debt), e.g.: Albuquerque, D.; Guimarães, E.; Tonin, G.; Rodríguez, P.; Perkusich, M.; Almeida, H.; Perkusich, A.; Chagas, F. Managing technical debt using intelligent techniques—A systematic mapping study. IEEE Trans. Softw. Eng. 2023, 49, 2202–2220.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.

Experimental design

The experiment is quite interesting; however, some improvements in presentation are needed (compare point 1).

Validity of the findings

-


All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.