The reviewers confirmed that all their concerns have been addressed in the revision. The manuscript is ready for publication.
[# PeerJ Staff Note - this decision was reviewed and approved by Xiangjie Kong, a PeerJ Section Editor covering this Section #]
I am satisfied with the revision; I have no further comments to be addressed.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
**Language Note:** When you prepare your next revision, please either (i) have a colleague who is proficient in English and familiar with the subject matter review your manuscript, or (ii) contact a professional editing service to review your manuscript. PeerJ can provide language editing services - you can contact us at [email protected] for pricing (be sure to provide your manuscript number and title). – PeerJ Staff
1. Expand the architecture diagram with detailed annotations for each agent, and show the data flow (input and output) for each phase. Figure 1 gives only a high-level view of the pipeline and lacks detail.
2. Add a table that summarizes the nine refactoring techniques used.
3. The source code and datasets are publicly available, but the dataset construction needs more detail on 1) the source of the code samples (they appear to be manually authored rather than extracted from open-source projects), 2) their complexity and size, and 3) the preprocessing steps.
4. Add a running example with simple Python code to show how each agent works (a minimal sketch of what such an example could look like follows this list).
5. The introduction briefly mentions some previous literature, but a comparison with existing approaches is missing. Add a related work section that differentiates this framework from rule-based refactoring tools and from the LLM-based studies already cited (Choi et al., Alon et al., and Xiue et al.).
6. There are duplicate section headings for “Materials and Methods” (Sections 2 and 3). These should be merged or renamed for clarity.
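To make point 4 concrete, the sketch below shows the kind of minimal Python running example that could accompany the agent walkthrough; the functions and the Extract Method transformation are invented for illustration and are not taken from the manuscript.

```python
# Hypothetical "before/after" pair for a running example; all names are invented.

# Before refactoring: the member-discount logic is written inline.
def checkout_before(prices, member):
    total = sum(prices)
    if member:
        total = total - total * 0.10  # member discount applied inline
    return round(total, 2)

# After refactoring: the discount fragment is extracted into a helper, giving
# each agent in the pipeline a concrete artifact to detect, transform, and verify.
def apply_member_discount(total, member, rate=0.10):
    """Extracted helper: applies the discount only for members."""
    return total - total * rate if member else total

def checkout_after(prices, member):
    total = apply_member_discount(sum(prices), member)
    return round(total, 2)

# Behavior is preserved for a simple input.
assert checkout_before([10.0, 5.5], True) == checkout_after([10.0, 5.5], True)
```

Walking such a snippet through every phase (input, agent decision, output) would make the pipeline far easier to follow.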
1. The experiment uses only one language, which limits the findings’ generalizability.
2. There is no comparison with baseline studies or tools!
Tables 2-5 show the computational and structural impacts. However, this is the weakest part of the paper.
1. What is the evaluation objective? Provide justifications for the selected evaluation metrics. Token Count was used as a proxy for code density and cognitive complexity. What is the source of such a claim?
2. Behavior preservation is not evaluated.
3. You should consider other evaluation metrics, such as coupling, cohesion, complexity, and other internal metrics commonly used to assess refactoring improvements in previous studies (a sketch of how such metrics could be computed follows this list). Code correctness is more important to developers than execution time!
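In connection with points 1 and 3, the following is a minimal sketch, assuming a Python code base, of how two internal metrics (an approximate cyclomatic complexity and a token count) could be computed with the standard library; it illustrates the kind of measurement I have in mind and is not the authors' evaluation code.

```python
# Minimal, illustrative metric computation for a Python snippet (standard library only).
import ast
import io
import tokenize

# Node types counted as branches; a rough approximation of McCabe complexity.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 plus the number of branching nodes."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

def token_count(source: str) -> int:
    """Count lexical tokens, excluding pure layout tokens."""
    layout = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
              tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in layout)

snippet = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
print(cyclomatic_complexity(snippet), token_count(snippet))  # prints both metrics for the sample
```

Reporting such metrics before and after refactoring, alongside a behavior-preservation check (e.g., rerunning unit tests), would make the evaluation far more convincing than token count alone.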
Below are the major changes that need to be made to the paper; other suggested changes are mentioned in the comments above.
1. Expand the Architecture Diagram with detailed input and output.
2. Explain the dataset creation process.
3. Increase the number of code examples by including open-source projects.
4. Extend the experiments to include one other language, such as Java.
5. Include quantitative comparisons with baseline studies or tools.
6. Include other metrics in the evaluation (e.g., coupling, cohesion, and complexity) that have been used in studies with similar objectives.
7. State the limitations of the proposed approach.
The paper is straightforward and generally easy to read. It provides the appropriate background to cover the different topics discussed. The motivation for the paper is unclear but can be inferred. The problem being addressed is current, given how developers heavily rely on large language models for various software engineering and programming tasks.
I find the limitations of existing studies listed in the introduction to be inaccurate, and they overstate the significance of the current study, since the paper neither offers solutions to these limitations nor explains how the proposed approach addresses them. For example, the claim that existing solutions support only limited refactoring scenarios needs to be accompanied by an explanation of why those restrictions are insufficient for developers' needs. As the authors know, providing highly focused refactorings that are appropriate and correct, such as the one in study [3], is more developer-friendly because it helps developers easily understand the impact of the refactoring, apply it within the IDE, and ensure the correctness of the change. Additionally, the paper cites limited programming-language support as another shortcoming of prior work, yet it offers a solution tested only in Java. Although the proposed approach could easily be adapted to support other languages, the paper evaluates it in just one.
The paper surprisingly ignores studies that either used ChatGPT directly [1,2] or indirectly [3] for refactoring.
I find the proposed 27 scenarios interesting, but it would be helpful to show how their comprehensiveness was established. How did the authors develop their rationale? Is it based on existing studies? Does it rely on previous bug reports? Or is it purely the authors' own effort through some form of mixed methods? Explaining how these scenarios were derived would help the reader better understand their advantages and potential limitations. For example, it is not clear how the extract-method scenarios account for extracting either one code fragment or multiple code fragments in the context of code clones. Furthermore, if we are extracting clones that belong to multiple classes, how would this approach work? As far as I can tell, it supports only one source class and one target class.
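To illustrate the clone case I have in mind, here is a hypothetical Python sketch (all class and function names are invented, not taken from the paper) in which the same fragment is duplicated across two classes; covering it requires extracting one fragment from multiple host classes rather than a single source/target pair.

```python
# Before: the same validation fragment (a clone) appears in two classes.
class Invoice:
    def total(self, items):
        cleaned = [i for i in items if i is not None and i >= 0]  # clone
        return sum(cleaned)

class Quote:
    def estimate(self, items):
        cleaned = [i for i in items if i is not None and i >= 0]  # clone
        return sum(cleaned) * 1.2

# After: the clone is extracted once into a shared helper that both classes
# call, i.e., one extracted fragment but two source classes.
def keep_valid_amounts(items):
    return [i for i in items if i is not None and i >= 0]

class InvoiceRefactored:
    def total(self, items):
        return sum(keep_valid_amounts(items))

class QuoteRefactored:
    def estimate(self, items):
        return sum(keep_valid_amounts(items)) * 1.2
```

Clarifying whether the scenarios cover this kind of multi-class extraction would resolve my concern.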
My main concern with the paper is that the experiments do not adequately reflect the extensive effort put into providing a comprehensive list of scenarios, as they only present the results in a black-box manner. I wanted to see a breakdown of the test cases related to the scenarios so that readers could understand which scenarios succeed or fail with respect to ChatGPT and agents. Surprisingly, it only provides the numbers without delving into the details or mechanics of each scenario. The paper would greatly benefit from "white boxing" that process and providing clear examples of compiler failures and incorrect refactorings to strengthen the discussion.
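As one example of what such "white boxing" could include, the hypothetical Python sketch below (not drawn from the paper) shows an incorrect refactoring that runs without errors yet changes behavior, because the extracted helper silently drops a side effect.

```python
log = []  # audit trail the refactoring is supposed to preserve

# Original: records the amount before computing the fee.
def charge_original(amount):
    log.append(amount)        # side effect that must survive the refactoring
    return amount * 1.05

# Incorrectly refactored: the helper reproduces the arithmetic but loses the
# logging side effect, so any test inspecting `log` now fails.
def add_fee(amount):
    return amount * 1.05

def charge_refactored(amount):
    return add_fee(amount)

charge_original(100)
charge_refactored(100)
assert len(log) == 1  # behavior changed: only the original call was logged
```

Pairing each scenario with this level of detail, plus concrete compiler-failure cases, would turn the aggregate numbers into actionable findings.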
Finally, the paper claims the scenarios to be part of its contribution, which I believe is the strongest part of the paper, but it then fails to provide those scenarios for replication and extension. If the paper wants to claim such effort as a contribution, it needs to provide them after detailing how they were designed, as I mentioned in my previous comment.
[1] Liu, Yue, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. "Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues." ACM Transactions on Software Engineering and Methodology 33, no. 5 (2024): 1-26.
[2] DePalma, Kayla, Izabel Miminoshvili, Chiara Henselder, Kate Moss, and Eman Abdullah AlOmar. "Exploring ChatGPT’s Code Refactoring Capabilities: An Empirical Study." Expert Systems with Applications 249 (2024): 123602.
[3] Pomian, Dorin, Abhiram Bellur, Malinda Dilhara, Zarina Kurbatova, Egor Bogomolov, Timofey Bryksin, and Danny Dig. "Next-Generation Refactoring: Combining LLM Insights and IDE Capabilities for Extract Method." In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 275-287. IEEE, 2024.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.