All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
The reviewers confirmed that all their concerns have been addressed in the revision. The manuscript is ready for publication.
[# PeerJ Staff Note - this decision was reviewed and approved by Xiangjie Kong, a PeerJ Section Editor covering this Section #]
I am satisfied with the revision; I have no further comments to be addressed.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
 
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
**Language Note:** When you prepare your next revision, please either (i) have a colleague who is proficient in English and familiar with the subject matter review your manuscript, or (ii) contact a professional editing service to review your manuscript. PeerJ can provide language editing services - you can contact us at [email protected] for pricing (be sure to provide your manuscript number and title). – PeerJ Staff
1.	Expand the Architecture Diagram with detailed annotations for each agent and show the data flow (input and output) for each phase. Figure 1 represents a high-level view of the pipeline process but lacks details. 
2.	Add a table that summarizes the nine refactoring techniques used.
3.	Source code and datasets are publicly available, but the dataset construction requires more details of 1) the source of these code samples (they seem to be manually authored and not extracted from open-source projects), 2) the complexity and size, and 3) the preprocessing steps.
4.	Adding a running example with simple Python code can be useful to show how each agent works.
5.	The introduction briefly mentions some previous literature; however, a comparison with existing approaches is missing. You need to add a related work section to differentiate this framework from the rule-based refactoring tools and LLM-based studies, such as the mentioned studies (Choi et al., Alon et al., and Xiue et al.).
6.	There are duplicate section headings for “Materials and Methods” (Sections 2 and 3). These should be merged or renamed for clarity.
1.	The experiment uses only one language, which limits the findings’ generalizability.
2.	There is no comparison with baseline studies or tools!
Tables 2-5 show the computational and structural impacts. However, this is the weakest part of the paper. 
1.	What is the evaluation objective? Provide justifications for the selected evaluation metrics. Token Count was used as a proxy for code density and cognitive complexity. What is the source of such a claim? 
2.	Behavior preservation is not evaluated.
3.	You should consider other evaluation metrics such as coupling, cohesion, complexity, and other internal metrics commonly used to assess refactoring improvements in previous studies. Code correctness is more important to the developers than execution time!
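The distinction drawn in points 1 to 3 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the manuscript: the two code snippets, the helper names (`token_count`, `approx_cyclomatic`, `behavior_preserved`), and the simplified decision-point count used to approximate cyclomatic complexity. It uses only the Python standard library.

```python
import ast
import io
import tokenize

# Hypothetical "before" and "after" snippets (not from the manuscript).
ORIGINAL = '''
def total_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total = total + v
    return total
'''

REFACTORED = '''
def total_positive(values):
    return sum(v for v in values if v > 0)
'''

# Layout-only token types that should not count toward code size.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def token_count(source):
    """Count lexical tokens, ignoring comments and layout tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in _SKIP)


def approx_cyclomatic(source):
    """Approximate cyclomatic complexity as 1 + number of decision points.

    Simplified assumption: counts if/for/while/except, conditional
    expressions, comprehension loops and their filters, and extra
    operands of boolean operators.
    """
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.ExceptHandler, ast.IfExp)):
            decisions += 1
        elif isinstance(node, ast.comprehension):
            decisions += 1 + len(node.ifs)
        elif isinstance(node, ast.BoolOp):
            decisions += len(node.values) - 1
    return decisions + 1


def load_function(source, name):
    """Execute a trusted snippet and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


def behavior_preserved(original_src, refactored_src, name, inputs):
    """Check that both versions return equal results on the same inputs."""
    before = load_function(original_src, name)
    after = load_function(refactored_src, name)
    return all(before(x) == after(x) for x in inputs)


if __name__ == "__main__":
    samples = [[], [1, -2, 3], [-1, -5], [0, 10]]
    print("tokens:", token_count(ORIGINAL), "->", token_count(REFACTORED))
    print("complexity:", approx_cyclomatic(ORIGINAL), "->",
          approx_cyclomatic(REFACTORED))
    print("behavior preserved:",
          behavior_preserved(ORIGINAL, REFACTORED, "total_positive", samples))
```

In this example the token count drops after refactoring while the approximate complexity stays the same, which is why token count alone is a weak proxy for complexity. The input-output comparison is only a sampled check and does not prove equivalence; a full evaluation would rely on the project's test suite.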
Below are the major changes that need to be made to the paper; other suggested changes are mentioned in the comments above. 
1.	Expand the Architecture Diagram with detailed input and output.
2.	Explain the dataset creation process.
3.	Increase the number of code examples by including open-source projects.
4.	Extend the experiments to include one other language, such as Java.
5.	Include quantitative comparisons with baseline studies or tools.
6.	Include other metrics in the evaluation, such as coupling, cohesion, and complexity, that have been used in studies with similar objectives. 
7.	State the limitations of the proposed approach.
The paper is straightforward and generally easy to read. It provides the appropriate background to cover the different topics discussed. The motivation for the paper is unclear but can be inferred. The problem being addressed is current, given how developers heavily rely on large language models for various software engineering and programming tasks.
I find the limitations of existing studies, as described in the introduction, to be inaccurate and to overstate the significance of the current study, which neither offers solutions to these limitations nor explains how its approach addresses them. For example, the claim that existing solutions support only limited refactoring scenarios should be accompanied by an explanation of why these restrictions are insufficient for developers' needs. As the authors know, providing highly focused refactorings that are appropriate and correct, such as the one in study [3], is more developer-friendly because it helps developers easily understand the impact of the refactoring, apply it within the IDE, and ensure the correctness of the change. Additionally, the paper mentions limited programming-language support as another limitation, but then offers a solution tested only in Java. Although the proposed approach could be easily adapted to support other languages, the paper evaluates it in only one language.
The paper surprisingly ignores studies that either used ChatGPT directly [1,2] or indirectly [3] for refactoring.
I find the proposed 27 scenarios interesting, but it would be helpful to show how the authors arrived at them and how they assessed their comprehensiveness. How did the authors develop their rationale? Is it based on existing studies? Does it rely on previous bug reports? Or is this purely the authors' own effort, through some form of mixed methods? Explaining the details of how these scenarios were derived would help the reader better understand their advantages and potential limitations. For example, it is not clear how the extract method scenarios account for extracting either one code fragment or multiple code fragments within the context of code clones. Furthermore, if we are extracting clones that belong to multiple classes, how would this approach work? It appears to support only one source class and one target class.
My main concern with the paper is that the experiments do not adequately reflect the extensive effort put into providing a comprehensive list of scenarios, as they only present the results in a black-box manner. I wanted to see a breakdown of the test cases related to the scenarios so that readers could understand which scenarios succeed or fail with respect to ChatGPT and agents. Surprisingly, it only provides the numbers without delving into the details or mechanics of each scenario. The paper would greatly benefit from "white boxing" that process and providing clear examples of compiler failures and incorrect refactorings to strengthen the discussion.
Finally, the paper claims the scenarios to be part of its contribution, which I believe is the strongest part of the paper, but it then fails to provide those scenarios for replication and extension. If the paper wants to claim such effort as a contribution, it needs to provide them after detailing how they were designed, as I mentioned in my previous comment.
[1] Liu, Yue, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. "Refining chatgpt-generated code: Characterizing and mitigating code quality issues." ACM Transactions on Software Engineering and Methodology 33, no. 5 (2024): 1-26.
[2] DePalma, Kayla, Izabel Miminoshvili, Chiara Henselder, Kate Moss, and Eman Abdullah AlOmar. "Exploring ChatGPT’s code refactoring capabilities: An empirical study." Expert Systems with Applications 249 (2024): 123602.
[3] Pomian, Dorin, Abhiram Bellur, Malinda Dilhara, Zarina Kurbatova, Egor Bogomolov, Timofey Bryksin, and Danny Dig. "Next-generation refactoring: Combining llm insights and ide capabilities for extract method." In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 275-287. IEEE, 2024.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.