Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.

Summary

  • The initial submission of this article was received on June 18th, 2025 and was peer-reviewed by 2 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on July 23rd, 2025.
  • The first revision was submitted on October 1st, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • The article was Accepted by the Academic Editor on November 26th, 2025.

Version 0.2 (accepted)

Academic Editor

Accept

I can confirm the authors have addressed all the reviewers' comments.

This manuscript is now ready for publication.

[# PeerJ Staff Note - this decision was reviewed and approved by Shawn Gomez, a PeerJ Section Editor covering this Section #]

Reviewer

Basic reporting

The author improved the paper based on the suggested comments, and the new version is well structured. However, future versions of this work would benefit from using more samples.

Experimental design

'no comment'

Validity of the findings

'no comment'

Version 0.1 (original submission)

Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

Reviewer 1

Basic reporting

The paper is written in clear, professional English. The introduction clearly identifies the goals of grammar error correction and error-correction explanation, and it clarifies the motivation for prompting and fine-tuning Arabic-specific LLMs. The Materials & Methods section clearly lays out the hardware, software, model choices, datasets, preprocessing steps, and performance measurement. The paper follows standard formats and PeerJ’s guidelines, which keeps it clear and reproducible.
The paper explores the capabilities of GPT-4o, Gemini, Llama3, and the Arabic-specific ALLaM model in performing Arabic grammar error correction and generating explanatory feedback for those corrections. The author trains and evaluates them on two distinct datasets using zero-shot prompting, few-shot prompting, and fine-tuning, and compares their outputs with metrics such as BLEU, GLEU, and ROUGE. Results show that fine-tuned GPT-4o performs well across almost every metric, and that both few-shot prompting and fine-tuning yield substantial performance improvements. Overall, the paper shows that, with the right prompts and fine-tuning, these leading LLMs can correct Arabic grammar quickly and accurately while giving clear explanations.
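
To make the metric comparison above concrete, here is a minimal sketch of sentence-level BLEU, GLEU, and ROUGE scoring, assuming the sacrebleu, nltk, and rouge_score packages and toy hypothesis/reference strings; it is not the evaluation pipeline actually used in the paper.

```python
# Minimal sketch of BLEU / GLEU / ROUGE scoring for a corrected sentence.
# Toy English strings are used because rouge_score's default tokenizer is
# English-oriented; Arabic evaluation would need an Arabic-aware tokenizer.
import sacrebleu
from nltk.translate.gleu_score import sentence_gleu
from rouge_score import rouge_scorer

hypothesis = "the students went to school yesterday"      # model correction
reference = "the students went to the school yesterday"   # human reference

# Corpus-level BLEU: sacrebleu takes a list of hypotheses and a list of
# reference lists.
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])

# Sentence-level GLEU from NLTK, computed over token lists.
gleu = sentence_gleu([reference.split()], hypothesis.split())

# ROUGE-1 and ROUGE-L F-scores.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"])
rouge = scorer.score(reference, hypothesis)

print(f"BLEU:    {bleu.score:.2f}")
print(f"GLEU:    {gleu:.2f}")
print(f"ROUGE-1: {rouge['rouge1'].fmeasure:.2f}")
print(f"ROUGE-L: {rouge['rougeL'].fmeasure:.2f}")
```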

Experimental design

The experimental design tests four LLMs (GPT-4o, Gemini, Llama3, and ALLaM) and gives clear details about the hardware, software, and datasets used. The chosen models are very recent. The steps for data preprocessing, model selection, and evaluation are fully explained, and the models are scored with measures such as ROUGE, cosine similarity, and CLEME.
Though the experiment is quite thorough, there are only about 300 examples for fine-tuning and 2,000 samples for evaluation, which may make the results less stable and harder to generalize. Key fine-tuning settings such as the learning rate, batch size, and number of training epochs are not shared, so readers cannot fully reproduce the work. And even though many automatic metrics are used, there is no human evaluation of the outputs, so it is unclear how useful the explanations really are. Overall, there is still room for improvement.
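
To make the reproducibility point concrete, the sketch below shows the kind of fine-tuning configuration a reader would need to see, assuming OpenAI's fine-tuning API is used for the GPT-4o runs; the model snapshot, file name, and hyperparameter values are illustrative placeholders, not the settings used in the paper.

```python
# Hypothetical sketch of a reproducible fine-tuning setup via the OpenAI API.
# Every value below is a placeholder; the paper does not disclose its settings.
from openai import OpenAI

client = OpenAI()

# Training data uploaded as a JSONL file of chat-format examples.
training_file = client.files.create(
    file=open("arabic_gec_train.jsonl", "rb"),  # placeholder file name
    purpose="fine-tune",
)

# The settings the review asks to be reported: epochs, batch size, and
# learning-rate multiplier.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",          # placeholder base model snapshot
    training_file=training_file.id,
    hyperparameters={
        "n_epochs": 3,                  # placeholder
        "batch_size": 8,                # placeholder
        "learning_rate_multiplier": 2,  # placeholder
    },
)
print(job.id, job.status)
```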

Validity of the findings

The experiments and evaluations in the paper are rigorous, and the conclusions are firmly supported by the data. The research goals of Arabic grammar error correction and explanation generation are validated by the experiments, and the experimental design, together with the automated evaluation methods, sufficiently supports the paper’s main arguments. However, limitations such as the small sample size, undisclosed fine-tuning hyperparameter settings, and the lack of human evaluation may lead readers to question the reliability of the paper’s conclusions.

Additional comments

The authors may add comparisons with existing work to show the strengths and limits of the method. They could also examine how different prompt designs affect the results, use more training data, and discuss ethical issues such as bias in the data and the accuracy of the generated explanations, which would help readers judge the risks of using the model in real-life settings.

Reviewer 2

Basic reporting

In this study, the authors focus on evaluating different LLMs for Arabic grammar error correction. The topic and the chosen models are interesting; however, I would like to raise the following points:

- What kind of fine-tuning method did you use to train the models on your dataset? Please explain.

- Did you have any limitations regarding the number of tokens fed into the LLM? If so, please provide details.

- For evaluating the models with LLM metrics, how did you prepare or collect the reference sentences? Please highlight this process.

Experimental design

Please provide a detailed explanation of the fine-tuning approach, including hyperparameters, dataset preparation, and training procedure.
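
As an illustration of the dataset-preparation detail being requested, here is a minimal sketch of one chat-format training example, assuming an OpenAI-style JSONL fine-tuning file; the prompt wording, example sentence, and file name are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: one fine-tuning example in OpenAI's chat JSONL format,
# pairing an erroneous Arabic sentence with its correction and a short
# explanation. All wording is illustrative, not the paper's actual data.
import json

example = {
    "messages": [
        {
            "role": "system",
            "content": "Correct the Arabic sentence and explain the correction.",
        },
        {"role": "user", "content": "ذهبت الطالب إلى المدرسة"},
        {
            "role": "assistant",
            "content": "ذهب الطالب إلى المدرسة. Explanation: the verb must "
                       "agree with the masculine subject, so ذهبت becomes ذهب.",
        },
    ]
}

with open("arabic_gec_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```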

Validity of the findings

'no comment'

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.