All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Reviewer 2 considered that his/her comments had been addressed.
[# PeerJ Staff Note - this decision was reviewed and approved by Xiangjie Kong, a PeerJ Section Editor covering this Section #]
no comment
no comment
no comment
Reviewer comments addressed.
Please respond to the reviewers' comments, especially those from Reviewer 2.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
This paper compares the skill of ChatGPT and GPT-4 in answering GAT exam questions in Arabic and English. The idea is to provide a benchmark dataset that can be used to evaluate the models on the same task formulated in different languages. Samples of GAT exam questions in Arabic and English were fed to the two models, and performance was reported and discussed.
The paper is well-written and easy to follow.
The experiments in this paper are explained nicely.
The findings are valid and reasonable and align with the findings of the research community working on the specific topic addressed by this research.
I have the following remarks for the authors:
1. Algorithm 1: it is not clear how the correct answer returned by ChatGPT is extracted. Simply checking whether the choice (c) belongs to the tokens of the question is not sufficient. I suggest you revisit that algorithm.
2. 456 questions in English and 468 in Arabic is too small to be called a benchmark for testing language models. Also, the paper did not describe the details and nature of the crowdsourcing used to generate this dataset.
3. There are 14 references taken from arXiv. Kindly check whether these have been published elsewhere and update the references accordingly. Papers on arXiv are not peer reviewed.
4. References 44, 45, and 52 lack information. Please provide complete bibliographic details for these references.
5. Paragraph 4 of the Introduction section: add references for GAT, NCA, and KFUPM.
6. Introduction section: add the organization of the paper at the end of this section.
7. Line 200, page 4: “with four possible answers (A, B, C, D)”. These are not possible answers; rather, they are the given choices, and only one of them is correct. This sentence needs rephrasing.
8. Equation 1: I believe you should use MAX rather than MIN, because you are looking for the most similar pairs.
9. Page 8, line 272: the text on this line appears to be cut off.
The paper presents a new benchmark dataset for multilingual large language models in Arabic and English. The dataset was collected from the verbal section of GAT exams. The paper primarily focuses on detailing the data collection process, with no substantial technical contributions. The paper also has various limitations, which are outlined as follows.
• The introduction lacks a clear structure: it should introduce the problem, summarize prior work in the field, and identify the gap in the research area.
• In the introduction, the author stated that one of the contributions of the paper is to propose a new way of evaluating LLMs in Arabic and English. It is unclear how the author introduced a novel approach to evaluating LLMs. While data collection is crucial, it does not necessarily imply that the author has proposed a new method for assessing LLMs.
• The background section appears overly extensive and may not be necessary, since it mainly describes other proposed approaches (transformer-based methods); it is unclear how these methods are relevant to this paper.
• The author did not provide enough detail on the methodology or the results. For example, the approach section, which is supposed to provide details of the methodology, contains no description of how fastText was implemented.
• Equation 1 should use argmax, since the function should return the part with maximum similarity.
• The author stated that no previous dataset existed for Arabic LLMs. However, as demonstrated in the following study, a dataset has been created for this specific purpose.
Ali, Abbas Raza, et al. "A Large and Diverse Arabic Corpus for Language Modeling." arXiv preprint arXiv:2201.09227 (2022).
• The findings are presented but not discussed in terms of new insights gained in comparison to previous studies.
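Since both reviewers note that Equation 1 should select the pair with *maximum* similarity (max/argmax, not min), the toy sketch below illustrates the intended operation. The embeddings are placeholder vectors; in the paper they would presumably come from fastText, whose implementation details the authors should specify. Function names are the reviewer's assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_choice(question_vec, choice_vecs):
    """Return the key of the choice with MAXIMUM similarity to the question
    (argmax over cosine similarity, as Eq. 1 should specify, not argmin)."""
    return max(choice_vecs, key=lambda k: cosine(question_vec, choice_vecs[k]))
```

With argmin instead of argmax, the function would return the *least* similar choice, which contradicts the stated goal of finding the most similar pairs.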
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.