Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on August 12th, 2024 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on October 29th, 2024.
  • The first revision was submitted on December 26th, 2024 and was reviewed by 3 reviewers and the Academic Editor.
  • A further revision was submitted on March 27th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • A further revision was submitted on September 25th, 2025 and was reviewed by the Academic Editor.
  • The article was Accepted by the Academic Editor on October 6th, 2025.

Version 0.4 (accepted)

Academic Editor

Accept

Thank you to the authors for their efforts to improve the work. I believe this version has been revised well and is ready for acceptance.

[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]

Version 0.3

Academic Editor

Minor Revisions

Thanks to the authors for their efforts to improve the article. Most issues have been solved. Please continue to revise the paper according to the comments.

·

Basic reporting

The manuscript is generally well written and well structured, and the topic is relevant and timely given the increasing interest in Arabic NLP and question answering systems. The Introduction provides sufficient background and situates the work in the context of recent AQAS developments, including datasets and model architectures. However, some sections are repetitive and should be reviewed and tightened. Terminology should also be consistent; for example, please avoid mixing “AQA” and “AQAS”. The figures are minimal and somewhat generic (Figure 1, for example); please consider explaining each figure in more detail. The references are appropriate, but their formatting is inconsistent between conference and journal styles.

Experimental design

The work fits the journal’s scope and article type. The authors clearly position it as a systematic literature review (SLR). Inclusion criteria and database selection are explained well, and coverage appears broad. However, the authors could consider including a PRISMA-style flow diagram summarizing study selection. They could also add a consolidated table comparing model performance (accuracy and F1-score) across studies to allow easier cross-paper comparison.

Validity of the findings

The conclusions are consistent with the evidence and address the main research questions outlined in the Introduction. The study successfully identifies unresolved issues, including the lack of diverse annotated datasets, limited hybrid-model exploration, and computational challenges, and it offers reasonable recommendations. However, some areas for improvement are suggested below.

To strengthen the validity of the findings, the authors might consider prioritizing and quantifying findings where possible. Furthermore, please reduce redundancy between the Discussion and the RQ-based Results.

Additional comments

The revisions needed are structural. Please consider removing redundancy and improving visuals.

Reviewer 5

Basic reporting

The prose is generally clear, but scope and presentation drift across sections. Please state a single corpus description and a firm literature cutoff (e.g., “12 core studies + 60+ contextual references; search updated to May 2025”) and use exactly that wording in the Abstract, Introduction, Methods, and captions. Several tables appear garbled (e.g., Table 2) and a few figures are referenced without interpretation in the text. Re-export the affected tables, and add a one-sentence takeaway in each caption so readers understand why the item matters at that point in the narrative. Claims about language demographics and UN recognition should include inline authoritative citations.

Experimental design

The paper reads as a narrative review with “systematic” intent, but it does not yet meet reproducibility standards for identification/selection of studies. The bias-mitigation principles are good, yet not operationalized. Suggested revisions:
1. Make the pipeline reproducible: Add a concise Methods subsection listing databases, core queries (keywords/Boolean), date limits, language/type restrictions, and gray-literature handling; include a brief PRISMA-style flow (counts at each stage).
2. Standardize data extraction: Provide a compact table for the 12 core studies (task, dataset/dialect status, model family, metrics used, headline finding).

Validity of the findings

Conclusions are reasonable and aligned with the stated goals, but the claim-to-evidence chain needs to be more explicit and the limitations should be surfaced so readers can judge the reliability of the takeaways. The manuscript does a good job identifying gaps (dialectal coverage, metric alignment, data diversity) and sketching future directions; however, some conclusions generalize beyond what heterogeneous datasets and metrics permit, and readers must infer which specific studies support which statements. A concise fix is to insert a small “RQ–Finding–Evidence” summary (even as a paragraph or compact table) that ties each main conclusion to study IDs/tables/figures, and to add a brief Limitations paragraph acknowledging selection constraints, cross-study incomparability of metrics/tasks, and the chosen recency window.

Version 0.2

Academic Editor

Major Revisions

Although two reviewers are satisfied with the revisions, one reviewer has raised valid points. I urge the authors to consider the points raised by Reviewer 2 and address them carefully in the revision.

·

Basic reporting

# Unambiguous, professional English is used throughout.
Comments: The manuscript essentially adopts technical language suitable for its targeted audience. Nonetheless, there are instances of long-winded sentences that may hamper clarity, for example in lines 65-73. Breaking these into shorter sentences would improve clarity. Simplify complex phrases wherever necessary, and define technical terms where needed.
Remarks: The author has replaced the long-winded sentences with short, concise, and precise sentences.
# Literature references and sufficient field background/context provided.
Comments: The manuscript makes references to relevant literature; however, it lacks an outlook and overview of how the field has been developed in general and Arabic NLP in particular. The introduction needs to be more focused on presenting an in-depth background and narrative regarding how this research fits into the broader landscape of existing works while emphasizing the gaps that it will attempt to fill.
Remarks: The author provided an overview of the Arabic NLP development which effectively portrays the background of how the study fits into the broader landscape of existing studies.
# Professional article structure, figures, and tables. Raw data shared.
Comments: The manuscript follows a standard professional structure, with clear headings and subheadings, which helps the reader navigate the paper. Nonetheless, figures and tables should be incorporated where applicable, with high-resolution and clear captions to articulate their relevance to the text.
Remarks: The author did not incorporate tables and figures. Although they are not mandatory, including charts is advisable for clear presentation.
# Is the review of broad and cross-disciplinary interests within the scope of the journal?
Comments: The manuscript is of broad and cross-disciplinary interest, specifically within the domain of natural language processing (NLP), artificial intelligence, and linguistics. The manuscript articulates the advancements and challenges confronted in AQAS, pinpointing the unique linguistic properties of Arabic that differentiate it from other languages.
Remarks: No comments
# Has the field been reviewed recently? If so, is there a good reason for this review?
Comments: The Arabic Question Answering Systems field has witnessed some recent reviews, specifically, concentrating on advancements in deep learning and machine learning techniques. In that light, the manuscript offers a fresh perspective by synthesizing findings from 12 studies published between 2018 and 2023, which is a relatively recent timeframe.
Remarks: No Comments
# Does the Introduction adequately introduce the subject and make it clear who the audience is/what the motivation is?
Comments: The Introduction of the manuscript efficiently presents the subject of the Arabic Question Answering System, highlighting the significance of the study in the context of AI and NLP. It clearly describes the audience as practitioners and researchers in these domains, and educators interested in the impact of AI technologies for Arabic language processing.
Remarks: No Comments
# Formal results should include clear definitions of all terms and theorems, and detailed proofs.
Comments: The manuscript presents formal results that describe key concepts and terms relevant to AQAS; however, it could benefit from clearer definitions of specific terms and theorems used across the text.
Remarks: The author has elaborated more on the operational terms and theorems that may be confusing to non-technical audiences.

Experimental design

# Article content is within the Aims and Scope of the journal and article type.
Comments: The manuscript seems to coincide with the Journal’s scope, concentrating on advancements in Arabic Question Answering Systems.
Remarks: No comments
# Rigorous investigation performed to a high technical & ethical standard.
Comments: The investigation appears to have been conducted rigorously. Nevertheless, ethical considerations regarding data usage, as well as the inclusion and exclusion procedure for the chosen studies, should be explicitly stated.
Remarks: The author has articulated the ethical considerations, which guarantee that the research adheres to ethical principles. Besides, the author conveyed the inclusion and exclusion protocols for selecting the studies.
# Methods described with sufficient detail & information to replicate.
Comments: The methodology employed is rigorous and robust, but it would be beneficial if the author indicated the criteria used to select the database to source the studies.
Remarks: No Comments
# Is the Survey Methodology consistent with a comprehensive, unbiased coverage of the subject? If not, what is missing?
Comments: The manuscript indicates that 12 studies were selected based on relevance, and publication date, but does not mention the protocol used to affirm balanced representation and unbiased coverage.
Remarks: The author appears to adopt a systematic literature review which is a proven methodology of affirming balanced representation.
# Are sources adequately cited? Quoted or paraphrased as appropriate?
Comments: All sources are cited properly adhering to the contemporary APA formatting.
Remarks: The author should consider adding the latest sources dated 2024, to enhance relevance.
# Is the review organized logically into coherent paragraphs/subsections?
Comments: Retrospectively, the manuscript is arranged systematically into articulate subsections and paragraphs.
Remarks: No comments

Validity of the findings

# Impact and novelty.
Comments: The manuscript sufficiently assesses the implications and novelty of its findings. A clearer explanation regarding how the research contributes to the field, comprising possible applications or implications is included.
Remarks: No comments
# Conclusions are well stated, linked to the original research question & limited to supporting results.
Comments: The conclusions are explicitly linked to the research questions presented in the introduction. All claims made in the conclusions are supported by the results presented.
Remarks: No further comments
# Is there a well-developed and supported argument that meets the goals set out in the Introduction?
Comments: The argument is generally well-structured; it has depth in articulating how the findings correlate to existing literature. However, more connections to prior studies can further strengthen the argument.
Remarks: The author has further clarified how previous studies relate to the present study.
# Does the Conclusion identify unresolved questions/gaps / future directions?
Comments: The conclusion effectively identifies unresolved gaps that can be further explored in the future such as the scarcity of high-quality diverse datasets.
Remarks: No comments

Additional comments

Consider expanding the discussion of current limitations in AQAS technologies and the implications of such limitations that might create demands for real applications. Besides, rigorous proofreading for grammatical errors and technical terminology will enhance overall professionalism.

Reviewer 2

Basic reporting

- Regardless of the improvements made to this version of the article, the issue of inadequate coverage remains. Your study reviews only 12 papers in a field with over 60 relevant studies, spanning both dataset-oriented and approach-oriented research. This number is barely acceptable for a literature review, especially given the availability of numerous high-quality studies addressing question answering (QA) in Arabic.
- Additionally, your study excludes papers published in 2024 (even though I found one in this survey, which conflicts with the stated coverage period), which feature significant advancements, including the use of large language models for QA and data generation. To be considered for publication as a survey, this paper must provide comprehensive coverage of all relevant work in the QA field, particularly recent studies that reflect the latest developments.

- The final recommendations of this work are overly superficial and excessively general: "research directions focused on developing lightweight, efficient models, enhancing semantic analysis, and ensuring the fairness and equity of AI applications." While these areas are important, they lack specificity and actionable insights. Despite acknowledging substantial progress, the study identifies gaps in handling linguistic nuances, the scarcity of annotated datasets, and the limited exploration of innovative AI techniques but fails to provide concrete strategies or focused directions to address these challenges effectively.
- The use of subtitles such as “Understanding the Components of a Question Answering System” and “Significance of Arabic Question Answering Systems (AQAS)” in the introduction section is unconventional for scientific writing. These titles should be reconsidered or integrated more seamlessly into the narrative flow.
- The paragraph titled “Significance of Arabic Question Answering Systems (AQAS)” does not address the significance of Arabic QA systems as the title suggests. Instead, it primarily discusses the challenges of handling the Arabic language. The content should align with the title, or the title should be revised to reflect the actual focus.
- The repetition of abbreviations, such as redefining Question Answering Systems (QAS) and Arabic Question Answering Systems (AQAS) multiple times throughout the manuscript, is unnecessary and disrupts the flow. Define abbreviations once, preferably at their first mention, and use them consistently thereafter.
- The statement “Arabic is spoken by about 330 million people and is one of the United Nations’ 134 official languages” requires proper citation to substantiate the claim.
- The section “Challenges in Arabic Natural Language Processing” is a repetition of “Overview of Arabic NLP Development”.
- The section “Overview of Arabic NLP Development” (line 149) fails to deliver a meaningful overview of the stages of NLP development. Instead, it provides a very general introduction discussing why NLP is important, its applications, and the need to address language nuances. However, it lacks a clear and structured staging of Arabic NLP development, particularly in relation to question answering (QA). A comprehensive overview should delineate the historical milestones, key advancements, and trends in Arabic NLP, as well as the specific developments in QA systems.
- One of the papers you describe is ArQuAD, and you incorrectly claim that the dataset covers dialectal Arabic. In reality, the dataset is built from Wikipedia and does not include dialectal content. Such inaccuracies are not acceptable in a research paper, as they misrepresent the referenced work and undermine the credibility of the study. Accurate and thorough verification of cited datasets and their characteristics is essential to ensure the reliability and quality of the research.
- Precision, Recall, and F-score are not commonly used as evaluation metrics for QA systems. Including them without justification or clarification misrepresents standard practices in the field. Furthermore, having two additional sections discussing evaluation metrics without substantial new insights adds redundancy rather than value to the paper.

Experimental design

- Overall evaluation: The survey contains significant redundancy and lacks clarity, making it difficult to follow and understand its contributions. The information presented is often vague, repetitive, and poorly structured. Improving the paper requires a thorough restructuring, ensuring that the content is clearly classified, logically organized, and supported by accurate, verified information. All of these issues must be addressed before this paper can be considered for publication.
-

Validity of the findings

-

Additional comments

-

Reviewer 3

Basic reporting

The authors have addressed my concerns.

Experimental design

The authors have addressed my concerns.

Validity of the findings

The authors have addressed my concerns.

Version 0.1 (original submission)

Academic Editor

Major Revisions

There are many comments that you must address. The comments from Reviewer 2 are the most critical, so you should pay particular attention to them.

[# PeerJ Staff Note: It is PeerJ policy that additional references suggested during the peer-review process should *only* be included if the authors are in agreement that they are relevant and useful #]

·

Basic reporting

# Unambiguous, professional English is used throughout.
Comments: The manuscript essentially adopts technical language suitable for its targeted audience. Nonetheless, there are instances of long-winded sentences that may hamper clarity, for example in lines 65-73. Breaking these into shorter sentences would improve clarity. Simplify complex phrases wherever necessary, and define technical terms where needed.
# Literature references and sufficient field background/context provided.
Comments: The manuscript makes references to relevant literature; however, it lacks an outlook and overview of how the field has been developed in general and Arabic NLP in particular. The introduction needs to be more focused on presenting an in-depth background and narrative regarding how this research fits into the broader landscape of existing works while emphasizing the gaps that it will attempt to fill.
# Professional article structure, figures, and tables. Raw data shared.
Comments: The manuscript follows a standard professional structure, with clear headings and subheadings, which helps the reader navigate the paper. Nonetheless, figures and tables should be incorporated where applicable, with high-resolution and clear captions to articulate their relevance to the text.
# Is the review of broad and cross-disciplinary interests within the scope of the journal?
Comments: The manuscript is of broad and cross-disciplinary interest, specifically within the domain of natural language processing (NLP), artificial intelligence, and linguistics. The manuscript articulates the advancements and challenges confronted in AQAS, pinpointing the unique linguistic properties of Arabic that differentiate it from other languages.
# Has the field been reviewed recently? If so, is there a good reason for this review?
Comments: The Arabic Question Answering Systems field has witnessed some recent reviews, specifically, concentrating on advancements in deep learning and machine learning techniques. In that light, the manuscript offers a fresh perspective by synthesizing findings from 12 studies published between 2018 and 2023, which is a relatively recent timeframe.
# Does the Introduction adequately introduce the subject and make it clear who the audience is/what the motivation is?
Comments: The Introduction of the manuscript efficiently presents the subject of the Arabic Question Answering System, highlighting the significance of the study in the context of AI and NLP. It clearly describes the audience as practitioners and researchers in these domains, and educators interested in the impact of AI technologies for Arabic language processing.
# Formal results should include clear definitions of all terms and theorems, and detailed proofs.
Comments: The manuscript presents formal results that describe key concepts and terms relevant to AQAS; however, it could benefit from clearer definitions of specific terms and theorems used across the text.

Experimental design

# Article content is within the Aims and Scope of the journal and article type.
Comments: The manuscript seems to coincide with the Journal’s scope, concentrating on advancements in Arabic Question Answering Systems.
# Rigorous investigation performed to a high technical & ethical standard.
Comments: The investigation appears to have been conducted rigorously. Nevertheless, ethical considerations regarding data usage, as well as the inclusion and exclusion procedure for the chosen studies, should be explicitly stated.
# Methods described with sufficient detail & information to replicate.
Comments: The methodology employed is rigorous and robust, but it would be beneficial if the author indicated the criteria used to select the database to source the studies.
# Is the Survey Methodology consistent with a comprehensive, unbiased coverage of the subject? If not, what is missing?
Comments: The manuscript indicates that 12 studies were selected based on relevance, and publication date, but does not mention the protocol used to affirm balanced representation and unbiased coverage.
# Are sources adequately cited? Quoted or paraphrased as appropriate?
Comments: All sources are cited properly adhering to the contemporary APA formatting.
# Is the review organized logically into coherent paragraphs/subsections?
Comments: Retrospectively, the manuscript is arranged systematically into articulate subsections and paragraphs.

Validity of the findings

# Impact and novelty.
Comments: The manuscript sufficiently assesses the implications and novelty of its findings. A clearer explanation regarding how the research contributes to the field, comprising possible applications or implications is included.
# Conclusions are well stated, linked to the original research question & limited to supporting results.
Comments: The conclusions are explicitly linked to the research questions presented in the introduction. All claims made in the conclusions are supported by the results presented.
# Is there a well-developed and supported argument that meets the goals set out in the Introduction?
Comments: The argument is generally well-structured; it has depth in articulating how the findings correlate to existing literature. However, more connections to prior studies can further strengthen the argument.
# Does the Conclusion identify unresolved questions/gaps / future directions?
Comments: The conclusion effectively identifies unresolved gaps that can be further explored in the future such as the scarcity of high-quality diverse datasets.

Additional comments

Consider expanding the discussion of current limitations in AQAS technologies and the implications of such limitations that might create demands for real applications. Besides, rigorous proofreading for grammatical errors and technical terminology will enhance overall professionalism.

Reviewer 2

Basic reporting

This work surveys AI-based Arabic Question Answering (QA) Systems, reviewing recent methodologies and challenges in the field. It explores various algorithms, models, and datasets used in Arabic QA and highlights gaps in the current research landscape. The study aims to guide future research by outlining potential areas for improvement and further investigation. The survey also discusses the role of AI, machine learning, and deep learning in enhancing Arabic QA systems.

However, even though this survey aims to present the current state of Arabic QA research, it suffers from several drawbacks. The research questions are too generic, making them less focused and difficult to address meaningfully. The survey lists related works without providing a comprehensive comparison or critical analysis, making it challenging to derive clear conclusions or actionable insights. Additionally, the evaluation metrics mentioned are not suitable for QA tasks, and important resources and models have been overlooked, which undermines the completeness and accuracy of the survey. Furthermore, many points in the literature review are supported by only one or two papers, which is insufficient to provide a thorough and balanced overview of the field. The following are the details of these issues.

1. While the paper does not contain a significant number of spelling errors, there are notable issues with the language and presentation that detract from its quality. Specifically:
-- The paper repeatedly defines the same abbreviations, which disrupts the flow and coherence of the text. This redundancy suggests a lack of careful editing and gives an impression of disorganized content presentation.
-- Assigning a title to each paragraph, especially when not necessary, can interrupt the narrative flow and make the paper seem fragmented. This stylistic choice is unconventional and reduces the readability of the paper.
-- The combination of repetitive abbreviation definitions and an excessive use of paragraph titles gives the impression that large portions of the paper may have been generated using automated tools, such as AI. This detracts from the perceived authenticity and originality of the work.
-- In general, many sections of the paper, such as “Research Design,” “Analysis Techniques,” and “Dataset Characteristics,” are presented as simple lists of bullet points. This format is not suitable for a scholarly paper, as it lacks the necessary depth, narrative flow, and critical analysis expected in academic writing. Sections should be written in coherent paragraphs that discuss and analyze the information in a structured manner, rather than merely listing points. This will enhance the readability and scholarly value of the paper by providing a more engaging and comprehensive exploration of the topics.

2. In defining the components of a QA system, you state: “and may include answer validation to assess confidence levelsinclude answer validation to assess confidence levels.” This sentence is unclear, particularly the phrase “assess confidence levels include answer validation.” I would like to clarify that in the context of machine learning, the term “validation” typically refers to the process of evaluating a model using a separate dataset, known as a validation set. This step is crucial for tuning the model’s parameters and assessing performance before final testing. If your intent was to discuss this aspect of model evaluation, it would be more appropriate to explicitly reference the validation set in the context of model training and evaluation. However, I don’t believe this paragraph is the right place to introduce this concept. I recommend revising the sentence to clearly distinguish between the concepts of answer validation within a QA system and the use of a validation set during the training phase of machine learning models.

3. As this is a survey reviewing the literature on Arabic QA, it should include a classification of the machine reading comprehension tasks, such as span-extraction, cloze, multiple-choice, and open-ended QA. This information is essential to guide the reader and clarify the focus of this survey within the broader landscape of QA research.

4. While you mentioned the complex structure of the Arabic language as a factor hindering research in Arabic QA, I disagree with this point. Most machine learning and deep learning models are language-agnostic. This idea does not adequately support the survey’s argument, especially since there is no mention in the survey of specific methods for handling Arabic’s complex structure and morphology.

5. As you mentioned, this survey examines twelve studies published between 2018 and 2023. One of the key strengths of any survey is its ability to cover all significant work published in the field. I am concerned that this survey omits several important contributions to Arabic QA. To enhance the comprehensiveness of the survey, I recommend conducting a thorough search across various databases and sources to ensure that all relevant works are included. This will provide a more complete and valuable overview of the field.

6. The goal of this survey is not clearly defined in the introduction. Currently, it states: “The purpose is to emphasize the achievements of advanced computational techniques, identify key gaps and difficulties, and suggest actionable directions for further research. This project intends to support the development of more effective and equitable AQAS by concentrating on the integration of varied datasets, innovative modeling methodologies, and appropriate management of linguistic complications.” However, this statement is too broad and lacks specificity. Given that the survey primarily focuses on span-extraction machine reading comprehension, it would be beneficial to explicitly state this focus in the introduction. Additionally, the survey should clarify its scope, particularly in relation to the various types of information retrieval and machine reading comprehension tasks that are not addressed. A more focused and precise objective would help guide the reader and align the content with the survey’s intended contribution.

7. In the literature review section, you have only one reference addressing the challenges of the Arabic language. This is insufficient to constitute a comprehensive literature review; one reference is too few to adequately cover this point. Moreover, I don’t believe that the literature review is the appropriate section to discuss this issue, as it does not align with the typical focus of a literature review, which is to summarize and analyze previous research comprehensively.
8. The paper claims that the study by Alkhurayyif and Wahab Sait (2023) conducted a systematic literature review of 617 articles, meticulously selecting 40 of them. Relying on another survey to describe the current status of research is not appropriate. If the existing survey already provides a comprehensive overview, it begs the question: why should readers invest time in your survey? To justify its relevance, the current survey must offer new insights or perspectives that are not covered in the previous one. Otherwise, its contribution to the field remains unclear and redundant. I recommend either accurately reflecting the content and scope of the cited study or explicitly stating how this survey builds upon and differentiates itself from the previous work. This will help clarify its unique value and avoid redundancy, ensuring that it adds meaningful contributions to the research community.
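The distinction drawn in comment 2 can be illustrated with a minimal Python sketch. It is not taken from the manuscript or the review: the function names `split_dataset` and `validate_answer`, the 0.5 confidence threshold, the 20% hold-out fraction, and the toy candidates are all hypothetical, chosen only to contrast the two meanings of “validation”.

```python
import random


def split_dataset(examples, val_fraction=0.2, seed=0):
    """Validation set (training phase): hold out a fraction of the data
    for tuning the model before final testing."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    # First n_val shuffled items are held out; the rest are used for training.
    return shuffled[n_val:], shuffled[:n_val]


def validate_answer(candidates, threshold=0.5):
    """Answer validation (inside a QA pipeline): return the best-scoring
    candidate only if its confidence reaches the threshold, else abstain."""
    if not candidates:
        return None
    best_text, best_score = max(candidates, key=lambda c: c[1])
    return best_text if best_score >= threshold else None


# Hypothetical data for illustration only.
examples = [f"q{i}" for i in range(10)]
train, val = split_dataset(examples)

answer = validate_answer([("Cairo", 0.91), ("Giza", 0.40)])
abstained = validate_answer([("Giza", 0.40)])
```

In this sketch, `split_dataset` belongs to model development, whereas `validate_answer` runs at inference time on each question; the two share a name but not a purpose.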

Experimental design

9. The author cannot rely on a single study to showcase the current taxonomy of available resources for Arabic QA. There are new datasets that significantly surpass the ones mentioned, such as Abdallah, Abdelrahman, et al., "ArabicaQA: A Comprehensive Dataset for Arabic Question Answering." Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024, and Obeidat, Rasha, et al. "ArQuAD: An Expert-Annotated Arabic Machine Reading Comprehension Dataset." Cognitive Computation (2024): 1-20, among others. These recent resources represent substantial advancements in the field and should be considered to provide a more up-to-date and comprehensive overview of Arabic QA datasets. Ignoring these contributions diminishes the survey's relevance and completeness in capturing the current landscape of Arabic QA research.
10. The paragraph titled "Utilization of BERT in Arabic QA Systems" discusses the utilization of BERT for enhancing Arabic QA systems. The discussion is extensive but heavily focused on a single study (Mahdi, 2021). While BERT's impact on Arabic QA is significant, the survey overlooks other important neural and language model-based approaches that have been developed in recent years. Relying solely on one paper to illustrate the current approach taxonomy is insufficient and presents an incomplete picture of the field. There are several other models and techniques, such as GPT, T5, and hybrid models combining different neural network architectures, which have demonstrated promising results in Arabic QA tasks. Additionally, recent advancements in multilingual models and transfer learning should also be considered. The survey would benefit from a broader and more balanced exploration of these models to provide a comprehensive understanding of the current state of Arabic QA research. I recommend incorporating a wider range of studies and models to accurately reflect the diversity of approaches being applied in this area. This will ensure that the survey captures a more complete and nuanced view of the field.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.


11. The evaluation metrics presented in the survey—precision, recall, F-score, and accuracy—are not appropriate for MRC tasks, which typically use metrics like Exact Match and F1-score. Additionally, the survey does not cover the evaluation of IR systems, which are a critical component of QA systems. These omissions lead to inaccuracies in the information provided. Including the correct metrics for both MRC and IR is necessary to enhance its accuracy and comprehensiveness.
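
As a minimal sketch of the two span-extraction metrics named in this comment, Exact Match and token-level F1 could be computed roughly as follows. This assumes a simplified normalization (lowercasing and whitespace tokenization only); evaluation scripts for Arabic typically also strip diacritics and punctuation, which is omitted here. The function names `normalize`, `exact_match`, and `token_f1` are hypothetical and are not taken from the manuscript or the review.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumption: simplified normalization (lowercase + whitespace split).
    # Real Arabic MRC scripts usually also remove diacritics and punctuation.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    # 1.0 only when the normalized token sequences are identical.
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    # Harmonic mean of token-level precision and recall between the two spans.
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike classification accuracy, token-level F1 gives partial credit when a predicted span overlaps the gold answer without matching it exactly, which is why the two metrics are usually reported together for MRC.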

Validity of the findings

12. The research questions should focus on specific aspects of Arabic QA rather than repeatedly connecting them to “this study.” The constant use of “this study” makes the questions sound restrictive and less generalizable. Additionally, the questions are too generic and lack focus. They don't clearly define what specific issues or aspects of Arabic QA are being explored, which makes them difficult to address effectively. Research question 4, "RQ4 How do AI, ML, and DL contribute to the development or performance of AQAS?" is an example. There is no clear method to answer these questions, as they seem to require merely listing related works without sufficient discussion or critical analysis. This approach hinders the ability to draw meaningful summaries or conclusions, making the purpose and direction of the research ambiguous. For the research questions to be impactful, they should be more targeted, well-defined, and capable of guiding a structured and in-depth exploration of Arabic QA.

13. In the "Survey Methodology" section, the authors claim to have used a rigorous methodology to analyze AI-based Arabic QA Systems, emphasizing extensive descriptions of algorithms, models, and frameworks. They describe the approach as being similar to conducting a systematic review to address specific research topics and test hypotheses. From this description, I expected to see empirical results from experiments conducted by the authors of this study. However, what I found instead was merely a listing of related works corresponding to each research question, without any real comparison, analysis, or clear conclusions. This lack of in-depth analysis and the absence of original experimental results is misleading and does not align with the methodology described. A true systematic review or rigorous analysis should include a comprehensive comparison and synthesis of findings, along with a clear discussion of their implications, which is currently missing in this work.

Reviewer 3 ·

Basic reporting

1. The manuscript is written in clear, professional, and unambiguous English. The text is mostly accessible, but there are sections where the clarity of expression could be improved to ensure easier comprehension by a broader audience.

To improve the clarity of expression for broader comprehension, here are a few suggestions based on common areas where complexity might hinder understanding in such reviews:

Technical Jargon:

Example: "The study delves deeply into Artificial Intelligence (AI)-based Arabic Question Answering Systems (AQAS), emphasizing the use of Machine Learning (ML) and Deep Learning (DL) technologies."
Suggestion: Simplify by explaining abbreviations early on. For example:
"The study explores systems designed to answer questions in Arabic using advanced AI methods, including machine learning (ML) and deep learning (DL)."

Complex Sentences:

Example: "A careful analysis of twelve qualifying studies identifies considerable advances in the use of advanced computational approaches to address Arabic's distinctive linguistic problems."
Suggestion: Break down into shorter sentences:
"The analysis focuses on twelve studies published between 2018 and 2023. These studies reveal significant progress in using computational methods to tackle the unique challenges of the Arabic language."

2. The study provides a comprehensive introduction to the field of Arabic Question Answering Systems (AQAS), clearly establishing the context and significance of the research. It covers the challenges in Arabic NLP, emphasizing the complexities of the language and the need for tailored solutions.

However, some good recent papers on Arabic QA are still missing and should be cited:

Abdallah, Abdelrahman, et al. "ArabicaQA: A comprehensive dataset for Arabic question answering." Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024.

Alwaneen, T. H., Azmi, A. M., Aboalsamh, H. A., Cambria, E., & Hussain, A. (2022). Arabic question answering system: a survey. Artificial Intelligence Review, 55(1), 207-253.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.

Literature Review: The manuscript reviews the relevant literature comprehensively, covering studies from 2018 to 2023. It discusses various methods, models, and datasets used in AQAS, providing a detailed assessment of the current state of research.

However, good papers on Arabic QA are still missing. Please see the previous comment.

The structure of the manuscript aligns with the standards of literature review articles, offering a logical flow from background information through the analysis of selected studies to conclusions and recommendations.

The review addresses a gap in the literature concerning Arabic Question Answering Systems, which is a niche but important area for advancing natural language processing technologies. It would be of interest to researchers and practitioners in AI and NLP fields, particularly those working with Arabic or low-resource languages.

Experimental design

The study is well-aligned with the journal's aim, focusing on a comprehensive review of Arabic Question Answering Systems (AQAS). It systematically covers innovations and challenges, providing insights into how advanced computational techniques are applied to address the complexities of the Arabic language.

The study follows a systematic review methodology, analyzing 12 selected research studies published between 2018 and 2023. The selection criteria, which include publication date, relevance, and contribution to the field, are outlined, ensuring a focused review on recent advancements in AQAS.

Information was gathered from a pool of 71 papers, with 12 studies being selected for in-depth analysis. The data collection process emphasizes a focus on diverse methodologies and conclusions, ensuring a comprehensive review of recent AQAS advancements.

However, some good recent papers on Arabic QA are still missing and should be cited:

Abdallah, Abdelrahman, et al. "ArabicaQA: A comprehensive dataset for Arabic question answering." Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024.

Alwaneen, T. H., Azmi, A. M., Aboalsamh, H. A., Cambria, E., & Hussain, A. (2022). Arabic question answering system: a survey. Artificial Intelligence Review, 55(1), 207-253.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.

Validity of the findings

The conclusion clearly identifies several gaps that remain in the field of AQAS. It points out the ongoing challenges, such as the scarcity of high-quality annotated datasets that reflect the diverse dialects of Arabic and the computational demands of state-of-the-art models like BERT.
Additionally, the study outlines future directions, such as the need for more research into hybrid AI models that combine different approaches for better semantic understanding. It also calls for efforts to improve computational efficiency through lightweight models that can operate effectively in resource-limited environments.
The recommendation to focus on fairness and equity in AI models for Arabic QA, especially across different dialects and sociolinguistic groups, highlights a thoughtful consideration of the broader impact of these technologies.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.