Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on March 26th, 2025 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on May 28th, 2025.
  • The first revision was submitted on July 26th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • A further revision was submitted on August 25th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • The article was Accepted by the Academic Editor on September 11th, 2025.

Version 0.3 (accepted)

· Sep 11, 2025 · Academic Editor

Accept

I am pleased to inform you that your work has now been accepted for publication in PeerJ Computer Science.

I agree that a comparative discussion of LIME and SHAP should be included, as requested by one reviewer. I recommend acceptance of this manuscript, provided that the authors add this discussion.

Please be advised that you cannot add or remove authors or references post-acceptance, regardless of the reviewers' request(s).

Thank you for submitting your work to this journal. I look forward to your continued contributions on behalf of the Editors of PeerJ Computer Science.

With kind regards,

[# PeerJ Staff Note - this decision was reviewed and approved by Sedat Akleylek, a PeerJ Section Editor covering this Section #]

**PeerJ Staff Note:** Although the Academic and Section Editors are happy to accept your article as being scientifically sound, a final check of the manuscript shows that it would benefit from further English editing. Therefore, please identify necessary edits and address these while in proof stage.

·

Basic reporting

The paper uses good and clear English. The writing is easy to understand and professional. The authors give a good summary of past studies and explain how their work fits in. The paper is well-organized, and the figures and tables are in the right places and clearly labeled. (The needed corrections were made.)

Experimental design

The paper shows new and original research that fits well with the goals of the PeerJ Computer Science journal. The research question is clear and focuses on an important and current topic. The authors clearly explain what is missing in past studies and how their work helps fill that gap. The methods are explained in enough detail so that other researchers can repeat the study, which is important for checking the results and doing more research in the future.

Validity of the findings

The authors have made all underlying data available, demonstrating transparency. The data is well-controlled, statistically sound, and supports the robustness of the study's findings. The conclusions are clear and make sense. They match the main research question and are based only on the results, without guessing or adding extra ideas.

Additional comments

The paper can be accepted in its current form.

Reviewer 2 ·

Basic reporting

Thank you to the authors for revising the manuscript according to the reviewers' comments from the previous round. This revision has addressed the required changes. However, the Explainable AI Results section should include a comparative discussion of LIME and SHAP regarding their effectiveness in explaining the model in this study.

Experimental design

The revised manuscript has addressed all the concerns raised in the previous review round.

Validity of the findings

The Explainable AI Results section should include a comparative discussion of LIME and SHAP regarding their effectiveness in explaining the model in this study.
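
For illustration, a minimal sketch of how such a side-by-side comparison could be produced on a single test instance, assuming a fitted scikit-learn-style classifier and tabular features; `model`, `X_train`, `X_test`, `feature_names`, and the class names are placeholders, not the study's actual pipeline:

```python
# Minimal sketch of a LIME vs. SHAP comparison on one test instance.
# Assumes a fitted scikit-learn-style classifier `model`, NumPy arrays
# X_train / X_test, and a list `feature_names`; all names are placeholders.
import shap
from lime.lime_tabular import LimeTabularExplainer

# SHAP: model-agnostic KernelExplainer over a small background sample
background = shap.sample(X_train, 100)
shap_explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = shap_explainer.shap_values(X_test[0:1])  # one instance

# LIME: local surrogate model fitted around the same instance
lime_explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["benign", "malware"],  # placeholder class names
)
lime_exp = lime_explainer.explain_instance(X_test[0], model.predict_proba, num_features=10)

# Compare which features each method ranks highest for this prediction
print("SHAP values (per class):", shap_values)
print("LIME top features:", lime_exp.as_list())
```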

Version 0.2

· Aug 15, 2025 · Academic Editor

Minor Revisions

The concerns raised by the reviewers have been only partially addressed. The manuscript still needs further clarification regarding the following:

a) How can the different perspectives (possibly different features) identified by CNN-GRU-3 be considered convincing? This is especially questionable given the enormous difference in training time (0.55 seconds vs. 5410.91 seconds).

b) Reduce the number of tables and figures; some are not required. For example, Table 1 could be folded into the related work section; in its current form, the table is difficult to grasp and unreadable.

c) What is the computational complexity of the proposed approach?

d) Provide further technical details (perhaps another statistical test) on why XGB reaches 100% in the recall performance metric.
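
For point (d), one way to add statistical context to a 100% recall figure is an exact binomial (Clopper-Pearson) confidence interval on the malware-class detection rate. A minimal sketch, assuming statsmodels is available; the counts are placeholders from a hypothetical confusion matrix, not the study's actual numbers:

```python
# Hedged sketch: exact binomial (Clopper-Pearson) confidence interval on recall.
# `n_malware` and `n_detected` are placeholder counts from a confusion matrix.
from statsmodels.stats.proportion import proportion_confint

n_malware = 800     # malicious samples in the test split (placeholder)
n_detected = 800    # of which correctly detected (observed recall = 100%)

low, high = proportion_confint(n_detected, n_malware, alpha=0.05, method="beta")
print(f"recall = {n_detected / n_malware:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
# Even with 100% observed recall, the lower bound stays below 1.0,
# quantifying how strongly the "perfect" figure depends on sample size.
```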

These issues require a minor revision. If you are prepared to undertake the work required, I would be pleased to reconsider my decision. Please submit a list of changes or a rebuttal against each point that is being raised when you submit your revised manuscript.

·

Basic reporting

The paper uses good and clear English. The writing is easy to understand and professional. The authors give a good summary of past studies and explain how their work fits in. The paper is well-organized, and the figures and tables are in the right places and clearly labeled. (The needed corrections were made.)

Experimental design

The paper shows new and original research that fits well with the goals of the PeerJ Computer Science journal. The research question is clear and focuses on an important and current topic. The authors clearly explain what is missing in past studies and how their work helps fill that gap. The methods are explained in enough detail so that other researchers can repeat the study, which is important for checking the results and doing more research in the future.

Validity of the findings

The authors have made all underlying data available, demonstrating transparency. The data is well-controlled, statistically sound, and supports the robustness of the study's findings. The conclusions are clear and make sense. They match the main research question and are based only on the results, without guessing or adding extra ideas.

Additional comments

The paper can be accepted in its current form.

Reviewer 3 ·

Basic reporting

The content of the article is rich, but the issue of too many figures and tables remains.

Experimental design

no comment

Validity of the findings

no comment

Version 0.1 (original submission)

· May 28, 2025 · Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

·

Basic reporting

The paper is titled “A Hybrid CNN-GRU Model with XAI-Driven Interpretability using LIME and SHAP for Static Analysis: Combining Spatial and Sequential Features in Malware Detection” and aims to develop a malware classification framework that integrates Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU) with explainable AI techniques (LIME and SHAP) to improve both accuracy and interpretability in static malware detection.

The topic is a current and active one in the cybersecurity and machine learning fields, and it fits within the scope of the PeerJ Computer Science journal.

Experimental design

POSITIVE SIDES:

The paper presents a good approach by integrating deep learning and explainable AI techniques for malware classification using static analysis. One of the strengths of the study is its clear and systematic use of hyperparameter tuning, which contributes to the model’s performance optimization. The authors provide detailed information on the tuning process and explicitly display the hyperparameter ranges used for various models (e.g., CNN, GRU, XGB, etc.). This transparency improves reproducibility and allows for better understanding of model configuration choices.

Furthermore, the paper effectively illustrates the outcomes of Explainable Artificial Intelligence (XAI) methods, particularly SHAP and LIME. These tools are utilized to interpret model predictions and to identify which features contribute most significantly to the classification results. The inclusion of SHAP summary plots, dependence plots, and LIME visualizations strengthens the reliability of the model and enhances the interpretability of the results, which is a crucial aspect in security-critical applications such as malware detection.

Overall, the paper's attention to detail in hyperparameter optimization and its commitment to interpretability through XAI techniques are commendable, making it a valuable contribution to the field.

Validity of the findings

The topic of the paper is original and well-described, and it presents several important findings supported by experimental results. The integration of deep learning and explainable AI techniques for malware classification is both timely and relevant.

However, I have several objections regarding the current form of the manuscript, which should be addressed before it can be considered for acceptance. These include issues related to methodological clarity, algorithm presentation, evaluation consistency, and comparative analysis with existing literature.

Additional comments

ON THE OTHER HAND, the following points should be corrected:

MAJOR 1)
Algorithm 1 needs significant clarification and restructuring:
- The variable Xtrain_output is confusing; this is typically referred to as y_train.
- There are two functions mentioned, but it is unclear what each returns.
- The use of SelectKBest() is ambiguous; please specify whether it is a built-in or a custom function.
- Line 15 simply states "validation" without explaining what is being validated or how.
Overall, the pseudocode lacks clarity and proper structure. I recommend rewriting the entire algorithm using standard algorithmic formatting and naming conventions.

MAJOR 2)

Figures 6 and 7 present the confusion matrices for ML and DL models. However, the paper also states that a 5-fold cross-validation approach was used.
It is unclear how the confusion matrix values were derived in this context, especially considering the dataset consists of 8,000 records after augmentation.

If 5-fold cross-validation was indeed applied, the reported confusion matrices should either reflect averaged results across the folds or be clearly labeled as representative of a single fold.
Currently, this inconsistency raises questions about the evaluation process.
The authors should clarify how these matrices were generated and how they relate to the cross-validation strategy.
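
For illustration, one common way to make this relationship explicit is to build a single confusion matrix from out-of-fold predictions, so that every record is predicted exactly once across the 5 folds. A minimal sketch, assuming a scikit-learn-compatible estimator; `model`, `X`, and `y` are placeholders rather than the study's actual pipeline:

```python
# Minimal sketch: one aggregate confusion matrix from out-of-fold predictions
# over 5-fold cross-validation. `model`, `X`, and `y` are placeholders.
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import confusion_matrix

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_pred = cross_val_predict(model, X, y, cv=cv)  # each sample predicted exactly once
cm = confusion_matrix(y, y_pred)                # covers all records across the folds
print(cm)
```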

MAJOR 3)
The authors only compare their proposed models with other systems developed within this study.
However, there is no comparative evaluation with existing works in the literature.
While it’s understandable that previous studies may have used different datasets, it would still be valuable to see a statistical overview or a performance summary of these works to contextualize the results.
Including such a comparison would strengthen the paper by demonstrating how the proposed method performs relative to state-of-the-art approaches, even if only qualitatively.

MAJOR 4)

The authors state that Principal Component Analysis (PCA) is applied for dimensionality reduction, aiming to preserve maximum variance while streamlining the feature set. However, the impact of PCA is not quantitatively discussed. Specifically:

- How much does PCA improve classification performance (e.g., accuracy, F1-score)?
- What is its effect on training and testing time?
- Was there a performance comparison with and without PCA?

Including this analysis would help the reader understand whether PCA meaningfully contributes to model efficiency or accuracy and justify its inclusion in the pipeline.
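
For illustration, a minimal sketch of how such an ablation could be run with and without the reduction step, assuming scikit-learn; the classifier choice, `X`, and `y` are placeholders, not the study's actual models:

```python
# Hedged sketch: quantifying the effect of PCA by cross-validating the same
# classifier with and without the reduction step. Placeholders throughout.
import time
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

pipelines = {
    "without PCA": make_pipeline(RandomForestClassifier(random_state=42)),
    "with PCA": make_pipeline(PCA(n_components=0.95),  # keep 95% of variance
                              RandomForestClassifier(random_state=42)),
}

for name, pipe in pipelines.items():
    start = time.perf_counter()
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1_macro")
    elapsed = time.perf_counter() - start
    print(f"{name}: F1 = {scores.mean():.4f} +/- {scores.std():.4f}, time = {elapsed:.1f}s")
```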

MAJOR 5)

The authors mention that SelectKBest with f_classif is used to select the top 100 features.
However, it is unclear why the value 100 was chosen.
Was this number determined empirically, through experimentation, or based on prior literature?
A brief justification—possibly supported by a performance comparison using different k values—would strengthen the methodological transparency of the study.
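
For illustration, a minimal sketch of how the choice of k could be justified empirically by sweeping several values and cross-validating a downstream classifier, assuming scikit-learn; the classifier choice, `X`, and `y` are placeholders:

```python
# Hedged sketch: sweeping k for SelectKBest(f_classif) and comparing
# cross-validated accuracy. `X`, `y`, and the classifier are placeholders.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

for k in (25, 50, 100, 200, 400):
    pipe = make_pipeline(SelectKBest(f_classif, k=k),
                         LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"k={k:4d}: accuracy = {scores.mean():.4f} +/- {scores.std():.4f}")
```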

MINOR 1)

The CNN-GRU-3 model gave the best result among all models; however, it is not mentioned anywhere earlier in the manuscript.

MINOR 2)
"...and analyze the efficiency of the dataset over hybrid models" -- what do you mean by the efficiency of the dataset?

MINOR 3)
"Cakir and Dogdu Cakir and Dogdu (2018) used word2vec" -- the author names appear to be duplicated in this citation; please correct it.

MINOR 4)
Algorithm 1 is currently presented as a figure, which makes it difficult to interpret and lacks the clarity expected in academic writing.
It should be rewritten as a text-based algorithm using standard pseudocode formatting.
This would improve readability, allow for proper referencing of steps, and help clarify function definitions, input/output variables, and logical flow.

Reviewer 2 ·

Basic reporting

The manuscript would benefit from improved figure placement. Specifically, Figures 6 to 14 are not positioned near the corresponding discussions in the text, which may disrupt the reading flow and reduce clarity. It is recommended that all figures be placed close to where they are first referenced. In addition, Figure 1, which presents comparative data, may be more appropriate as a table rather than a figure to enhance readability.

Furthermore, the "Related Studies" section should be expanded with more detailed analysis. A literature review matrix comparing related works in terms of methodologies, strengths, and limitations would significantly improve this section and better highlight the novelty of the current work relative to existing studies on malware detection using deep learning and XAI.

Finally, the title is too long; the authors should consider making it more concise.

Experimental design

The manuscript would be strengthened by explicitly stating the research questions that guide the study. This is particularly important to highlight the advantages of employing Explainable AI (XAI) and the hybrid CNN-GRU model. The research questions should be evaluated in the Experimental Results and Analysis section.

Validity of the findings

To strengthen the validity of the proposed approach, the paper should include a comparison with related studies. It is recommended to provide a comparative table presenting performance metrics of the proposed method alongside those of existing methods evaluated on the same dataset. The role of SHAP and LIME should be emphasized in this study.

Reviewer 3 ·

Basic reporting

The manuscript contains an excessive number of figures and tables, which may hinder readability rather than enhance it.

For instance, the content of Table 1 could be succinctly described within the main text, without the need for a standalone table.

Additionally, Figures 6 and 7 appear to hold limited significance in the context of this study, as the models generally perform well and do not require further detailed discussion. Since Table 5 already provides a sufficient overview of the error distribution, it is recommended to remove Figures 6 and 7, and delete the corresponding descriptions in lines 425–436 to streamline the manuscript.

Furthermore, the reference to “Figure 10” in line 471 should be corrected to “Figure 11” to ensure consistency in figure numbering.

Experimental design

Although the use of CountVectorizer and PCA is mentioned, there is no in-depth discussion of the selection of dimensions, the potential information loss, or their relationship to model performance. It is recommended to supplement the study with experimental comparisons of results under different dimensional settings.

The data augmentation method should be clearly specified—whether it involves duplication or synthetic data generation (e.g., SMOTE or GAN-based). Otherwise, it may affect the accuracy of the evaluation.
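
For illustration, a minimal sketch of one of the synthetic options the reviewer names (SMOTE), assuming the imbalanced-learn package is available; `X_train` and `y_train` are placeholders, and this is not necessarily the augmentation the authors actually used:

```python
# Hedged sketch of SMOTE-based oversampling (one option the reviewer mentions).
# `X_train` and `y_train` are placeholders for the study's training split.
from collections import Counter
from imblearn.over_sampling import SMOTE

print("before:", Counter(y_train))
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("after: ", Counter(y_resampled))
# Oversampling only the training split (never the test set) keeps the
# evaluation unaffected by synthetic samples.
```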

Validity of the findings

Table 5 presents the results of the 10 methods evaluated in the study, all of which achieved accuracy rates above 96%. However, the manuscript does not provide any statistical tests or significance analysis, making it difficult to determine whether the observed differences are statistically meaningful. Among these methods, the XGB model achieved the highest accuracy at 99.81%. Moreover, both XGB and CNN reached a Recall of 100%, indicating that they successfully detected all malicious samples—an outcome with substantial implications for practical defense systems.

In contrast, the proposed CNN-GRU 3 model, despite its higher complexity, did not demonstrate superior performance. According to the principle of Occam’s Razor, which favors simpler solutions when performance is comparable, the proposed model may not represent the most effective approach.
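
For illustration, a minimal sketch of one suitable significance analysis for the comparison the reviewer asks for (McNemar's test on paired predictions from two models over the same test samples), assuming statsmodels; `y_true`, `pred_a`, and `pred_b` are placeholder NumPy arrays:

```python
# Hedged sketch: McNemar's test for whether two classifiers' errors differ
# significantly on the same test samples. All array names are placeholders.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

a_correct = (pred_a == y_true)
b_correct = (pred_b == y_true)

# 2x2 agreement/disagreement table between the two models
table = np.array([
    [np.sum( a_correct &  b_correct), np.sum( a_correct & ~b_correct)],
    [np.sum(~a_correct &  b_correct), np.sum(~a_correct & ~b_correct)],
])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic = {result.statistic}, p-value = {result.pvalue:.4f}")
```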

Additional comments

no comment

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.