Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on June 3rd, 2025 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on July 9th, 2025.
  • The first revision was submitted on September 3rd, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • A further revision was submitted on October 28th, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • A further revision was submitted on November 7th, 2025 and was reviewed by 1 reviewer and the Academic Editor.
  • The article was Accepted by the Academic Editor on November 13th, 2025.

Version 0.4 (accepted)

Academic Editor

Accept

Cohen's kappa is often used to measure interrater reliability. Rater reliability indicates how well the study's data represent the variables measured, and interrater reliability measures how consistently data collectors (raters) score the same variable. Interrater reliability has been quantified in many ways, but traditionally it was calculated as percent agreement: the number of scores in agreement divided by the total number of scores. The appropriateness of kappa thresholds for health studies has been questioned; under Cohen's interpretation, a kappa of 0.41 (the lower bound of "moderate" agreement) may be acceptable for health research. Thank you for your contribution.
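For concreteness, the percent-agreement and kappa calculations the editor describes can be sketched in plain Python; the rater labels below are invented for illustration:

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters assign the same score."""
    agree = sum(a == b for a, b in zip(rater_a, rater_b))
    return agree / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Observed agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters independently pick the same label.
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical scores from two raters over six X-ray images.
a = ["normal", "abnormal", "normal", "normal", "abnormal", "normal"]
b = ["normal", "abnormal", "abnormal", "normal", "abnormal", "normal"]
p_o = percent_agreement(a, b)  # observed agreement
kappa = cohens_kappa(a, b)     # chance-corrected agreement (lower than p_o)
```

Under the widely cited Landis–Koch benchmarks, kappa in 0.41–0.60 counts as "moderate" agreement, which is why 0.41 is sometimes taken as a minimum for health research.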

[# PeerJ Staff Note - this decision was reviewed and approved by Mehmet Cunkas, a PeerJ Section Editor covering this Section #]

Reviewer 1

Basic reporting

Looks good

Experimental design

Looks good

Validity of the findings

Looks good

Additional comments

Looks good. Thanks.

Version 0.3

Academic Editor

Minor Revisions

Please address the following issue: the manuscript has been improved, but the text in the modified Figure 1 is too small. Please enlarge the figure or split the modules into additional figures.

Reviewer 1

Basic reporting

The manuscript has been improved; however, the text in the modified Figure 1 is too small. Please enlarge the figure or split the modules into additional figures.

Experimental design

NA

Validity of the findings

NA

Additional comments

NA

Version 0.2

Academic Editor

Major Revisions

Please address the requests and criticisms thoroughly.

**PeerJ Staff Note:** Please ensure that all review and editorial comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**Language Note:** The review process has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff

Reviewer 1

Basic reporting

The introduction and background are okay.

Experimental design

This paper proposes an attention-enhanced deep learning framework integrating self-supervised learning, feature fusion, SIFT-based feature selection, and ensemble classifiers to achieve robust and interpretable musculoskeletal and chest X-ray analysis. However, it mainly involves combining self-attention within the well-known Xception and Inception models, and the explanations are insufficient, requiring further clarification.

- In Figure 1 and its description, it is unclear where and how the attention mechanism is actually integrated. The text mentions that it is placed between the convolutional and batch normalization layers, but the figure does not show an attention layer or batch normalization layers, making its implementation ambiguous.

- What are the output dimensions from the Xception and Inception models? How many features were selected through SIFT feature extraction? The explanation of the SIFT-based feature extraction process is vague and would benefit from a more detailed technical description. Furthermore, there seems to be no clear evidence that reducing features in this way improves performance, so an additional experiment would strengthen the paper.

Validity of the findings

There seems to be no clear evidence that reducing features using SIFT in this way improves performance, so an additional experiment would strengthen the paper.

Additional comments

The confusion matrices presented in the figures provide limited information relative to the space they occupy, and it may be better to move them to the supplementary material.

Reviewer 2

Basic reporting

The manuscript's structure could be significantly improved to enhance clarity and flow. Specifically, I recommend consolidating the presentation and interpretation of results under a single main section, "Results and Discussion." The current findings should be organized into logical subsections within this new unified section. This approach would create a more coherent narrative by immediately discussing the implications of each result, eliminating the need for a separate, repetitive Discussion section. For instance, the content currently in Section 4 could be restructured under this new heading.

The manuscript still requires thorough proofreading to address issues with English grammar, syntax, and technical phrasing. I recommend that the authors seek professional editing services to ensure the language meets the journal's publication standards. Clear and precise writing is essential for the reader to fully understand the scientific merit of the work.

Experimental design

Please report the training time of each deep learning model using the current experimental setup.

Validity of the findings

-

Additional comments

The revisions made by the authors look satisfactory. They address almost all of my concerns and improve the manuscript's quality. I recommend the article for acceptance with minor modifications.

Version 0.1 (original submission)

Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**Language Note:** When you prepare your next revision, please either (i) have a colleague who is proficient in English and familiar with the subject matter review your manuscript, or (ii) contact a professional editing service to review your manuscript. PeerJ can provide language editing services - you can contact us at [email protected] for pricing (be sure to provide your manuscript number and title). – PeerJ Staff

Reviewer 1

Basic reporting

-

Experimental design

-

Validity of the findings

-

Additional comments

This paper attempts to improve the performance of a CNN-based classification model based on X-ray data by combining it with an attention mechanism. While the research direction is meaningful, the description of the proposed model is not clear, and there are several inconsistencies and ambiguities. Specific comments are as follows.

- Figure 1: The font size of the schematic is too small to be readable, and the content and description of the figure are not specific enough to be understandable; it is unclear where and how the attention layers fit into the model. Figure 2 has similar issues. The paper explains that “Fusion 1 for the humerus” and “Fusion 2 for the wrist” are combined, but it is not clear which parts of the model are combined, or how; neither the diagram nor the description clearly shows this process. It is also not explained why t-SNE is applied to the combined features: t-SNE is a dimensionality reduction technique that projects high-dimensional data into a lower dimension for visualization, so using it for feature selection at a later stage does not seem appropriate. The schematic nevertheless shows it this way, and the rationale for this approach needs to be explained.

- The paper states that they applied self-supervised learning, but there is absolutely no mention of what pretext task they used or what loss function they used for learning. This severely reduces the reliability and reproducibility of the proposed learning method.

- Line 224: Ensemble Prediction via Majority Voting: According to Figure 1, this step is located after t-SNE and Grad-CAM, but there is no clear explanation of which features are finally selected and input to classifiers such as Tree, KNN, SVM, etc. A specific description of the feature cleaning and structuring process prior to classification is needed.

- Figure 2: The flow of spatial information (height × width) is not tracked. The interpretation of the self-attention mechanism is difficult because it is not visually or mathematically explained how the Query, Key, and Value vectors are derived and interact. It is also unclear where exactly the attention module is inserted in the XceptionNet architecture. The three convolutional layers represented in the schematic are not clearly labeled, and it is difficult to know which layer of the backbone model they refer to.

- Line 284: Please remove the stray “textbf” (a residual LaTeX command).

- Regarding the presentation of results, including a confusion matrix for every method and task would be excessively space-consuming. It may be more appropriate to provide them as Supplementary Materials.

- Line 405: Please correct the question-mark (?) character error, likely an unresolved reference.

- Figure 16: The attention map results look particularly problematic: in many cases, the red areas that the model is focusing on appear in background areas outside the lungs. This suggests that the model is likely responding to ambient noise, rather than anything of diagnostic significance.

- The purpose of applying t-SNE is unclear. There is a lack of explanation of what the visualization is trying to show and how it supports the conclusions of the study.
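For context on the Query/Key/Value interaction the reviewer asks the authors to clarify, scaled dot-product self-attention can be sketched in a few lines of plain Python; the matrix sizes and projection weights below are illustrative, not the paper's:

```python
import math

def matmul(A, B):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of feature vectors X.

    Q, K, and V are linear projections of the same input X; each output vector
    is a weighted average of the value vectors, with weights
    softmax(Q K^T / sqrt(d)).
    """
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    Kt = list(map(list, zip(*K)))                      # transpose of K
    scores = matmul(Q, Kt)                             # pairwise similarity
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 positions, 2 features each
I2 = [[1.0, 0.0], [0.0, 1.0]]             # identity projections, for illustration
out = self_attention(X, I2, I2, I2)       # each row is a convex mix of rows of X
```

In a CNN, the spatial grid (height × width) is typically flattened into the sequence dimension before this computation, which is exactly the spatial-information flow the reviewer asks to be tracked.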

Reviewer 2

Basic reporting

1. Lack of Methodological Innovation and Justification in Introduction
There is no clear methodological innovation presented in the proposed work. Transfer learning combined with self-attention mechanisms has been widely explored since 2018 (e.g., Soydaner, Shaw et al., Guo et al., Li et al., Usama et al., 2020). Similarly, feature selection is not a novel concept. The introduction lacks sufficient detail regarding the novelty or motivation behind the study. Specifically, the authors should expand upon the knowledge gap being addressed in lines 76–81 to better justify the relevance and contribution of this research.

2. Language and Writing Quality
The English language throughout the manuscript requires a thorough review. Several complex sentences are difficult to understand and could benefit from rephrasing for clarity and conciseness.

3. Need for a Clear Link Between Limitations and Proposed Solutions
The authors must explicitly state how the current study addresses the limitations identified in the related works section, particularly Limitations #1 and #2:

a. Accuracy Across Applications – The claim that attention mechanisms do not consistently yield accurate results across applications is vague. The authors should clarify what is meant by “accurate” and what accuracy threshold attention-based studies using the same dataset would need to exceed to support this claim.

b. Transfer Learning with Greyscale Images – While it is mentioned that previous approaches using ImageNet-based transfer learning may be unsuitable for greyscale images due to their distinct characteristics, the paper fails to specify what steps were taken in the current study to address this issue beyond experimenting with freezing layers. A more detailed explanation of the design choices and preprocessing techniques tailored for greyscale inputs is required.

Experimental design

1. Inadequate Ablation Study and Lack of Clarity
The ablation study does not include results obtained without feature selection on joint tasks, which limits the ability to assess its impact. Additionally, the differences between the second and third Ablation Studies are not clearly defined. It is also unclear whether the attention mechanism was included in Ablation Study 2. This lack of transparency undermines the validity of the comparisons and conclusions drawn from these experiments.

2. Impact of Attention Layers on Model Complexity and Real-World Applicability
Incorporating multiple attention layers within a CNN framework can lead to several practical challenges, including overfitting due to model complexity, increased computational overhead, and longer training times, some of which are acknowledged in the limitations section. These issues may hinder deployment in real-time diagnostic settings. It is essential that the authors investigate and report whether they attempted to determine the optimal number of attention layers needed to achieve a balance between model performance and efficiency.

3. Insufficient Dataset Description and Overfitting Concerns
The description of the chest X-ray dataset used for training and testing is incomplete (e.g., the training/test split ratio is not given). Notably, if the test accuracy reached 100%, the training and validation accuracies should also be reported to assess potential overfitting. A more comprehensive discussion of data augmentation strategies, class distribution, and cross-validation procedures is necessary to validate the robustness of the results.

Validity of the findings

-

Additional comments

Minor Comments

1. Section Title Correction
Section 4 should be titled "Validation of the Proposed Method" instead of "Approve", which appears to be an error.

2. Language, Writing Quality, and Section Transitions
Overall, the article is poorly written. The English language throughout the manuscript requires a thorough review. Several complex sentences are difficult to understand and could benefit from rephrasing for clarity and conciseness. It is somewhat challenging to follow the article's flow as the transitions between sections are not well maintained.

3. Image Resolution and Figure Quality
The resolution of the majority of the figures, particularly Figure 1 depicting the framework architecture, is too low and should be enhanced for better readability and presentation. Moreover, the manuscript contains unnecessary figures. The authors should consider reducing the number of confusion matrices if the derived metrics are already presented in other tables.

4. Formatting Issues
There are missing table numbers and residual LaTeX commands present in the text. The authors are strongly advised to carefully proofread and format the manuscript before submitting the revised version.

Reviewer 3

Basic reporting

The manuscript is written in clear English.
The introduction effectively introduces the topic of X-ray image classification, especially focusing on the challenges posed by musculoskeletal radiographs and chest X-rays. The authors provide a thorough overview of the context, specifically addressing issues such as low grayscale contrast, overlapping anatomical structures, and noise in the images, which affect diagnostic reliability. The motivation for using attention mechanisms in convolutional neural networks (CNNs) to improve classification accuracy and interpretability is clearly stated. The last part of the Introduction, starting from line 75, could be restructured to separate the literature references from the achievements of the authors.

The manuscript references a wide range of relevant literature, including recent advancements in deep learning techniques, attention mechanisms, and their application in medical imaging. These references are well-chosen and support the discussion of the state-of-the-art. The connection to previous work is clear, and the gaps that the current study aims to address are highlighted. The literature also informs the choice of methodology, demonstrating how prior research has influenced the design of this study.
The structure of the manuscript adheres to the typical structure required by PeerJ Computer Science. The manuscript includes well-defined sections such as the introduction, methodology, experimental evaluation, and results. Each section is appropriately labeled, and the use of subsections adds clarity, making the paper easy to follow. However, the indentation of paragraphs should be refined to match the journal template.

Experimental design

The article falls well within the scope of the PeerJ Computer Science journal, addressing deep learning techniques for medical image classification, specifically X-ray analysis, which is highly relevant to the journal's focus on AI applications in computational science.

The study demonstrates a rigorous investigation, employing advanced deep learning techniques, including attention mechanisms and ensemble learning, to enhance X-ray image classification. The methods are well-defined, and the experimental setup adheres to high technical standards.
The methods are described with sufficient detail to enable replication. The authors provide clear descriptions of the deep learning framework, including the CNN models (Xception, InceptionResNetV2), self-supervised learning approach, attention mechanisms, and feature fusion techniques.

The paper discusses data preprocessing, particularly addressing challenges like noise in X-ray images and the need for explainable AI (XAI). The use of self-supervised learning for domain-specific pre-training and feature selection techniques is well-explained, enhancing the robustness of the models. However, if possible, more explicit details on data augmentation or additional preprocessing steps would be beneficial.

The evaluation methods are clearly described, with a comprehensive set of performance metrics, including accuracy, recall, precision, specificity, F1-score, and Cohen’s Kappa. These metrics are appropriate for assessing classification performance, and the use of confusion matrices further supports a clear understanding of model performance.
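As a side note on these metrics: each can be derived directly from the cells of the confusion matrix, which is why separate confusion-matrix figures and metric tables largely duplicate each other. A minimal sketch for the binary case, with invented counts:

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard classification metrics derived from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Hypothetical counts: 90 true positives, 10 false positives,
# 5 false negatives, 95 true negatives.
m = binary_metrics(tp=90, fp=10, fn=5, tn=95)
```

Cohen's kappa can likewise be computed from the same four counts by comparing observed agreement with the agreement expected by chance.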

The model selection is adequately explained, with a rationale provided for choosing specific CNN architectures (Xception, InceptionResNetV2) based on their performance and robustness. The authors also justify the integration of attention mechanisms and ensemble learning to enhance performance. The ablation studies support the decision to refine these models further.

Sources are adequately cited, with both direct quotations and paraphrased references to prior work, ensuring that the context and rationale for the methodology are supported by relevant literature.

Validity of the findings

While the manuscript does not explicitly assess the broader impact and novelty of the findings, the proposed framework’s integration of attention mechanisms into CNNs for medical X-ray classification represents a meaningful contribution to the field. The novelty lies in the combination of attention-based visualization with hierarchical feature fusion and ensemble learning to improve both the accuracy and interpretability of deep learning models in medical imaging. This approach addresses important limitations in the current automated X-ray interpretation systems, such as overfitting and poor generalization.

The rationale for using attention mechanisms and self-supervised learning to handle challenges like noise and limited labeled data in medical X-ray images is clearly stated, with a detailed explanation of the framework's potential benefits for clinical adoption. The paper also encourages replication by describing the methodology and evaluation metrics in sufficient detail.
The experiments and evaluations are well conducted, with validation using two distinct datasets (MURA and Chest X-ray datasets). The authors employ evaluation methods, such as confusion matrices and performance metrics, which clearly demonstrate the effectiveness of the proposed framework. The results show strong performance across multiple tasks, including classification of both musculoskeletal and chest X-rays, reinforcing the validity of the findings.

The argument is well-developed and consistently aligns with the goals set out in the introduction. The authors aim to improve robustness, accuracy, and interpretability in medical X-ray classification, and the experimental results support these objectives. The attention mechanisms and ensemble learning strategies significantly enhance model performance.

The conclusion is well-stated and tied to the results. However, the conclusion could benefit from a clearer discussion of unresolved questions or limitations. While the study achieves high classification accuracy, the authors should acknowledge the potential challenges in applying the model to real-world clinical settings, such as variations in image quality or data distribution. Future directions could include exploring the integration of this framework into clinical practice, as well as addressing any limitations related to dataset diversity, rare conditions, and model scalability.

Additional comments

The manuscript presents a robust study on improving X-ray image classification using deep learning techniques, and there are several areas where the authors could make improvements to enhance the clarity and readability of their work:

1. Introduction Restructuring
While the introduction effectively sets the context, I suggest restructuring the last part, starting from line 75, to better separate the literature references from the authors' own achievements. This would help in clearly distinguishing between what has been previously achieved in the field and what new contributions the authors are proposing.

2. Paragraph Indentation
The manuscript follows the general structure of PeerJ Computer Science, but the indentation of paragraphs should be aligned with the journal's template. This is a formatting detail that will improve the visual appeal and adherence to the journal’s guidelines.

3. Data Preprocessing Details
The paper discusses data preprocessing, especially addressing noise and the need for explainable AI (XAI), but it would benefit from more explicit details on data augmentation or additional preprocessing steps. Including these, if possible, would further clarify the preparation steps taken before training the models and enhance reproducibility.

4. Clearer Discussion of Limitations and Future Directions
The conclusion could benefit from a more explicit discussion of the study’s limitations, such as potential challenges when applying the model to real-world clinical settings. This would provide a more balanced perspective. Additionally, the authors could include potential future directions for their research, such as integrating the framework into clinical practice, dealing with rare conditions, and improving model scalability.

These improvements will help enhance the manuscript's clarity, completeness, and alignment with the journal's expectations.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.