Review History


All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.


Summary

  • The initial submission of this article was received on June 9th, 2025 and was peer-reviewed by 3 reviewers and the Academic Editor.
  • The Academic Editor made their initial decision on August 11th, 2025.
  • The first revision was submitted on October 10th, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • A further revision was submitted on October 21st, 2025 and was reviewed by 2 reviewers and the Academic Editor.
  • The article was Accepted by the Academic Editor on October 31st, 2025.

Version 0.3 (accepted)

Academic Editor

Accept

The authors have addressed the previous reviews.

[# PeerJ Staff Note - this decision was reviewed and approved by Mehmet Cunkas, a PeerJ Section Editor covering this Section #]

Reviewer 1 ·

Basic reporting

No further comments

Experimental design

None

Validity of the findings

None

Additional comments

None

Reviewer 3 ·

Basic reporting

N/A

Experimental design

N/A

Validity of the findings

N/A

Additional comments

We thank the authors for addressing our previous reviews, and have no more comments to make.

Version 0.2

Academic Editor

Minor Revisions

There are remaining issues.

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

Reviewer 1 ·

Basic reporting

No further comments

The paper can be accepted

Experimental design

None

Validity of the findings

None

Additional comments

None

Reviewer 3 ·

Basic reporting

no comment

Experimental design

1. For the "choice-level network consensus" η, C_i and I_i might be used in the text description starting at Line 624, to prevent confusion with the terminology in Equation 4.

2. C_y and C_x are then stated to be the predictions for N_x and N_y. However, neural network predictions are generally not integer values but probability estimates (from the softmax layer). It should therefore be clarified whether and how the N_{x/y} predictions are converted to class labels (e.g. by thresholding). If thresholding is used, it is worth discussing whether the definition of η is optimal, since it would map both {Nr_x = Nr_y = 0.51} and {Nr_x = Nr_y = 1.00} (where Nr is the raw prediction) to the same η, even though the models are very uncertain in the first case and very certain in the second.
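To make the concern concrete, here is a minimal sketch contrasting a thresholded consensus with a probability-aware one. Both the thresholded form of η and the `soft_consensus` score below are hypothetical illustrations, not the manuscript's Equation 4:

```python
def eta_thresholded(p_x, p_y, thr=0.5):
    """Hypothetical choice-level consensus: 1 if the two networks'
    thresholded predictions agree, 0 otherwise (an assumed form;
    the manuscript's Equation 4 may differ)."""
    c_x, c_y = int(p_x > thr), int(p_y > thr)
    return 1 if c_x == c_y else 0

def soft_consensus(p_x, p_y):
    """Hypothetical probability-aware alternative: agreement scaled
    by the joint confidence of the two raw predictions."""
    return (1 - abs(p_x - p_y)) * min(p_x, p_y)

# Both pairs collapse to the same thresholded consensus...
print(eta_thresholded(0.51, 0.51))  # 1 (models barely above chance)
print(eta_thresholded(1.00, 1.00))  # 1 (models fully certain)

# ...whereas the probability-aware score separates them:
print(soft_consensus(0.51, 0.51))   # 0.51
print(soft_consensus(1.00, 1.00))   # 1.0
```

Any such confidence-weighted variant would distinguish the two cases that the thresholded definition conflates.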

Validity of the findings

no comment

Additional comments

We thank the authors for addressing some of our previous comments. However, a couple of points pertaining to experimental design have not been answered.

Version 0.1 (original submission)

Academic Editor

Major Revisions

Three reviewers have reviewed the submission. One reviewer recommended rejection. However, in my opinion, in the context of PeerJ, the manuscript can still be revised, and I encourage the authors to address the reviewers' concerns.

**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

Reviewer 1 ·

Basic reporting

This study proposed a preliminary Trustworthiness Indicator (Φ) to quantify reliability and trustworthiness numerically.

The paper makes a good contribution.

Minor comments:
- The title is too long. Revise.
- Table 1 is not clear. Revise.
- Add more recent studies like this study: "Interpretable Deep Learning-Based AI Framework with Multi-Sequence Attention for Brain Tumor Subtype Classification in MRI Scans." Journal of Cyber Security and Risk Auditing.
- The proposed system is clear.
- The analysis is clear.
- Add future work in the conclusion.

Experimental design

-

Validity of the findings

-

Reviewer 2 ·

Basic reporting

The article is written in clear and professional English, making it accessible to the intended audience. The introduction provides a solid context for the use of AI in digital health, specifically in computer-aided diagnosis (CAD) systems for melanoma detection and diabetic retinopathy (DR) classification. The literature is well-referenced, citing relevant works (e.g., Vaswani et al., 2017; Hassija et al., 2024) to establish the background and motivation for the study. The structure generally conforms to PeerJ standards, with clear sections for introduction, methods, results, and discussion. However, there are minor issues:

Suggestions for Improvement:
The introduction could better articulate the specific gap in trustworthiness metrics that this study aims to address. While it discusses the black-box nature of deep learning and the need for trust, it lacks a concise statement of the problem being solved (e.g., why existing metrics are insufficient).

Some sections, such as the description of datasets (EyePACS and ISIC), could benefit from more detail about data preprocessing steps or potential biases (e.g., noise in EyePACS labels is mentioned but not elaborated).

There are minor typographical errors (e.g., "Diabetic Retitroppathy" on page 2, "melan 17" on page 4, inconsistent formatting of references like "3: 3: 3:" on page 22). These should be corrected for clarity and professionalism.

Experimental design

The article is within the scope of PeerJ Computer Science, focusing on AI applications in healthcare. The investigation is rigorous, with a clear methodology for evaluating convolutional neural networks (CNNs) using the ISIC and EyePACS datasets. The methods section details the use of Grad-CAM for interpretability and the Structural Similarity Index Measure (SSIM) for the trustworthiness indicator, which is appropriate for the study’s goals. The computing infrastructure (AMD Ryzen, NVIDIA RTX 4000, MATLAB R2023b) is described, and the code is made available on GitHub and Zenodo, ensuring reproducibility. However, there are areas for improvement:
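As a point of reference for the SSIM-based consensus described above, a global (single-window) SSIM can be sketched in a few lines. This is a simplification: the authors work in MATLAB and SSIM is usually computed over local windows, so actual values will differ in detail.

```python
import numpy as np

def global_ssim(a, b, L=1.0):
    """Single-window (global) SSIM between two equally sized maps with
    dynamic range L; a simplification of the usual windowed SSIM."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

rng = np.random.default_rng(0)
cam = rng.random((7, 7))  # stand-in for a normalised Grad-CAM heatmap

print(global_ssim(cam, cam))        # ~1.0: identical maps agree fully
print(global_ssim(cam, 1.0 - cam))  # negative: anti-correlated maps
```

The second case shows how anti-correlated activation maps drive SSIM negative, which is exactly the regime the article nullifies in the indicator (see the Validity of the findings comments).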

Suggestions for Improvement:
The preprocessing steps (contrast enhancement, hole erosion, image segmentation) are briefly introduced but lack sufficient detail for replication. For example, specific parameters or algorithms used for contrast enhancement or watershed segmentation are not provided.

The evaluation metrics (e.g., accuracy, precision, F1-score, Hamming Loss) are well-chosen, but the article could clarify how these metrics were calculated for multiclass (DR) versus binary (melanoma) classification tasks. Table 5 is referenced but not fully provided in the document, which hinders the assessment of the results.

The selection of CNN architectures (e.g., AlexNet, DenseNet-121) is justified, but the rationale for excluding other modern architectures (e.g., EfficientNet, Vision Transformers) could be better explained.

Validity of the findings

The study proposes a novel Trustworthiness Indicator (Φ) to quantify the reliability of CNNs in medical imaging, focusing on melanoma and DR classification. The experiments are well-designed, using public datasets (ISIC and EyePACS) and Grad-CAM to visualize model decision-making. The findings are supported by the data, with reported accuracies (e.g., 93.5% for DenseNet-121 in DR classification) and SSIM-based consensus measurements. The argument for the need for trustworthiness metrics is well-developed, aligning with the introduction’s goals. The conclusion identifies limitations (e.g., not exploiting anti-correlated activations) and future directions (e.g., extending to numerical data), which strengthen the study.

Suggestions for Improvement:
The article mentions high accuracy but varying feature learning across networks (Figure 4). This variability could be explored further to strengthen the case for the Trustworthiness Indicator. For example, how does the indicator correlate with clinical reliability?

The evaluation of the Trustworthiness Indicator (Φ) relies on SSIM, but the article notes that negative SSIM values are nullified, which may limit its effectiveness. This limitation should be discussed in more depth, perhaps with a sensitivity analysis.

The results section could benefit from a clearer presentation of quantitative outcomes (e.g., a complete Table 5 or a summary of key metrics across models). The current truncation of results makes it difficult to fully assess performance.

Additional comments

The proposed Trustworthiness Indicator (Φ) is a valuable contribution to the field of explainable AI (XAI) in healthcare, addressing a critical need for reliable and interpretable AI systems. The use of Grad-CAM and SSIM to quantify model consensus is innovative, and the application to real-world medical datasets enhances its relevance. However, the article would benefit from:

A clearer explanation of how Φ differs from or complements existing XAI metrics (e.g., SHAP, LIME).

More discussion on the practical implications of Φ for clinicians (e.g., how it could be integrated into CAD systems).

Addressing potential biases in the datasets (e.g., demographic representation in ISIC or label noise in EyePACS) to ensure generalizability.

Reviewer 3 ·

Basic reporting

The Introduction section sufficiently introduces the motivation for a preliminary Trustworthiness Indicator metric Φ for medical imaging, in particular with convolutional neural networks (CNNs). The writing is generally clear and detailed. However, it might be recommended to explain and focus more on Φ earlier in the manuscript, since it is the primary contribution of the paper.

Experimental design

From the Trustworthiness Definition subsection of the Methods section, Φ is defined in terms of the local trustworthiness of a network pair, which is in turn based on the difference between the "correlation-distance between activation features" and the "choice-level network consensus".

1. For the "choice-level network consensus" η, C_i and I_i might be used in the text description starting at Line 624, to prevent confusion with the terminology in Equation 4.

2. C_y and C_x are then stated to be the predictions for N_x and N_y. However, neural network predictions are generally not integer values but probability estimates (from the softmax layer). It should therefore be clarified whether and how the N_{x/y} predictions are converted to class labels (e.g. by thresholding). If thresholding is used, it is worth discussing whether the definition of η is optimal, since it would map both {Nr_x = Nr_y = 0.51} and {Nr_x = Nr_y = 1.00} (where Nr is the raw prediction) to the same η, even though the models are very uncertain in the first case and very certain in the second.

3. From Tables 6 and 7, Φ is not maximized (i.e. Φ=1.00) even when both CNNs are the same (along the table diagonals). Given that the CNN outputs should be identical in these cases, this might be clarified.

Validity of the findings

While the Aims and scope of the work subsection claims that Φ surpasses standard XAI techniques such as Grad-CAM, SHAP and LIME, there currently appears to be no direct evidence for this primary claim.

In particular, for the Results section, Table 5 contains raw performance results (not directly relevant to XAI), Tables 6 and 7 contain Φ values for CNN pairs (which is descriptive but does not in itself show the utility of Φ), while Table 8 shows additional metrics (for DR only) without relating them to Φ.

There is a subsection on the correlation between Φ and classical performance metrics, which suggests that the malignant class exhibits more dispersed Φ values. However, there is currently no detailed quantitative evaluation of this possible effect, no analysis of how it might be exploited to improve classification performance, and no comparison against Grad-CAM/SHAP/LIME on any metric.

Additional comments

Some section references are missing (i.e. "Section ??") in the Introduction.

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.