All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
The authors have addressed the previous reviews.
[# PeerJ Staff Note - this decision was reviewed and approved by Mehmet Cunkas, a PeerJ Section Editor covering this Section #]
No further comments
None
None
None
N/A
N/A
N/A
We thank the authors for addressing our previous reviews, and have no more comments to make.
There are remaining issues.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
No further comments
The paper can be accepted
None
None
None
no comment
1. For "choice-level network consensus" η, C_i and I_i might be used in the text description starting Line 624, to prevent confusion with the terminology in Equation 4.
2. C_y and C_x are then stated to be the predictions for N_x and N_y. However, neural network predictions are generally not integer values; they reflect probability estimates (from the softmax layer). It might therefore be clarified whether and how the N_{x/y} predictions are converted to class labels (e.g., by thresholding). If by thresholding, it might also be discussed whether the definition of η is optimal, since it would map both {Nr_x = Nr_y = 0.51} and {Nr_x = Nr_y = 1.00} (where Nr is the raw prediction) to the same η, despite the models being very uncertain in the first case and very certain in the second.
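To make point 2 concrete, here is a minimal sketch in plain Python. The function names, the 0.5 threshold, and the confidence-weighted variant are illustrative assumptions, not the authors' implementation; the sketch only shows how a hard-label consensus collapses very different confidence levels to the same value.

```python
def hard_consensus(p_x: float, p_y: float, threshold: float = 0.5) -> int:
    """Agreement after thresholding raw softmax outputs to class labels."""
    c_x = int(p_x > threshold)
    c_y = int(p_y > threshold)
    return int(c_x == c_y)  # 1 if both networks predict the same label

def confidence_weighted_consensus(p_x: float, p_y: float,
                                  threshold: float = 0.5) -> float:
    """One possible alternative: scale agreement by how far each raw
    prediction sits from the decision boundary (0 = guessing, 1 = certain)."""
    agree = hard_consensus(p_x, p_y, threshold)
    confidence = (2 * abs(p_x - threshold)) * (2 * abs(p_y - threshold))
    return agree * confidence

# The two cases from the comment: both collapse to the same hard
# consensus of 1, although the first pair is barely above chance.
uncertain = hard_consensus(0.51, 0.51)   # -> 1
certain = hard_consensus(1.00, 1.00)     # -> 1
```

Under the weighted variant, the (0.51, 0.51) pair scores near zero while the (1.00, 1.00) pair scores 1, preserving the uncertainty information that η, as currently defined, appears to discard.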
no comment
We thank the authors for addressing some of our previous comments. However, a couple of points pertaining to experimental design have not been answered.
Three reviewers have reviewed the submission. One reviewer recommended rejection. However, in my opinion, in the context of PeerJ, the manuscript can still be revised, and I encourage the authors to address the reviewers' concerns.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
This study proposed a preliminary Trustworthiness Indicator (Φ) to quantify reliability and trustworthiness numerically.
The paper makes a good contribution.
Minor comments:
- The title is too long. Revise.
- Table 1 is not clear. Revise.
- Add more recent studies, such as "Interpretable Deep Learning-Based AI Framework with Multi-Sequence Attention for Brain Tumor Subtype Classification in MRI Scans," Journal of Cyber Security and Risk Auditing.
- The proposed system is clear.
- The analysis is clear.
- Add future work in the conclusion.
The article is written in clear and professional English, making it accessible to the intended audience. The introduction provides a solid context for the use of AI in digital health, specifically in computer-aided diagnosis (CAD) systems for melanoma detection and diabetic retinopathy (DR) classification. The literature is well-referenced, citing relevant works (e.g., Vaswani et al., 2017; Hassija et al., 2024) to establish the background and motivation for the study. The structure generally conforms to PeerJ standards, with clear sections for introduction, methods, results, and discussion. However, there are minor issues:
Suggestions for Improvement:
The introduction could better articulate the specific gap in trustworthiness metrics that this study aims to address. While it discusses the black-box nature of deep learning and the need for trust, it lacks a concise statement of the problem being solved (e.g., why existing metrics are insufficient).
Some sections, such as the description of datasets (EyePACS and ISIC), could benefit from more detail about data preprocessing steps or potential biases (e.g., noise in EyePACS labels is mentioned but not elaborated).
There are minor typographical errors (e.g., "Diabetic Retitroppathy" on page 2, "melan 17" on page 4, inconsistent formatting of references like "3: 3: 3:" on page 22). These should be corrected for clarity and professionalism.
The article is within the scope of PeerJ Computer Science, focusing on AI applications in healthcare. The investigation is rigorous, with a clear methodology for evaluating convolutional neural networks (CNNs) using the ISIC and EyePACS datasets. The methods section details the use of Grad-CAM for interpretability and the Structural Similarity Index Measure (SSIM) for the trustworthiness indicator, which is appropriate for the study’s goals. The computing infrastructure (AMD Ryzen, NVIDIA RTX 4000, MATLAB R2023b) is described, and the code is made available on GitHub and Zenodo, ensuring reproducibility. However, there are areas for improvement:
Suggestions for Improvement:
The preprocessing steps (contrast enhancement, hole erosion, image segmentation) are briefly introduced but lack sufficient detail for replication. For example, specific parameters or algorithms used for contrast enhancement or watershed segmentation are not provided.
The evaluation metrics (e.g., accuracy, precision, F1-score, Hamming Loss) are well-chosen, but the article should clarify how these metrics were calculated for the multiclass (DR) versus binary (melanoma) classification tasks. Table 5 is referenced but not fully provided in the document, which hinders assessment of the results.
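On the multiclass point, one plausible convention is macro averaging: compute a one-vs-rest F1 per class and take the unweighted mean. The sketch below (plain Python, hypothetical function names; not the authors' actual procedure) shows the scheme the manuscript could simply name and cite.

```python
def f1_for_class(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
```

For the binary melanoma task, macro F1 and per-class F1 coincide up to the choice of positive class; for the multi-grade DR task, macro, micro, and weighted averages can diverge substantially, which is why stating the convention matters.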
The selection of CNN architectures (e.g., AlexNet, DenseNet-121) is justified, but the rationale for excluding other modern architectures (e.g., EfficientNet, Vision Transformers) could be better explained.
The study proposes a novel Trustworthiness Indicator (Φ) to quantify the reliability of CNNs in medical imaging, focusing on melanoma and DR classification. The experiments are well-designed, using public datasets (ISIC and EyePACS) and Grad-CAM to visualize model decision-making. The findings are supported by the data, with reported accuracies (e.g., 93.5% for DenseNet-121 in DR classification) and SSIM-based consensus measurements. The argument for the need for trustworthiness metrics is well-developed, aligning with the introduction’s goals. The conclusion identifies limitations (e.g., not exploiting anti-correlated activations) and future directions (e.g., extending to numerical data), which strengthen the study.
Suggestions for Improvement:
The article mentions high accuracy but varying feature learning across networks (Figure 4). This variability could be explored further to strengthen the case for the Trustworthiness Indicator. For example, how does the indicator correlate with clinical reliability?
The evaluation of the Trustworthiness Indicator (Φ) relies on SSIM, but the article notes that negative SSIM values are nullified, which may limit its effectiveness. This limitation should be discussed in more depth, perhaps with a sensitivity analysis.
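To illustrate why the nullification matters: a single-window (global) SSIM between two perfectly anti-correlated activation maps is close to -1, and clamping it at zero treats this strong disagreement signal the same as mere unrelatedness. The code below is an illustrative sketch only; the global-window simplification and the stability constants are assumptions, not the paper's exact SSIM configuration.

```python
def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM over two flattened activation maps."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / \
           ((mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2))

heatmap_a = [0.0, 1.0, 0.0, 1.0]
heatmap_b = [1.0, 0.0, 1.0, 0.0]        # anti-correlated with heatmap_a

raw = global_ssim(heatmap_a, heatmap_b)  # close to -1
nullified = max(0.0, raw)                # the clamping step: 0.0
```

A sensitivity analysis could, for instance, compare the clamped indicator against a variant that retains the signed SSIM, to quantify how much information the nullification discards.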
The results section could benefit from a clearer presentation of quantitative outcomes (e.g., a complete Table 5 or a summary of key metrics across models). The current truncation of results makes it difficult to fully assess performance.
The proposed Trustworthiness Indicator (Φ) is a valuable contribution to the field of explainable AI (XAI) in healthcare, addressing a critical need for reliable and interpretable AI systems. The use of Grad-CAM and SSIM to quantify model consensus is innovative, and the application to real-world medical datasets enhances its relevance. However, the article would benefit from:
A clearer explanation of how Φ differs from or complements existing XAI metrics (e.g., SHAP, LIME).
More discussion on the practical implications of Φ for clinicians (e.g., how it could be integrated into CAD systems).
Addressing potential biases in the datasets (e.g., demographic representation in ISIC or label noise in EyePACS) to ensure generalizability.
The Introduction section sufficiently introduces the motivation for a preliminary Trustworthiness Indicator metric Φ for medical imaging, in particular with convolutional neural networks (CNNs). The writing is generally clear and detailed. However, it is recommended that Φ be explained and emphasized earlier in the manuscript, since it is the paper's primary contribution.
From the Trustworthiness Definition subsection of the Methods section, Φ is defined in terms of the local trustworthiness of a network pair, which is in turn based on the difference between "correlation-distance between activation features" and "choice-level network consensus".
1. For "choice-level network consensus" η, C_i and I_i might be used in the text description starting Line 624, to prevent confusion with the terminology in Equation 4.
2. C_y and C_x are then stated to be the predictions for N_x and N_y. However, neural network predictions are generally not integer values; they reflect probability estimates (from the softmax layer). It might therefore be clarified whether and how the N_{x/y} predictions are converted to class labels (e.g., by thresholding). If by thresholding, it might also be discussed whether the definition of η is optimal, since it would map both {Nr_x = Nr_y = 0.51} and {Nr_x = Nr_y = 1.00} (where Nr is the raw prediction) to the same η, despite the models being very uncertain in the first case and very certain in the second.
3. From Tables 6 and 7, Φ is not maximized (i.e. Φ=1.00) even when both CNNs are the same (along the table diagonals). Given that the CNN outputs should be identical in these cases, this might be clarified.
While it is claimed in the Aims and scope of the work subsection that Φ surpasses standard XAI techniques such as Grad-CAM, SHAP and LIME, there currently appears to be no direct evidence for this primary claim.
In particular, for the Results section, Table 5 contains raw performance results (not directly relevant to XAI), Tables 6 and 7 contain Φ values for CNN pairs (descriptive, but not in itself a demonstration of Φ's utility), while Table 8 shows additional metrics (for DR only) without relating them to Φ.
There is a subsection on the correlation between Φ and classical performance metrics, which suggests that the malignant class exhibits more dispersed Φ values. However, there is currently no detailed quantitative evaluation of this possible effect, no analysis of how it might be exploited to improve classification performance, and no comparison against Grad-CAM/SHAP/LIME on any metric.
Some section references are missing (i.e. Section ??), in the Introduction.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.