All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
The authors have revised the manuscript and addressed the reviewers' comments. It should be ready for publication.
[# PeerJ Staff Note - this decision was reviewed and approved by Xiangjie Kong, a PeerJ Section Editor covering this Section #]
The authors have addressed the reviewers' comments in the revised manuscript. However, one reviewer still has some comments that need to be considered before final acceptance.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
No problem.
No problem.
No problem.
The manuscript is well written in clear, professional English, and recent literature references from 2022–2024 are incorporated.
To further strengthen the background and contextual framework, the authors are encouraged to cite recent comprehensive reviews on AI-based breast cancer detection systems that discuss machine learning, deep learning, and vision transformers (e.g., https://link.springer.com/article/10.1007/s11042-024-19620-y). This would better situate MIME-ViT within the current state-of-the-art.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
Additionally, the motivation behind selecting Vision Transformer (ViT) as the baseline for MIME-ViT should be explicitly clarified in the revised manuscript to highlight the rationale and expected advantages guiding the architectural choices.
We appreciate the authors for carefully considering the previous suggestion and adding Table 1 summarizing the key hyperparameters of MIME-ViT.
The study presents novel findings with appropriate statistical robustness. The conclusions are well-supported by the results.
The authors have appropriately addressed the concern regarding the bounding box generation method for IoU calculation. The added discussion on how their approach may underestimate detection performance, especially for irregularly shaped lesions, is well noted. Acknowledging this limitation and indicating directions for future improvement strengthens the validity and transparency of the findings.
The discussion could be enhanced by integrating recent advances in explainability and interpretability of AI models in medical imaging. Notably, recent work employing Explainable AI (XAI) techniques with CNN-based breast cancer detection systems (e.g., https://link.springer.com/article/10.1007/s42979-025-04170-3) highlights the importance of transparency to build clinical trust.
No further issues noted.
The authors are commended for thoroughly addressing prior reviewer comments and improving the manuscript. The integration of Vision Transformers with multiscale analysis is a valuable contribution. For further enhancement, explicit motivation for using ViT as the baseline model for MIME-ViT should be added, clarifying why this architecture was chosen over alternatives. This will help readers appreciate the design philosophy and contextualize MIME-ViT’s performance gains. Including recent references related to AI-assisted breast cancer detection and explainability strengthens the manuscript’s scientific foundation and clinical relevance.
The reviewers have raised some issues with the current version of the manuscript. Please revise accordingly and address their concerns in the revised version.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
Q1. The authors noted in the results that DETR-S did not recognize any lesions, masses, or calcifications, as shown in Table 1. However, Table 2 reports a specificity of 100%. Please explain this result using the formula for specificity.
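For context, the standard confusion-matrix definition (an assumption about the authors' convention, not confirmed by the manuscript) suggests one likely explanation: a model that flags nothing produces no false positives, so its specificity is trivially perfect.

```latex
\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad
FP = 0 \;\Longrightarrow\; \mathrm{Specificity} = \frac{TN}{TN} = 100\%
```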
Q1. In describing model training and implementation, the authors did not give the total number of training epochs or the per-epoch training metrics (such as loss and IoU for the training and validation sets). I hope the authors can add loss and IoU line charts from training, so that readers can follow the process and see the comparison more effectively.
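As an illustration of the kind of chart requested (a minimal matplotlib sketch with placeholder numbers, not data from the manuscript):

```python
import matplotlib.pyplot as plt

# Placeholder per-epoch histories for illustration only; in practice these
# would come from the training loop's logs (e.g., a Keras History object).
train_loss = [0.90, 0.62, 0.48, 0.40, 0.35]
val_loss   = [0.95, 0.71, 0.58, 0.52, 0.50]
train_iou  = [0.30, 0.44, 0.53, 0.59, 0.63]
val_iou    = [0.27, 0.40, 0.47, 0.51, 0.53]
epochs = range(1, len(train_loss) + 1)

fig, (ax_loss, ax_iou) = plt.subplots(1, 2, figsize=(10, 4))

ax_loss.plot(epochs, train_loss, label="training")
ax_loss.plot(epochs, val_loss, label="validation")
ax_loss.set(xlabel="epoch", ylabel="loss", title="Loss")
ax_loss.legend()

ax_iou.plot(epochs, train_iou, label="training")
ax_iou.plot(epochs, val_iou, label="validation")
ax_iou.set(xlabel="epoch", ylabel="IoU", title="IoU")
ax_iou.legend()

fig.tight_layout()
fig.savefig("training_curves.png", dpi=150)
```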
Q2. In addition to specificity, sensitivity is also a critical indicator for comparing medical imaging models, but the authors have not reported sensitivity. I hope the authors can add a comparison of sensitivity.
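Under the same standard convention, sensitivity captures the complementary failure mode: a detector that finds nothing has no true positives, so its sensitivity collapses to zero even while its specificity is perfect, which is why both metrics need to be reported together.

```latex
\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
TP = 0 \;\Longrightarrow\; \mathrm{Sensitivity} = 0\%
```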
Q3. Did the authors conduct comparison and ablation experiments when constructing the MIME-ViT model (e.g., varying the CNN convolution kernel size and the specific scale ranges of the ViT)? If so, please report them.
Q4. DETR is a transformer-based detection model proposed in 2020, which is somewhat dated. Are there newer models relevant to your research subject? If so, I hope you will include a comparison to demonstrate that your research remains valuable.
No comment.
The English is acceptable.
Recent literature references are missing.
No comment.
The method of generating bounding boxes around the exterior of segmented patches for IoU calculation may have negatively impacted detection scores. This approach likely introduced excess background area, particularly for irregularly shaped lesions, potentially underestimating the model's true detection performance. Future work should explore alternative bounding box generation methods to provide a more accurate evaluation.
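To make the concern concrete, here is a minimal NumPy sketch (illustrative only, not the authors' evaluation code): for an elongated, diagonal lesion, IoU computed from enclosing bounding boxes comes out lower than IoU computed directly on the masks, because the boxes are dominated by background.

```python
import numpy as np

def bbox(mask):
    """Tight axis-aligned box (r0, r1, c0, c1), inclusive, around True pixels."""
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return r0, r1, c0, c1

def box_iou(a, b):
    """IoU of two inclusive (r0, r1, c0, c1) boxes."""
    r0, r1 = max(a[0], b[0]), min(a[1], b[1])
    c0, c1 = max(a[2], b[2]), min(a[3], b[3])
    inter = max(0, r1 - r0 + 1) * max(0, c1 - c0 + 1)
    area = lambda x: (x[1] - x[0] + 1) * (x[3] - x[2] + 1)
    return inter / (area(a) + area(b) - inter)

def mask_iou(a, b):
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

# Irregular (diagonal) lesion; the prediction recovers 15 of its 20 pixels.
gt = np.eye(20, dtype=bool)
pred = np.eye(20, dtype=bool)
pred[15:, 15:] = False                 # miss the last 5 diagonal pixels

print(f"mask IoU: {mask_iou(gt, pred):.2f}")             # 0.75
print(f"box  IoU: {box_iou(bbox(gt), bbox(pred)):.2f}")  # 0.56
```

Here recovering 75% of the lesion's pixels yields a mask IoU of 0.75 but a box IoU of only about 0.56, consistent with the underestimation described above.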
The manuscript presents a novel and well-executed study on breast cancer detection using the Multiscale Image Morphological Extraction Vision Transformer (MIME-ViT). The integration of Vision Transformers with CNNs to enhance mammographic imaging analysis is particularly noteworthy. The study is well-structured, precise, and clearly reported, and it adds significant value to the field of medical imaging. The authors’ effort in developing an advanced model for breast cancer detection is commendable.
However, I have the following suggestions for improvement:
1. While the study is well-researched, most of the cited references are relatively old, with limited citations from 2022, 2023, and 2024. To enhance the relevance and timeliness of the study, the authors are encouraged to incorporate more recent literature.
2. A parametric table summarizing the MIME-ViT model’s key attributes, together with a comparative analysis against other model variants, would strengthen the manuscript. This would give readers a clearer understanding of the model’s architectural choices, hyperparameters, and performance relative to existing methods. Small details, such as the batch size and the patience used for the early-stopping callback to prevent overfitting, could also be added (see the sketch after this list).
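For illustration only, this is the kind of detail the table or text could pin down, shown as a sketch assuming a Keras-style training setup (the values and setup are hypothetical, not the authors' actual configuration):

```python
import tensorflow as tf

# Hypothetical values for illustration; the manuscript should report the
# actual batch size and early-stopping patience used for MIME-ViT.
BATCH_SIZE = 16

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # quantity watched for improvement
    patience=10,                # epochs without improvement before stopping
    restore_best_weights=True,  # roll back to the best validation epoch
)

# model.fit(train_ds, validation_data=val_ds,
#           batch_size=BATCH_SIZE, epochs=200,
#           callbacks=[early_stopping])
```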
Overall, this study represents a significant advancement in the application of deep learning for breast cancer detection. Addressing these points would further enhance its impact and comprehensiveness.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.