Reviewers are satisfied with the revisions, and I recommend accepting this manuscript.
[# PeerJ Staff Note - this decision was reviewed and approved by Claudio Ardagna, a PeerJ Section Editor covering this Section #]
All my comments have been addressed by the authors; the manuscript is now suitable for publication in its current format.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
Summary of the Study
This literature review synthesizes recent advancements in instruction tuning for Medical Large Language Models (Med-LLMs). It comprehensively analyzes three primary instruction dataset types—human-crafted, LLM-synthesized, and hybrid RAG-based—while evaluating thirteen instruction-tuned Med-LLMs, such as Med-PaLM, Aloe, Me-LLaMA, and ChatDoctor. The paper explores optimization strategies like phased instruction tuning, mixed prompt training, LoRA-based fine-tuning, and bias mitigation. Future research directions are discussed, with emphasis on privacy, standardization, and equitable deployment. The review is timely and addresses a key bottleneck in aligning LLMs for real-world medical applications.
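To make the LoRA-based fine-tuning strategy named above concrete for readers, a minimal sketch using the Hugging Face `transformers` and `peft` libraries follows; the base checkpoint, rank, and target modules are illustrative assumptions, not the settings of any model surveyed in the review.

```python
# Minimal LoRA instruction-tuning setup (sketch, illustrative hyperparameters).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model, not from the review
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA injects small trainable low-rank matrices into selected projections,
# leaving the original pretrained weights frozen.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```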
Major Comments
A major limitation is the lack of detail in the consolidated comparison Table 2, such as tuning strategy (e.g., LoRA, PEFT) and performance metrics (e.g., accuracy, F1, ROUGE). Adding these details would help readers better contextualize model capabilities across benchmarks like USMLE or MedQA.
The review highlights impressive performance from several models, yet lacks critical reflection on evaluation settings (e.g., held-out test sets, zero-shot vs. few-shot). There should be more discussion about reproducibility, standardization across Med-LLM evaluations, and risks of overfitting to specific benchmarks.
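For illustration, a minimal sketch of the zero-shot versus few-shot distinction for a MedQA-style multiple-choice item follows; the prompt wording and exemplars are invented placeholders. The point is only that the two settings can yield different scores for the same model, so the setting must be reported alongside each result.

```python
# Sketch: constructing zero-shot vs. few-shot prompts for a multiple-choice item.
def zero_shot(question: str, options: list[str]) -> str:
    # Render options as "A. ...", "B. ...", etc., ending with an answer cue.
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"Question: {question}\n{opts}\nAnswer:"

def few_shot(exemplars: list[tuple[str, list[str], str]],
             question: str, options: list[str]) -> str:
    # Prepend worked examples (question, options, gold letter) to the test item.
    shots = "\n\n".join(zero_shot(q, o) + f" {a}" for q, o, a in exemplars)
    return shots + "\n\n" + zero_shot(question, options)
```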
The performance of transformer models like BioBERT, ClinicalBERT, and PubMedBERT across various biomedical and clinical tasks should also be discussed.
The section on bias and fairness is appreciated, but ethical risks of instruction-tuned Med-LLMs (e.g., hallucination risks in clinical advice, regulatory challenges) deserve more detailed consideration and specific examples. Also, discussion of equity issues (e.g., underrepresentation of non-English or rare diseases in datasets) would add depth.
The manuscript is generally well-written, though there are occasional awkward phrasings (e.g., “magnet seed instructions” → “augment seed instructions”) and some typos (e.g., “instnace,” “paltforms,” “an mental health condition”). A round of proofreading or copyediting is recommended.
Table 1 (dataset comparison) and Table 2 (model summary) should be more detailed and consistent—consider adding metrics like dataset size, instruction type, and primary tasks per model.
Some references to unnamed studies or authors are cited as "Mario et al., nd"—please correct these to complete citations.
Fix inconsistencies in author affiliation formatting and references. Several citations (e.g., "Li et al., 1064") appear to contain placeholder errors.
Overall Recommendation
The manuscript provides a highly valuable and well-structured synthesis of the state of instruction tuning in Med-LLMs. Its strengths lie in its breadth and practical relevance. However, before publication, the authors should address issues of methodological transparency, improve benchmarking comparisons, and refine figures/tables. Enhancing ethical analysis and proofreading the manuscript for clarity and grammar will further improve its quality.
The article is written in mostly clear and professional English, though minor grammatical issues remain. It provides a well-structured and timely review with relevant literature and figures, but lacks raw data sharing and could better highlight foundational studies and research gaps. While it is within scope and cross-disciplinary, a clearer unique perspective or taxonomy would enhance its contribution.
The article fits well within the journal’s scope and follows the literature review format. While the methodology is generally rigorous, it lacks transparency and reproducibility details like a PRISMA diagram or search strings. The survey is well-organized and sources are cited appropriately, though it leans toward well-known models and misses deeper analysis of ethical and low-resource contexts.
The paper presents a cohesive argument that aligns well with the introduction and effectively ties instruction tuning to Med-LLM performance. While it encourages further research, the rationale for replicating its methodology is not clearly outlined. The conclusions are appropriately scoped and highlight key future directions, including bias mitigation and privacy safeguards.
4.1. Expanding the scope to include emerging or less-publicized models, especially those tailored for low-resource settings, would offer a more comprehensive perspective.
4.2. A deeper analysis of how instruction tuning affects model biases, particularly in sensitive clinical contexts, is essential. Incorporating studies that evaluate fairness and ethical considerations would strengthen the review's applicability.
4.3. Recent studies have explored the incorporation of emotional awareness in LLMs to improve response generation; see, for instance, Rasool et al. (https://www.mdpi.com/2673-2688/6/3/56). Including such approaches can highlight interdisciplinary advancements relevant to Med-LLMs.
4.4. Developing a taxonomy that categorizes instruction tuning methods based on factors like data sources, tuning techniques, and application domains can provide readers with a structured understanding of the field's landscape.
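As a hypothetical illustration of the taxonomy suggested in 4.4, a minimal sketch follows; the axes and the example entry are assumptions offered to show the intended structure, not claims about the surveyed models.

```python
# Sketch: a taxonomy record categorizing instruction-tuning approaches along
# the three axes suggested above. Entry values are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class InstructionTuningEntry:
    model: str
    data_source: str         # e.g., "human-crafted", "LLM-synthesized", "hybrid RAG"
    tuning_technique: str    # e.g., "full fine-tuning", "LoRA", "phased"
    application_domain: str  # e.g., "clinical QA", "patient dialogue", "summarization"

taxonomy = [
    InstructionTuningEntry("ExampleMedModel", "human-crafted + LLM-synthesized",
                           "LoRA", "clinical QA"),
]
```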
no comment
1. Given the global diversity of healthcare systems, it would be helpful to include a discussion of multilingual or cross-lingual Med-LLMs and how instruction tuning addresses translation ambiguities in clinical contexts.
2. The review lacks discussion of how instruction-tuned models perform over time as medical knowledge evolves.
3. How do models calibrate their confidence under verbose prompts, and what implications does this have for interpretability in safety-critical applications?
4. To enrich the manuscript, more related works could be reviewed, such as "Large Language Models in Genomics—A Perspective on Personalized Medicine."
5. It remains unclear how instruction complexity (e.g., single-step vs. multi-step prompts) correlates with semantic accuracy across LLMs.
6. Could the authors clarify whether all reported accuracy metrics are computed under equivalent conditions (e.g., prompt type, instruction format, sampling temperature)?
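As a hypothetical illustration of point 6, a minimal sketch of an evaluation manifest follows; all field names and values are assumptions, but reporting one such record per benchmark score would let readers verify that reported metrics are comparable.

```python
# Sketch: pinning the evaluation conditions behind a single reported accuracy.
EVAL_CONDITIONS = {
    "benchmark": "MedQA (USMLE)",                   # assumed example benchmark
    "prompt_style": "zero-shot",                    # vs. "5-shot"
    "instruction_format": "multiple-choice, lettered options",
    "sampling_temperature": 0.0,                    # greedy decoding
    "max_new_tokens": 8,
    "test_split": "held-out official test set",
}
```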