Reviewers are satisfied with the revisions, and I recommend accepting this manuscript.
[# PeerJ Staff Note - this decision was reviewed and approved by Claudio Ardagna, a PeerJ Section Editor covering this Section #]
All my comments have been addressed by the authors; the manuscript is now suitable for publication in its current format.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
Summary of the Study
This literature review synthesizes recent advancements in instruction tuning for Medical Large Language Models (Med-LLMs). It comprehensively analyzes three primary instruction dataset types—human-crafted, LLM-synthesized, and hybrid RAG-based—while evaluating thirteen instruction-tuned Med-LLMs, such as Med-PaLM, Aloe, Me-LLaMA, and ChatDoctor. The paper explores optimization strategies like phased instruction tuning, mixed prompt training, LoRA-based fine-tuning, and bias mitigation. Future research directions are discussed, with emphasis on privacy, standardization, and equitable deployment. The review is timely and addresses a key bottleneck in aligning LLMs for real-world medical applications.
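To make the LoRA-based fine-tuning strategy named above concrete for readers, a minimal sketch using the Hugging Face `transformers` and `peft` libraries follows; the base checkpoint, rank, and target modules are illustrative assumptions, not the settings of any model surveyed in the review.

```python
# Minimal LoRA instruction-tuning setup (sketch, illustrative hyperparameters).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model, not from the review
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA injects small trainable low-rank matrices into selected projections,
# leaving the original pretrained weights frozen.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```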
Major Comments
A major limitation is the lack of detail in the consolidated comparison Table 2, such as tuning strategy (e.g., LoRA, PEFT) and performance metrics (e.g., accuracy, F1, ROUGE). Adding these details would help readers better contextualize model capabilities across benchmarks like USMLE or MedQA.
The review highlights impressive performance from several models, yet lacks critical reflection on evaluation settings (e.g., held-out test sets, zero-shot vs. few-shot). There should be more discussion about reproducibility, standardization across Med-LLM evaluations, and risks of overfitting to specific benchmarks.
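For illustration, a minimal sketch of the zero-shot versus few-shot distinction for a MedQA-style multiple-choice item follows; the prompt wording and exemplars are invented placeholders. The point is only that the two settings can yield different scores for the same model, so the setting must be reported alongside each result.

```python
# Sketch: constructing zero-shot vs. few-shot prompts for a multiple-choice item.
def zero_shot(question: str, options: list[str]) -> str:
    # Render options as "A. ...", "B. ...", etc., ending with an answer cue.
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"Question: {question}\n{opts}\nAnswer:"

def few_shot(exemplars: list[tuple[str, list[str], str]],
             question: str, options: list[str]) -> str:
    # Prepend worked examples (question, options, gold letter) to the test item.
    shots = "\n\n".join(zero_shot(q, o) + f" {a}" for q, o, a in exemplars)
    return shots + "\n\n" + zero_shot(question, options)
```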
The performance of transformer models like BioBERT, ClinicalBERT, and PubMedBERT across various biomedical and clinical tasks should also be discussed.
The section on bias and fairness is appreciated, but ethical risks of instruction-tuned Med-LLMs (e.g., hallucination risks in clinical advice, regulatory challenges) deserve more detailed consideration and specific examples. Also, discussion of equity issues (e.g., underrepresentation of non-English or rare diseases in datasets) would add depth.
The manuscript is generally well-written, though there are occasional awkward phrasings (e.g., “magnet seed instructions” → “augment seed instructions”) and some typos (e.g., “instnace,” “paltforms,” “an mental health condition”). A round of proofreading or copyediting is recommended.
Table 1 (dataset comparison) and Table 2 (model summary) should be more detailed and consistent—consider adding metrics like dataset size, instruction type, and primary tasks per model.
Some references to unnamed studies or authors are cited as "Mario et al., nd"—please correct these to complete citations.
Fix inconsistencies in author affiliation formatting and references. Several citations (e.g., "Li et al., 1064") appear to contain placeholder errors.
Overall Recommendation
The manuscript provides a highly valuable and well-structured synthesis of the state of instruction tuning in Med-LLMs. Its strengths lie in its breadth and practical relevance. However, before publication, the authors should address issues of methodological transparency, improve benchmarking comparisons, and refine figures/tables. Enhancing ethical analysis and proofreading the manuscript for clarity and grammar will further improve its quality.
The article is written in mostly clear and professional English, though minor grammatical issues remain. It provides a well-structured and timely review with relevant literature and figures, but lacks raw data sharing and could better highlight foundational studies and research gaps. While it is within scope and cross-disciplinary, a clearer unique perspective or taxonomy would enhance its contribution.
The article fits well within the journal’s scope and follows the literature review format. While the methodology is generally rigorous, it lacks transparency and reproducibility details like a PRISMA diagram or search strings. The survey is well-organized and sources are cited appropriately, though it leans toward well-known models and misses deeper analysis of ethical and low-resource contexts.
The paper presents a cohesive argument that aligns well with the introduction and effectively ties instruction tuning to Med-LLM performance. While it encourages further research, the rationale for replicating its methodology is not clearly outlined. The conclusions are appropriately scoped and highlight key future directions, including bias mitigation and privacy safeguards.
4.1. Expanding the scope to include emerging or less-publicized models, especially those tailored for low-resource settings, would offer a more comprehensive perspective.
4.2. A deeper analysis of how instruction tuning affects model biases, particularly in sensitive clinical contexts, is essential. Incorporating studies that evaluate fairness and ethical considerations would strengthen the review's applicability.
4.3. Recent studies have explored the incorporation of emotional awareness in LLMs to improve response generation; see, for instance, Rasool et al. (https://www.mdpi.com/2673-2688/6/3/56). Including such approaches can highlight interdisciplinary advancements relevant to Med-LLMs.
4.4. Developing a taxonomy that categorizes instruction tuning methods based on factors like data sources, tuning techniques, and application domains can provide readers with a structured understanding of the field's landscape.
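As a hypothetical illustration of the taxonomy suggested in 4.4, a minimal sketch follows; the axes and the example entry are assumptions offered to show the intended structure, not claims about the surveyed models.

```python
# Sketch: a taxonomy record categorizing instruction-tuning approaches along
# the three axes suggested above. Entry values are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class InstructionTuningEntry:
    model: str
    data_source: str         # e.g., "human-crafted", "LLM-synthesized", "hybrid RAG"
    tuning_technique: str    # e.g., "full fine-tuning", "LoRA", "phased"
    application_domain: str  # e.g., "clinical QA", "patient dialogue", "summarization"

taxonomy = [
    InstructionTuningEntry("ExampleMedModel", "human-crafted + LLM-synthesized",
                           "LoRA", "clinical QA"),
]
```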
no comment
1. Given the global diversity of healthcare systems, it would be helpful to include a discussion of multilingual or cross-lingual Med-LLMs and how instruction tuning addresses translation ambiguities in clinical contexts.
2. The review lacks discussion of how instruction-tuned models perform over time as medical knowledge evolves.
3. How do models calibrate their confidence under verbose prompts, and what implications does this have for interpretability in safety-critical applications?
4. To enrich the manuscript, more related works could be reviewed, such as "Large Language Models in Genomics—A Perspective on Personalized Medicine."
5. It remains unclear how instruction complexity (e.g., single-step vs. multi-step prompts) correlates with semantic accuracy across LLMs.
6. Could the authors clarify whether all reported accuracy metrics are computed under equivalent conditions (e.g., prompt type, instruction format, sampling temperature)?
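As a hypothetical illustration of point 6, a minimal sketch of an evaluation manifest follows; all field names and values are assumptions, but reporting one such record per benchmark score would let readers verify that reported metrics are comparable.

```python
# Sketch: pinning the evaluation conditions behind a single reported accuracy.
EVAL_CONDITIONS = {
    "benchmark": "MedQA (USMLE)",                   # assumed example benchmark
    "prompt_style": "zero-shot",                    # vs. "5-shot"
    "instruction_format": "multiple-choice, lettered options",
    "sampling_temperature": 0.0,                    # greedy decoding
    "max_new_tokens": 8,
    "test_split": "held-out official test set",
}
```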