All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Dear Authors,
The reviewers think that your paper is sufficiently improved and can be accepted. However, the minor edits requested by Reviewer 1 must be addressed before the production process begins. Furthermore, the keywords in the Abstract section should be listed in alphabetical order. Many of the equations are part of the surrounding sentences; attention is needed to ensure correct sentence formation, and each equation should be referenced by its correct equation number. Please do not use “defined as”, “as follows”, etc. The explanations of the equations should also be checked. All variables should be written in italics, as in the equations, and their definitions and bounds should be given before the production stage.
Best wishes
[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]
The updated version adequately addresses concerns that I raised in the first review.
A couple of statements that could be amended:
The description of LoRA:
Line 48: "employs low-rank matrices (A and B) into the weight matrices of the target layers for fine-tuning"
and then below
Lines 180-182: "Another common method involves fine-tuning only a subset of the model’s parameters. LoRA achieves this by injecting low-rank matrices, reducing the number of parameters that need to be updated."
In line 48, "employs" was probably meant to be "injects", as in lines 180-182. In both cases, however, it is unclear what actually happens (some model weights are frozen, while the update to the others is represented as a product of two low-rank matrices).
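For concreteness, a minimal sketch of the mechanism being described (the base weight matrix W stays frozen, only the low-rank factors A and B are trained, and the effective update is the product BA); this is a PyTorch-style illustration with hypothetical names, not the manuscript's code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-augmented linear layer: frozen W plus a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # W is frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # trainable
        self.B = nn.Parameter(torch.zeros(d_out, rank))          # trainable, initialized to zero
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying W + scaling * (B @ A) to x.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```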
Figure 1. Structural flow diagram of different efficient parameter fine-tuning methods
Parameter efficient?
Line 309: "the dataset spans six major medical departments"
Does "departments" stand for "domains"?
Lines 512-513: "It is also worth noting that Prompt-Tuning performs comparably to more structurally complex methods such as LoRA"
LoRA may actually be considered less structurally complex from the user's point of view, as it demands no modifications of either the input or the model.
Reporting is clear, and the abstract is much improved.
The experimental design is clear and described in full.
The findings are described in full, and the full prompt text is now included.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
**Language Note:** The review process has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff
The quality of basic reporting is acceptable. The language is clear and professional, the background is presented adequately, the paper is self-contained, uses high-quality figures and formulas, where appropriate, and presents the algorithm, which forms the core of the paper, both in plain English and in pseudo-code.
The paper proposes a new version of lightweight fine-tuning of large language models. The authors argue that their method improves the performance of models on downstream tasks while providing high training and inference efficiency. This is all perfectly valid, although the exact phrasing of the motivation (‘Although [Prompt-Tuning] works well for many tasks, further improvements are needed to handle long inputs effectively and improve the model’s efficiency’) is questionable, since in the end it is not shown how the proposed method helps with these particular issues: the conclusions refer exclusively to the fact that the new model has better performance.
The proposed approach makes intuitive sense: taking prompt tuning as the starting point, a single sequence of N tokens prepended to the input of a frozen decoder-only model (only their embeddings are trained) is replaced with a batch of such sequences, whose embeddings are averaged to produce the eventual continuous prefix, capturing task-specific information. This increases the number of trainable parameters: while in standard prompt tuning we are training an (N, d) matrix, where d is the embedding dimension of the target model, now we are training an (N, d, M) tensor, where M is the number of individual prompts to be averaged. An increased number of parameters naturally leads to improved performance, with M=2 outperforming M=1 (i.e., classic prompt tuning), M=4 outperforming M=2, and M=8 being superior to M=4.
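As a reading aid, here is a minimal sketch of the mechanism as summarized above: M trainable prompt sequences of length N whose embeddings are averaged into a single continuous prefix that is prepended to the frozen model's input embeddings. It is a PyTorch-style illustration with hypothetical names, not the authors' implementation:

```python
import torch
import torch.nn as nn

class AveragedSoftPrompt(nn.Module):
    """Sketch: M soft prompts of length N are averaged into one continuous prefix."""
    def __init__(self, n_tokens: int, embed_dim: int, n_prompts: int):
        super().__init__()
        # A trainable (M, N, d) tensor instead of the single (N, d) matrix of plain prompt tuning.
        self.prompts = nn.Parameter(torch.randn(n_prompts, n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        prefix = self.prompts.mean(dim=0)                                  # (N, d)
        prefix = prefix.unsqueeze(0).expand(input_embeds.size(0), -1, -1)  # (B, N, d)
        return torch.cat([prefix, input_embeds], dim=1)                    # (B, N + L, d)

# Usage with a frozen decoder-only model: only `soft_prompt.prompts` is optimized.
# soft_prompt = AveragedSoftPrompt(n_tokens=8, embed_dim=4096, n_prompts=8)
# inputs_embeds = soft_prompt(model.get_input_embeddings()(input_ids))
# outputs = model(inputs_embeds=inputs_embeds)
```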
It may be pointed out that another option to increase the number of trainable parameters would be to grow N while keeping M=1, and the authors demonstrate (albeit only in one experiment, where both N and M are varied) that this works slightly worse. Thus, one model has the same performance with N=8, M=8 as with N=64, M=1, but in most cases a better performance is achieved by increasing M, which is, of course, more computationally efficient, since parallelising the tunable prompt does not increase the sequence length, with its concomitant quadratic self-attention complexity.
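To make the computational side of this explicit (a back-of-the-envelope estimate, not a figure from the manuscript): per layer, self-attention over an input of length $L$ with an $N$-token prefix costs roughly

$$\mathcal{O}\big((N + L)^2 \, d\big),$$

so growing the prefix from $N=8$ to $N=64$ lengthens every forward pass, whereas growing $M$ only adds an averaging step over $M$ tables of shape $(N, d)$, i.e. $\mathcal{O}(MNd)$, and leaves the attended sequence length unchanged.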
Overall the benefits of the proposed method seem to be well supported by the data. However, the selected downstream tasks look a bit ad hoc: a rather specialised dataset (Chinese Medical Dialogue Data) is used as the main test set, and it is accompanied by four tasks from the GLUE benchmark. This testing benchmark is not state of the art (cf., e.g., the tests from the HELM project: https://crfm.stanford.edu/helm/instruct/latest), which weakens the authors' argument.
Regarding availability of the data, the GLUE benchmark is public. As for the Chinese Medical Dialogue Data, this dataset seems to be public as well, but the paper does not seem to provide a reference.
Some of the statements about the proposed method's benefits could be toned down. Cf. "This highlights that CE-Prompt not only effectively leverages the strengths of Prompt-Tuning but also enhances the model’s generative capabilities through its design, maintaining adaptability and effectiveness across different Prompt-Length configurations" or "Through comprehensive experiments, we demonstrated that CE-Prompt significantly improves the stability and efficiency of pre-trained language models across various natural language processing tasks" --- I would say that such statements should be backed by more experiments, preferably on more recent benchmarks.
The reporting is clear throughout. The English in the abstract is actually less clear than the English in the manuscript itself. The abstract also opens with technical details of the initialization of the CE-prompt method rather than a general overview of the paper. This should be changed.
The approach the authors propose---to leverage multiple soft prompts---makes sense to me. This means the prompts potentially encode multiple parts of a given prompt tuning template. The experiments the authors conduct are appropriate for the question at hand. And the proposed method outperforms the basic prompt tuning benchmark on all of these.
The findings are obviously relevant for understanding computationally efficient ways of finetuning large language models. The underlying data are provided. I would have liked more clarity on the exact prompts they used for the tuning approach. The training data and full prompt (I think) are provided in the Appendix, but the labelling of these is not very clear.
The manuscript proposes CE-Prompt, a novel method for improving the stability and efficiency of prompt-tuning in large language models by introducing multiple embedding layers. The study aims to address limitations in existing initialization strategies, particularly in long-text and small-task scenarios.
The manuscript is written in clear language, with appropriate use of technical terminology. The introduction and related work sections provide a comprehensive overview of the field, citing relevant and up-to-date literature. The structure adheres to academic standards, with logical organization and effective use of figures and tables. Mathematical formulations are clearly defined, and the inclusion of algorithmic pseudocode enhances reproducibility. However, the manuscript would benefit from minor grammatical refinements and a clearer statement regarding the availability of code and raw data to support full reproducibility.
Although the algorithmic formulation is mathematically sound, the manuscript lacks intuitive visualizations or concrete examples to help readers better understand the mechanism of CE-Prompt. This may limit accessibility for a broader audience.
The manuscript presents several issues related to clarity and formatting that may hinder reader comprehension. First, it introduces technical terms such as “composite embedding,” “multi-embedding initialization,” and “prompt fusion” (e.g., lines 12–20 and 163–183) without sufficient explanation or illustrative examples. These concepts are central to the proposed method but are not accompanied by intuitive visualizations or analogies, which may limit accessibility for readers unfamiliar with prompt engineering.
Second, the formatting and integration of figures and tables are inconsistent. For instance, Figure 1 is referenced in line 29 but is not clearly described or discussed in the surrounding text. Similarly, Figure 4 (line 298) is briefly mentioned, yet lacks a detailed caption or interpretation, reducing its explanatory value.
Third, the Related Work section (lines 71–124) contains redundant descriptions of methods such as Adapter-Tuning, LoRA, and Prefix-Tuning, which are also discussed in the Introduction (lines 25–47). This repetition could be streamlined to improve conciseness and focus.
Fourth, the mathematical formulation in the Methods section (lines 167–209) is dense and presented without visual aids. Equations (1) through (8) and Algorithm 1 are described in detail, but the lack of diagrams or flowcharts makes it difficult to follow the logic of the CE-Prompt mechanism, especially for readers less familiar with formal notation.
Fifth, the Results section (lines 298–344) includes long, uninterrupted paragraphs that describe multiple experiments and datasets without intermediate summaries or subheadings. This affects readability and makes it challenging to extract key findings efficiently.
Finally, while Tables 1–3 provide detailed numerical results, the manuscript lacks visual comparisons such as bar charts or line graphs that could help readers quickly interpret performance trends across different configurations.
The study presents original research within the journal’s scope, addressing a well-defined and relevant problem in prompt-tuning. The proposed CE-Prompt method introduces a novel multi-embedding initialization strategy to improve stability and inference efficiency. The experimental design is rigorous, employing two strong baseline models (LLaMA3-8B-Instruct and Qwen2.5-7B-Instruct) and evaluating performance across both domain-specific (medical) and general NLP tasks (GLUE benchmark). Evaluation metrics (BLEU-4, ROUGE-L) are appropriate and well-justified. The methodology is described in sufficient detail to allow replication, although public access to implementation code would further strengthen reproducibility.
Although the experimental results indicate that CE-Prompt outperforms traditional Prompt-Tuning, the manuscript does not include statistical significance testing (e.g., p-values, confidence intervals, or standard deviations). This omission weakens the reliability of the reported improvements and limits the reader’s ability to assess the robustness of the findings.
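One lightweight way to address this would be to repeat each configuration over several random seeds and report both dispersion and a paired significance test. A minimal sketch of such an analysis (the scores below are hypothetical placeholders, not results from the paper):

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed scores (e.g., BLEU-4 over 5 random seeds) for the two methods.
ce_prompt     = np.array([23.1, 22.8, 23.4, 23.0, 22.9])
prompt_tuning = np.array([21.9, 22.3, 21.7, 22.1, 22.0])

# Mean and sample standard deviation for the results tables.
print(f"CE-Prompt:     {ce_prompt.mean():.2f} +/- {ce_prompt.std(ddof=1):.2f}")
print(f"Prompt-Tuning: {prompt_tuning.mean():.2f} +/- {prompt_tuning.std(ddof=1):.2f}")

# Paired t-test across seeds (paired because both methods share seeds and data splits).
t_stat, p_value = stats.ttest_rel(ce_prompt, prompt_tuning)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```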
In addition, CE-Prompt introduces multiple embedding layers, which may increase computational overhead. However, the manuscript does not provide a quantitative analysis of training time, memory usage, or inference latency. Such analysis is essential to assess the practical feasibility of deploying the method in real-world applications.
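Such a profile can be collected with a short harness; the sketch below assumes a CUDA device and a Hugging Face-style `generate` interface, and every name in it is illustrative rather than taken from the manuscript:

```python
import time
import torch

def profile_generation(model, inputs, n_runs: int = 20):
    """Report mean per-call latency and peak GPU memory for a generation call (sketch)."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=64)
    torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / n_runs * 1000
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"mean latency: {latency_ms:.1f} ms, peak memory: {peak_mem_gb:.2f} GB")
```

Running the same measurement with and without CE-Prompt's extra embedding layers would quantify the overhead in question.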
The findings are supported by robust experimental results. The CE-Prompt method consistently outperforms traditional Prompt-Tuning across multiple tasks and models. The results are statistically sound and demonstrate the method’s effectiveness in enhancing both stability and performance, particularly in small-task and long-text scenarios. Conclusions are well-aligned with the research objectives and are appropriately limited to the scope of the presented results. The study encourages replication and provides a solid foundation for future work in prompt optimization.
The manuscript does not address potential failure cases or limitations of CE-Prompt, such as its performance on noisy data, low-resource languages, or tasks with limited structure. A critical discussion of these aspects would provide a more balanced perspective and guide future research directions.
Overall, the manuscript presents a timely and relevant contribution to the field of parameter-efficient fine-tuning. The proposed method is original and supported by comprehensive experiments. However, certain aspects require clarification and improvement to enhance the manuscript’s rigor and reproducibility.