Many thanks to the authors for their efforts to improve the work. The current version can be accepted. Congrats!
[# PeerJ Staff Note - this decision was reviewed and approved by Jyotismita Chaki, a PeerJ Section Editor covering this Section #]
This version looks good, thanks for addressing review comments.
Thanks for the revision; the changes and answers are sufficient.
**PeerJ Staff Note:** Please ensure that all review and editorial comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
**Language Note:** The review process has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff
The paper is written in clear, professional English. However, there are typographical errors that disrupt the flow, notably missing spaces between words and citations, e.g. "problemChen et al. (2017)" and "techniquesJu et al. (2023)". A proofreading pass would fix these.
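As an aid, a quick heuristic sketch in Python (mine, not the authors') can locate this class of typo automatically; the pattern looks for a lowercase letter glued to a capitalized surname followed by "et al." or a parenthesized year:

```python
import re

# Flag a lowercase letter fused directly to a capitalized surname that is
# followed by "et al." or a parenthesized year, e.g. "problemChen et al. (2017)".
pattern = re.compile(r"[a-z](?=[A-Z][a-z]+(?: et al\.| \(\d{4}\)))")

text = "This is a known problemChen et al. (2017) in the field."
for m in pattern.finditer(text):
    print(text[max(0, m.start() - 10):m.end() + 20])  # print context around each hit
```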
Lines 29-34. The authors refer to "recent years" but cite works from as early as 2006 and 2017, which is inappropriate in 2025.
Lines 38-45. The authors group a large block of unrelated papers after a general statement about deep learning techniques without explaining the specific contributions of each work. Then they incorrectly introduce the SpeakerBeam model (Zmolikova et al., 2017) using "among them," although it was not part of the previously cited group.
Line 47. The TCN (temporal convolutional network) abbreviation is never expanded.
Lines 50–73 are poorly organized and mix background, proposed ideas, and experimental setup without clear structure. The description of TD-ConvNeXt and Spk block is confusing, with insufficient explanation of how they differ from existing models.
How does the Spk block differ from simply applying an SE (squeeze-and-excitation) module to ConvNeXt?
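For concreteness, this is roughly what "simply applying an SE module" would mean; a minimal PyTorch sketch, assuming 1-D (batch, channels, time) features, with all names illustrative:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation over 1-D features (batch, channels, time)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        s = x.mean(dim=-1)          # squeeze: global average over time
        w = self.fc(s)              # excite: per-channel gates in (0, 1)
        return x * w.unsqueeze(-1)  # re-weight channels

x = torch.randn(4, 256, 100)        # dummy input
print(SEBlock(256)(x).shape)        # torch.Size([4, 256, 100])
```

If the Spk block's gating is conditioned on the reference speaker embedding rather than on the mixture features alone, stating that explicitly would answer the question.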
Contribution 1: "We incorporated a reference speech encoder" is vague. The idea of using auxiliary reference encoders already exists in previous models such as SpEx+. Why is their design different or better?
Contribution 4: "We use multi-task learning" is also vague. Multi-task learning in speaker extraction (speaker ID + mask estimation) is already known. Why is the authors' suggested multi-task setup novel?
The paper should cleanly separate what is truly new from what is standard.
Lines 107-108. The paper cites Luo and Mesgarani (2019) as the basis for their architecture design, but this key prior work was not discussed in the literature review.
The authors do not explicitly discuss the availability of data sets or trained models for replication.
In the experimental part, not all new components are isolated and tested.
The experimental section lacks a systematic ablation of all newly introduced components. While TD-ConvNeXt is partially tested, other key elements such as the Spk block, the multi-task learning setup, and the time-frequency fusion in the reference encoder, are not isolated and evaluated individually.
The results presented in the paper show consistent improvements over baseline models on objective measures such as SI-SDR, PESQ, and STOI. However, statistical significance tests are not reported.
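To make this concrete, a paired t-test over per-utterance SI-SDR gains would suffice; a hedged sketch with placeholder scores (not the paper's data):

```python
import numpy as np
from scipy import stats

# Placeholder per-utterance SI-SDR scores (dB) -- NOT data from the paper.
rng = np.random.default_rng(0)
baseline = rng.normal(12.0, 2.0, size=200)
proposed = baseline + rng.normal(0.8, 0.5, size=200)

diff = proposed - baseline                  # paired per-utterance gains
t, p = stats.ttest_rel(proposed, baseline)  # paired t-test
ci = stats.t.interval(0.95, len(diff) - 1, loc=diff.mean(), scale=stats.sem(diff))
print(f"mean gain {diff.mean():.2f} dB, t = {t:.2f}, p = {p:.2e}, 95% CI = {ci}")
```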
Clear English and logical flow, but needs improvement in mathematical clarity and reproducibility.
Figures and tables are relevant, but more illustrations (e.g., spectrograms, architecture blocks) would help.
Dataset and loss design are suitable, though training details are underreported.
Lack of statistical significance analysis weakens empirical claims.
Results support conclusions, but failure case analysis and real-world generalizability are not addressed.
This is a nice article. It goes beyond traditional signal-based losses (like SDR) by incorporating perceptual and spectral losses that align better with human hearing. By building upon the widely recognized Sepformer architecture and showing consistent improvements, the authors demonstrate both practical relevance and methodological rigor. The article addresses a specific yet impactful challenge in single-channel speaker extraction. Thank you so much for the informative article.
Refer below for peer review comments to improve readability and reach.
Abstract:
1. The abstract is generally clear but lacks a precise quantification of the performance gain.
2. Can you emphasize the key novelty clearly in the abstract and summarize the results numerically?
Introduction:
1. The introduction presents the general SE problem but could more clearly define the unique challenges of single-channel speaker extraction and how human speech properties are relevant.
2. Add a stronger statement explicitly highlighting what prior works (e.g., VoiceFilter, SpeakerBeam) lack, such as a focus on human auditory modeling.
3. Clarify whether the target use case is real-time speech systems, offline enhancement, or forensic applications. This will help position the contribution better.
Related Work:
1. The section describes prior models but lacks a summarized table comparing architecture, supervision method, speaker embedding, and domain focus.
2. The literature on perceptual losses (e.g., PESQNet, MOSNet) is not discussed. Can you incorporate a paragraph on speech-aware objective functions?
Methodology:
1. The spectral and perceptual loss functions are not formally defined. Can you introduce mathematical expressions for each and clearly specify how they are combined? (A hedged sketch of one common combination appears after this list.)
2. Can you clarify whether the new loss functions were used during full training or only during fine-tuning? Also, describe how the loss weights were determined (manual search, grid search?).
3. Can you include a block diagram of the modified Sepformer pipeline with labeled modules showing where the new loss components are applied?
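To illustrate item 1: a minimal PyTorch sketch of one common way such terms are combined, L_total = L_SI-SDR + λ·L_spec. The spectral term (L1 on STFT magnitudes) and the weight λ are my illustrative assumptions, not the paper's actual choices:

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR (dB), averaged over the batch."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Scale-invariant target: project the estimate onto the reference.
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    si_sdr = 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

def spectral_loss(est, ref, n_fft=512, hop=128):
    """L1 distance between STFT magnitudes (one possible spectral term)."""
    win = torch.hann_window(n_fft, device=est.device)
    E = torch.stft(est, n_fft, hop, window=win, return_complex=True).abs()
    R = torch.stft(ref, n_fft, hop, window=win, return_complex=True).abs()
    return (E - R).abs().mean()

def total_loss(est, ref, lam=0.1):
    # lam is a hypothetical weight; the paper should state how it was chosen.
    return si_sdr_loss(est, ref) + lam * spectral_loss(est, ref)
```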
Experimental Setup:
1. Provide more details about the LibriMix split: how many speakers per subset, speaker overlap, noise types used, and signal-to-noise ratios.
2. It is unclear whether the speaker embeddings were fixed or updated during training. Describe the embedding source (e.g., ECAPA-TDNN), the duration of the reference utterance, and the dimensionality (see the sketch after this list).
3. Can you add the training batch size, optimizer, learning rate, number of epochs, hardware (e.g., GPU model), and memory usage to enhance reproducibility?
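As an illustration of item 2, this is one way a reference embedding could be obtained from SpeechBrain's public pretrained ECAPA-TDNN model (file name hypothetical); whichever encoder, reference duration, and dimensionality were actually used should be stated, along with whether the encoder was frozen:

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer releases

# SpeechBrain's public ECAPA-TDNN speaker-verification model.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

signal, sr = torchaudio.load("reference_utterance.wav")  # hypothetical 16 kHz mono file
embedding = encoder.encode_batch(signal)                 # shape (1, 1, 192) for this model
print(embedding.shape)
```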
Evaluation Metrics:
1. Can you report the standard deviation across 3–5 runs and apply t-tests or confidence intervals to make the claims statistically rigorous?
2. Can you include per-gender or per-age-group performance to test model robustness across diverse speaker characteristics?
Results:
1. Can you add a “relative improvement (%)” column and highlight statistically significant improvements in bold?
2. Can you include spectrogram plots showing model output vs. ground truth for both successful and failed cases to visually confirm speech quality retention?
Discussion and Insights:
1. Can you add an error-case analysis showing where the model fails (e.g., low SNR, same-gender overlaps)?
2. Why does perceptual loss improve over SDR alone? Can you elaborate on the connection between human auditory perception and loss-function geometry?
3. Can you discuss whether the improved speech quality comes at the cost of increased training time or complexity?
Conclusion:
1. The conclusion overstates generalization: claims about applicability should be moderated unless real-world tests (beyond LibriMix) are shown.
2. Include 2–3 directions for future work: multi-microphone extension, subjective evaluation, application to real-world audio (e.g., meeting transcription).
All comments have been added in detail to the last section.
Review Report for PeerJ Computer Science
(Addressing human speech characteristics in single-channel speaker extraction networks)
1. In this study, a TD-ConvNeXt-based model for single-channel speaker extraction is proposed: the ConvNeXt architecture is enhanced with temporal and spectral features and integrated with a TCN block, along with an auxiliary Spk block that learns speaker identity. The multi-task learning framework used for training leads to improved performance compared to previous methods.
2. In the introduction, what speaker extraction is, its development with deep learning, and the importance of the subject are sufficiently covered. In addition, the main contributions of the study are clearly stated at the end of this section.
3. The related works section definitely needs more detail. First, for the sake of paper integrity, related works should be placed in the section immediately after the introduction, not near the end. At present the section only mentions the speech-extraction literature in general terms. It is suggested to add a literature table with columns such as "pros and cons, method, originality, result" at the end of this section, expressing more emphatically the importance of this study, which deficiencies in the literature it eliminates, and its originality (a possible skeleton is sketched at the end of this review).
4. WHAM and LibriSpeech are used as datasets in the study. Although many similar datasets exist in the literature for this problem, it should be stated in more detail why these particular datasets were chosen.
5. Regarding the experiment environment, the types and values of the hyperparameters used in the study should be detailed and tabulated. It should also be stated whether different optimizers, learning rates, etc. were tried and how these parameters were determined.
6. The speaker extraction network block diagram clearly conveys the study. When the details of the proposed TD-ConvNeXt network model are examined, a certain level of originality is observed, especially in the Speech Extractor and Speaker Encoder sections.
7. The results obtained with the proposed models and the comparison with prior work clearly demonstrate the quality of the study and show its potential to make a very important contribution to the literature.
In conclusion, with the model it proposes, this study can contribute to the single-channel speaker extraction literature, but all the points mentioned above must be addressed completely.
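Regarding point 3 above, the suggested literature table might be skeletonized as follows (column names taken from the suggestion; all entries are placeholders):

| Study | Method | Pros | Cons | Originality | Result |
| --- | --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... | ... |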