All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Based on the reviewers’ feedback, I am pleased to inform you that your manuscript has been accepted for publication. Both reviewers confirmed that the revision was detailed and careful and that all of their concerns were fully addressed. Congratulations on this outcome, and thank you for your thorough efforts in revising the paper.
[# PeerJ Staff Note - this decision was reviewed and approved by Shawn Gomez, a PeerJ Section Editor covering this Section #]
The authors have provided a detailed and careful revision. All of my concerns have been fully addressed.
no comment
Thank you for your revision. The reviewers agree that the manuscript has improved substantially. However, before acceptance, some minor revisions are required. Specifically, please (1) strengthen the comparison with recent state-of-the-art methods on the MELD dataset, as suggested by Reviewer 1, and (2) revise and merge Tables 1 and 2 in line with Reviewer 3’s comments.
We look forward to receiving your revised manuscript.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
The general comments raised in the previous version have been largely addressed. However, the primary concern regarding comparison with state-of-the-art (SOTA) techniques remains inadequately addressed.
Although the authors have now considered a more naturalistic dataset (MELD), the manuscript still lacks a proper comparison with existing published methods for speech emotion recognition (SER) on MELD. The baseline models selected, while varied and including deep learning approaches, are not taken from recent peer-reviewed publications on SER using MELD.
Specifically, Table 6 does not include any comparisons with published models from the literature that utilize the MELD dataset. For instance, the work published in Information (2024) reports an accuracy of 94% using audio modality alone on MELD (https://www.mdpi.com/2078-2489/16/7/518). Ideally, comparison with SOTA methods should be presented in a tabular format, where each row cites a peer-reviewed publication on SER using MELD, along with the corresponding accuracy metrics. The final row(s) should present the results of the proposed method under the same evaluation protocol.
Moreover, the recent survey published in Artificial Intelligence Review (2025) offers a comprehensive overview of SOTA methods for SER on MELD and can be used to identify relevant techniques specifically using the speech modality (https://link.springer.com/article/10.1007/s10462-025-11197-8). Additionally, the preprint available at https://arxiv.org/abs/2312.15185 presents another competitive state-of-the-art result that should be considered for inclusion in the comparison.
In summary, for the paper to convincingly position itself within the current state of research, it is essential to include a more robust and up-to-date comparison with published state-of-the-art methods on the MELD dataset, especially those focusing on the speech modality.
The authors have provided a detailed and careful revision. All of my concerns have been fully addressed.
The authors have addressed most of the concerns effectively, but a few issues still need to be resolved.
1) Tables 1 and 2 should be merged into a single table with columns such as Model, Features, Advantages, and Disadvantages; the Databases and Results columns should be removed.
2) Table 2 currently appears to run beyond the page margins and needs to be reformatted.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors agree that they are relevant and useful.
**Language Note:** PeerJ staff have identified that the English language needs to be improved. When you prepare your next revision, please either (i) have a colleague who is proficient in English and familiar with the subject matter review your manuscript, or (ii) contact a professional editing service to review your manuscript. PeerJ can provide language editing services - you can contact us at [email protected] for pricing (be sure to provide your manuscript number and title). – PeerJ Staff
Experimental design is fairly detailed, and the source code has also been shared to replicate the experiments.
The RAVDESS Emotional Speech Audio dataset is a scripted dataset. In my opinion, while it is good for training baseline models (since it is balanced and has high-quality audio), it does not mimic real-life scenarios. Models trained first on acted datasets like RAVDESS must be evaluated or fine-tuned on natural datasets (like IEMOCAP, MELD) for better real-world applicability.
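For illustration, a minimal PyTorch sketch of the kind of transfer protocol suggested here, in which an encoder trained on an acted corpus is frozen and only the classification head is adapted to a naturalistic corpus. The model class, feature dimension, checkpoint path, and batch below are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the suggested transfer protocol: an encoder trained on an
# acted corpus (e.g. RAVDESS) is frozen and only the classifier head is re-trained
# on a naturalistic corpus (e.g. MELD or IEMOCAP). All names below are placeholders.
import torch
import torch.nn as nn

class SERModel(nn.Module):
    """Stand-in for an utterance-level SER model (encoder + classification head)."""
    def __init__(self, feat_dim: int = 768, n_emotions: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.head = nn.Linear(256, n_emotions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x))

model = SERModel(n_emotions=8)                       # 8 RAVDESS emotion classes
# model.load_state_dict(torch.load("ravdess_checkpoint.pt"))  # hypothetical checkpoint

# Freeze the encoder and swap in a head for the target label set (e.g. 7 MELD emotions).
for p in model.encoder.parameters():
    p.requires_grad = False
model.head = nn.Linear(256, 7)

optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch standing in for utterance-level features and labels from the
# naturalistic dataset; a real run would iterate over a DataLoader.
features, labels = torch.randn(16, 768), torch.randint(0, 7, (16,))
loss = loss_fn(model(features), labels)
loss.backward()
optimizer.step()
```

Whether head-only adaptation or full fine-tuning is preferable would depend on the size and label overlap of the naturalistic corpus.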
Another advantage of RAVDESS is that it is multimodal; however, the current paper performs emotion recognition using only the speech data.
Further, the paper has no comparison with existing state-of-the-art methods for SER.
I suggest that the authors resubmit the manuscript afresh after evaluating the proposed model on natural datasets (like IEMOCAP, MELD) and comparing the results with existing state-of-the-art methods for SER.
The manuscript in its existing form does not provide sufficient novelty for advancing the research in SER.
Other general comments:
The motivation for combining HuBERT, LSTM, and ResNet needs to be strengthened; that is, the gap in the existing literature that motivated the authors to explore the proposed framework should be highlighted.
The extra spacing between the bulleted contribution points can be reduced.
Present the relevant literature review in a tabular form, highlighting methods (ML/DL models), dataset, features, and results.
Figure 3 should be prepared in a vector graphics format (e.g., PDF or SVG) so that the image does not become pixelated when zoomed in; a small export sketch is given after these comments.
The authors have used the closing quote character for both opening and closing quotes. In LaTeX, opening quotes should be typed with backticks (` or ``) and closing quotes with apostrophes (' or ''), rather than using the closing quote character for both.
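As a small illustration of the vector-format point, assuming the figures are generated in Python with matplotlib; the data and filenames below are placeholders, not the paper's figure.

```python
# Illustrative only: exporting a matplotlib figure in vector formats (PDF/SVG),
# which stay sharp at any zoom level. Data and filenames are placeholders.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [0.62, 0.70, 0.74, 0.78], marker="o")  # placeholder curve
ax.set_xlabel("x")
ax.set_ylabel("y")

fig.savefig("figure3.pdf")  # vector output, no pixelation when zoomed
fig.savefig("figure3.svg")
```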
The core idea of combining these powerful models from different domains to address the SER task is innovative. However, the manuscript suffers from several severe and fundamental flaws that critically undermine the validity and credibility of the findings. The most significant issues are grossly inconsistent reporting of experimental results (accuracy) and a major contradiction regarding the core dataset used for the experiments. Furthermore, there are errors in tables and references.
The description of the model's overall architecture (Fig. 1) is reasonably clear. However, the details of feature extraction are ambiguous. The abstract and Section 3.2 mention the use of both MFCCs and spectrograms, yet in Section 2 (Methodology) the input to both the ResNet-50 and HuBERT-LSTM modules is described as a log-Mel spectrogram. The authors do not explain how the 40-dimensional MFCC features are used in conjunction with the 300×300 spectrogram images; this requires a more detailed explanation. Moreover, owing to the dataset confusion and the broken repository link, the study is currently not reproducible.
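For reference, a minimal librosa sketch of the two feature types in question. Only the 40 MFCC coefficients and the roughly 300×300 spectrogram target size come from the manuscript as quoted; the file name, sampling rate, and number of Mel bands are assumptions.

```python
# Illustrative sketch of the two feature types discussed above, computed with librosa.
# The file name, sampling rate, and n_mels are assumptions; only n_mfcc=40 and the
# ~300x300 spectrogram target size are taken from the figures quoted in the review.
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=16000)        # hypothetical input utterance

# (1) 40-dimensional MFCC features, shape (40, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

# (2) Log-Mel spectrogram for the CNN branch, shape (n_mels, n_frames);
#     the manuscript implies this is later resized to roughly 300x300 for ResNet-50.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(mfcc.shape, log_mel.shape)
```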
For the literature review, I suggest the authors read the following recent studies:
“CENN: Capsule-enhanced neural network with innovative metrics for robust speech emotion recognition”
“Sparse temporal aware capsule network for robust speech emotion recognition.”
These works introduce Capsule-based and sparse temporal attention mechanisms tailored for SER, with an emphasis on robustness in low-resource and imbalanced settings.
**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.
The current conclusions are invalid because they are based on contradictory and unverifiable results. The conclusion restates the 78% accuracy figure, which conflicts with other data in the paper. Any claims about the model's effectiveness are unsupported until all data discrepancies are resolved.
The inclusion of an ablation study is a positive feature. However, the discussion of the results could be more precise. For example, the authors should re-check the reported Precision values in Table 3, where the "w/o ResNet-50" and "w/o Transformer and ResNet-50" configurations yield the exact same score (0.7538). While possible, this is unusual and warrants verification.
This is the manuscript's most serious flaw. The authors describe using two entirely different datasets in different sections, making the experimental foundation unreliable.
In Section 3 (Implementation and Experiments) and 3.1 (Dataset), the paper explicitly states that the RAVDESS Emotional Speech Audio Dataset was used. The experimental results and analysis appear to be based on this dataset, with detailed descriptions of its composition (24 actors, 1,440 files).
However, in Section 4.1.2 (Dataset Acquisition) and Section 8 (Third-Party Data), the authors state that the nEMO dataset, accessible via Huggingface, was utilized for model training and evaluation.
These are two completely different datasets. The authors must clarify which dataset was actually used and ensure that all descriptions, links, and references are consistent. This error calls the entire study's credibility into question.
The paper presents a novel approach for Speech Emotion Recognition (SER) using a Hybrid-Module-Transformer model. However, several aspects require significant improvement.
1) Intro & Background:
The introduction simply lists previous works and lacks an in-depth analysis of the cited papers. For example, while it mentions various models such as RNNs, HMMs, and DNNs, the authors should provide a more detailed comparison (e.g., a table) of their advantages and drawbacks in the context of SER.
Additionally, the discussion of recent advancements (Transformers) in SER should be expanded to show the novelty of the proposed model more clearly. The authors should analyze the advantages of the proposed method over other Transformer-based SER methods; see, for example:
[1] Chen, Shuaiqi, et al. "DWFormer: Dynamic window transformer for speech emotion recognition." ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[2] Wang, Yong, et al. "Time-frequency transformer: A novel time frequency joint learning method for speech emotion recognition." International Conference on Neural Information Processing. Singapore: Springer Nature Singapore, 2023.
[3] Wang, Yong, et al. "Speech swin-transformer: Exploring a hierarchical transformer with shifted windows for speech emotion recognition." ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.
[4] Chen, Weidong, et al. "DST: Deformable speech transformer for emotion recognition." ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
2) Experimental Design:
The paper claims high accuracy rates (78% and 95%), but the evaluation metrics and experimental setup are not well-defined.
3) Validity of the findings:
A stricter evaluation with cross-validation and comparison with a wider range of baselines is necessary; a minimal sketch of such a protocol is given after these points.
In addition, only one dataset (RAVDESS) is used for evaluation, which cannot by itself demonstrate the method's effectiveness. The authors should add at least one more dataset for evaluation; IEMOCAP or MSP-IMPROV, both of which contain more samples, would be suitable.
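To make the requested protocol concrete, a minimal sketch of stratified k-fold cross-validation that reports accuracy and weighted F1 per fold. The classifier, features, and labels below are placeholders, not the proposed model or the paper's data.

```python
# Minimal sketch of the stricter evaluation protocol suggested above: stratified
# 5-fold cross-validation with accuracy and weighted F1 reported across folds.
# The classifier and features are placeholders, not the authors' model.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))            # placeholder utterance-level features
y = rng.integers(0, 8, size=200)          # placeholder labels for 8 emotion classes

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accs, f1s = [], []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], pred))
    f1s.append(f1_score(y[test_idx], pred, average="weighted"))

print(f"accuracy:    {np.mean(accs):.3f} ± {np.std(accs):.3f}")
print(f"weighted F1: {np.mean(f1s):.3f} ± {np.std(f1s):.3f}")
```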
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.