Javascript is disabled in your browser. Please enable Javascript to view PeerJ.

Review History
Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention

All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.

Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.

View examples of open peer review.

Summary

The initial submission of this article was received on March 4th, 2025 and was peer-reviewed by 2 reviewers and the Academic Editor.
The Academic Editor made their initial decision on June 21st, 2025.
The first revision was submitted on July 23rd, 2025 and was reviewed by 1 reviewer and the Academic Editor.
The article was Accepted by the Academic Editor on September 8th, 2025.

Version 0.2 (accepted)

Aurora Saibene · Sep 8, 2025 · Academic Editor

Accept

I thank the Authors for having addressed the Reviewers’ concerns.

Reviewer 2 is satisfied by the implemented changes and your responses to Reviewer 1 and corresponding modifications have been assessed by me. All Reviewer 1’s concerns have been clearly addressed.

Considering the previously reported comments, the manuscript is now ready for publication.

The Editorial Staff will surely guide you in the revision of paper formatting, such as the separated acronym tables in the Appendix A, text reference style, and figure dimensions.

I leave a couple of formatting-related comments in what follows.

Thank you for submitting your paper to PeerJ in Computer Science and best wishes for your future research.

Comments

Lines 102-103 (highlighted manuscript) should be revised, having that the provided statement refers to a phrase that is not directly attached to the statement itself.

At line 236 (highlighted manuscript) there is a broken reference.

At line 653 (highlighted manuscript), you should report the extended version of the WA and UW acronyms.

Landry et al. (2020) performance value seems to be missing from Table 12.

Please, check the code links.

[# PeerJ Staff Note - this decision was reviewed and approved by Xiangjie Kong, a PeerJ Section Editor covering this Section #]

Barlian Henryranu Prasetio · Jul 28, 2025

Basic reporting

The revision successfully addresses the key reviewer concerns, particularly in language clarity, expanded references, and logical flow.

Experimental design

-

Validity of the findings

-

Cite this review as

Prasetio BH (2025) Peer Review #2 of "Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention (v0.2)". PeerJ Computer Science

Download Version 0.2 (PDF) Download author's response letter - submitted Jul 23, 2025

Version 0.1 (original submission)

PeerJ Staff · Jun 21, 2025 · Academic Editor

Major Revisions

**PeerJ Staff Note:** Please ensure that all review and editorial comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.

**Language Note:** The review process has identified that the English language must be improved. PeerJ can provide language editing services - please contact us at [email protected] for pricing (be sure to provide your manuscript number and title). Alternatively, you should make your own arrangements to improve the language quality and provide details in your response letter. – PeerJ Staff

Reviewer 1 · Apr 21, 2025

Basic reporting

The manuscript is clearly written and uses professional, technically accurate English throughout. The structure follows the standard academic format and includes well-labeled figures and tables that appropriately support the study. However, there are several areas that could be improved:

1. While the authors provide a general background, the related work section is overly reliant on classical CNN/LSTM models. The manuscript would benefit significantly from referencing more recent architectures in SER, such as capsule networks [https://doi.org/10.1016/j.knosys.2024.112499] or GAN-based data augmentation models [https://doi.org/10.1016/j.cmpbup.2024.100152]. In particular, recent work such as capsule-enhanced neural models and GAN-based balancing strategies would help position this work more clearly in the current state of the art.

**PeerJ Staff Note:** It is PeerJ policy that additional references suggested during the peer-review process should only be included if the authors are in agreement that they are relevant and useful.

2. Some of the formulas (e.g., Eq. (4)–(6)) lack a detailed explanation of dimensions or variable definitions. Additionally, diagrams such as Figures 2–3 could benefit from consistent module labeling and a clearer legend to improve visual clarity.

3. Report additional computational metrics such as the number of parameters or inference latency.

Experimental design

The research addresses a meaningful and well-defined problem within the scope of PeerJ Computer Science. The integration of parallel CNNs, a Transformer encoder, and co-attention fusion is clearly motivated, and the investigation is technically rigorous.

However, the study does not provide metrics on model size, complexity (e.g., FLOPs), or inference speed, which are important for assessing real-world deployability, especially in HCI applications.

Validity of the findings

The experimental results are generally well presented and align with the original research question. The conclusions are appropriate and supported by the data. However, the lack of a detailed comparison with more advanced recent models weakens the strength of the findings. For instance, results could be compared to capsule-based SER or adversarial data augmentation approaches, which are gaining traction in this field.

Cite this review as

Anonymous Reviewer (2025) Peer Review #1 of "Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention (v0.1)". PeerJ Computer Science

Barlian Henryranu Prasetio · Apr 23, 2025

Basic reporting

General Comments:
The manuscript is written in clear and professional English. However, some sections could benefit from language tightening to improve conciseness and clarity, especially in descriptions of the model architecture and feature extraction.
The article is well-structured with an appropriate introduction, methodology, and discussion. Yet, some figures (e.g., Figures 9 and 10) lack visual clarity and consistent referencing in the text.
Definitions of technical terms (e.g., WA, UW, co-attention) are scattered or implied. These should be explicitly defined where first introduced, ideally with a list of acronyms.
Although the study is self-contained, the introduction would benefit from more direct articulation of the research gap and how this work uniquely addresses it.

Inline Suggestions:
Line 13 (Abstract): Suggest explicitly defining WA and UW for clarity.
Line 105: Consider bulleting the contributions more succinctly or integrating them better into the narrative.
Figures 9 & 10: Improve figure resolution and ensure all figure references in the main text are consistent and explained.
Throughout the Results section: Avoid redundancy in reporting metrics—tables already convey much of the content.

Experimental design

General Comments:
The research question is meaningful and clearly situated within the speech emotion recognition (SER) field.
While the model design is described in detail, some experimental procedures—such as the implementation of early stopping, choice of batch size, and hyperparameter tuning—need more elaboration.
The inclusion of several augmentation techniques is appreciated, but there is limited justification or validation on whether these augmentations preserve emotional content.
Although the code is shared via GitHub, reproduction of results would benefit from clearer documentation, including dependencies, environment setup, and data usage instructions.

Inline Suggestions:
Line 538: Clarify how early stopping was implemented (e.g., validation loss plateau over how many epochs?).
Line 453 (Model Training): Specify versions of major libraries (e.g., PyTorch, TensorFlow) and environment (e.g., CUDA version).
Line 409: Consider adding an evaluation or visualization of class distribution before and after augmentation.
Line 425: Recommend validating whether augmented samples (e.g., pitch-shifted, AWGN) retain emotional clarity.

Validity of the findings

General Comments:
The results are promising and demonstrate consistent performance improvements with the proposed model.
However, there is no indication whether differences in performance (e.g., across models) are statistically significant. A statistical test (e.g., Wilcoxon signed-rank test) could strengthen the claims.
The authors briefly mention limitations (e.g., the challenge of recognizing neutral emotion) but do not explore them thoroughly.
The generalization across datasets is a strength, yet additional commentary on the linguistic or demographic diversity of these datasets would help contextualize robustness.

Inline Suggestions:
Table 5: Add statistical significance indicators between model performances, if available.
Line 661: Expand on why neutral emotions are difficult to classify—perhaps link to acoustic features or labeling inconsistency.
Line 687: Discuss generalization across different languages (English, Persian, etc.)—does the model require adaptation for each language?

Additional comments

The paper presents an innovative approach to SER using a co-attention-enhanced architecture and achieves encouraging results across multiple datasets. With some improvements in clarity, methodological transparency, and statistical validation, the paper will be a strong contribution to the field.

Cite this review as

Prasetio BH (2025) Peer Review #2 of "Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention (v0.1)". PeerJ Computer Science

Download Original Submission (PDF) - submitted Mar 4, 2025

All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Review History Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention

Summary

Version 0.2 (accepted)

Aurora Saibene · Sep 8, 2025 · Academic Editor

Barlian Henryranu Prasetio · Jul 28, 2025

Basic reporting

Experimental design

Validity of the findings

Version 0.1 (original submission)

PeerJ Staff · Jun 21, 2025 · Academic Editor

Reviewer 1 · Apr 21, 2025

Basic reporting

Experimental design

Validity of the findings

Barlian Henryranu Prasetio · Apr 23, 2025

Basic reporting

Experimental design

Validity of the findings

Additional comments

Review History
Enhancing speech emotion recognition through parallel CNNs with transformer encoder and co-attention