All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
The reviewers appreciated the changes the authors made to the manuscript, and I can therefore recommend this article for acceptance.
[# PeerJ Staff Note - this decision was reviewed and approved by Shawn Gomez, a PeerJ Section Editor covering this Section #]
The concerns I raised regarding the attention mechanism and its placement in the model were thoroughly addressed. The updated Figure 1 now clearly shows the attention layer, and the corresponding section in the methodology provides a detailed explanation of how it works and how it contributes to refining feature representations before the final prediction layer. This significantly improves the interpretability of the model structure.
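For concreteness, here is a minimal sketch of that placement (CNN features → BiLSTM → attention → prediction head). The dimensions, the simple single-head scoring function, and all variable names are illustrative assumptions on my part, not the authors' implementation; the manuscript itself uses multi-head attention.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Illustrative attention layer placed between a BiLSTM and the final
    regression head, as the revised Figure 1 depicts. The linear scoring
    function and dimensions are assumptions, not the authors' design."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)  # learns a relevance score per time step

    def forward(self, h):                               # h: (batch, time, hidden_dim)
        weights = torch.softmax(self.score(h), dim=1)   # attention weights over time
        return (weights * h).sum(dim=1)                 # weighted sum -> (batch, hidden_dim)

# Hypothetical placement: BiLSTM output feeds attention, then the prediction layer.
bilstm = nn.LSTM(input_size=128, hidden_size=64, bidirectional=True, batch_first=True)
attend = AttentionPooling(hidden_dim=128)   # 2 * 64 for the two directions
head = nn.Linear(128, 2)                    # valence and arousal regression outputs

x = torch.randn(4, 100, 128)                # (batch, frames, CNN feature dim)
h, _ = bilstm(x)
y = head(attend(h))                         # (batch, 2)
```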
In addition, the authors provided an extended explanation of Equation (1) and clarified how their BiLSTM implementation differs from the standard LSTM. I found this clarification helpful, especially in the context of how forward and backward hidden states are used together.
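For readers without the revised text at hand, the standard bidirectional combination this clarification turns on is usually written as follows (the authors' Equation (1) may differ in notation):

$$
\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\!\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad
\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\!\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad
h_t = \left[\,\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\,\right]
$$

so that each time step's representation $h_t$ concatenates a forward pass over the past with a backward pass over the future.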
Regarding Figure 3, I appreciate that the authors didn’t just keep it for formality but took the time to redraw and contextualize it properly. The revised diagram makes the bidirectional structure explicit, which was previously ambiguous. These updates collectively improve the clarity and completeness of the manuscript.
I had previously pointed out the lack of discussion around dataset bias and training strategies. The revised version now explicitly acknowledges potential cultural and demographic biases in the DEAM and PMEmo datasets. I find this addition important, as it reflects an awareness of generalization challenges—especially relevant for music emotion recognition tasks across diverse populations.
Furthermore, the authors have now provided a clear description of the training setup, including how hyperparameters were selected and how data was preprocessed and augmented. These details make the study much more reproducible and trustworthy.
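To give a sense of the kind of detail now reported, such a training setup is often summarized as a configuration like the one below. Every value here is a hypothetical placeholder, not the authors' reported setting.

```python
# Purely illustrative configuration -- placeholders, NOT the authors' reported values.
config = {
    "features": "log-mel spectrogram",   # typical audio front end (assumption)
    "optimizer": "Adam",                 # hypothetical choice
    "learning_rate": 1e-3,               # hypothetical value
    "batch_size": 32,                    # hypothetical value
    "max_epochs": 100,                   # hypothetical value
    "early_stopping_patience": 10,       # hypothetical value
    "augmentation": ["pitch_shift", "time_stretch", "additive_noise"],  # common choices
}
```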
The performance results were already impressive, but my concern was the lack of statistical context. The authors have now added comparative statistics showing significant improvements over previous models in terms of R² and MAE, and they quantify these differences clearly.
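As an illustration of this kind of statistical context, one common pattern is to report R² and MAE alongside a paired significance test on per-sample absolute errors. The sketch below uses synthetic data purely to show the mechanics; it is not the authors' evaluation code.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-ins for ground truth and two models' predictions.
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
preds_new = y_true + rng.normal(scale=0.3, size=200)       # hypothetical proposed model
preds_baseline = y_true + rng.normal(scale=0.5, size=200)  # hypothetical baseline

print("R2  (new):", r2_score(y_true, preds_new))
print("MAE (new):", mean_absolute_error(y_true, preds_new))

# Paired t-test on the two models' absolute errors over the same samples.
t, p = stats.ttest_rel(np.abs(y_true - preds_new), np.abs(y_true - preds_baseline))
print(f"paired t-test: t={t:.2f}, p={p:.4f}")
```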
More importantly, the newly added ablation study is a strong contribution. It isolates the effects of each module (CNN, BiLSTM) and shows how their removal affects performance. This validates the architectural choices made in the proposed model and gives readers a deeper understanding of component-wise contributions.
I also appreciate the honest discussion in the conclusion about the model’s limited applicability to non-Western music and the commitment to explore this in future work. It’s important to see limitations acknowledged, and I think this addition strengthens the paper.
The authors have outlined plans to explore real-world applications of their model, including potential user studies, which is great to see. They’ve also improved the readability of figures, and I can confirm that the visuals (particularly Figures 1 and 3) are now clearer and more informative.
Finally, the decision to release the code and configurations via GitHub is a welcome move. It aligns well with open research practices and will be helpful to others working in this space.
The manuscript is well written and easy to understand, and it aligns with the Aims and Scope of the journal. All points raised in the reviews have been addressed thoroughly. The title has also been revised and is now consistent with the work.
**PeerJ Staff Note:** Please ensure that all review, editorial, and staff comments are addressed in a response letter and that any edits or clarifications mentioned in the letter are also inserted into the revised manuscript where appropriate.
This paper introduces an innovative CNN-BiLSTM model for music emotion regression.
The manuscript is well-structured and adheres to academic standards, with clear language and relevant citations. However, the following points need to be addressed.
- In Figure 1, it would be better to add the proposed attention mechanism. In particular, how the attention mechanism is applied to the proposed method is not well described.
- Additional description of Eq. (1) is required, clarifying the difference between the standard LSTM and the proposed LSTM.
- Does Figure 3 differ from the basic LSTM structure? If so, please describe the difference; if not, I recommend deleting Figure 3.
The proposed model and methodology are well-defined, incorporating CNN, BiLSTM, and multi-head attention. While the study utilizes well-known datasets (DEAM, PMEmo), it lacks a discussion on potential dataset biases and their impact on generalization.
Furthermore, more details on hyperparameter selection and model training strategies would strengthen the reproducibility of the study.
The results indicate that the proposed CNN-BiLSTM model outperforms existing methods, with an impressive R² of 0.845. However, the statistical significance of these improvements is not thoroughly discussed, and additional ablation studies would help verify the contribution of each component. Moreover, the study should address potential limitations, such as its applicability to non-Western music genres.
This study presents a valuable contribution to the field of music emotion recognition and its application in psychological therapy. Nevertheless, a deeper exploration of real-world applications, such as case studies or user studies, would enhance its practical relevance. Finally, improving the clarity of some figures and providing supplementary materials, such as code and hyperparameter configurations, would be beneficial for future researchers.
The present paper is well written in professional, easy-to-follow English.
In line 52, check the fragment ‘relaxation.be computationally’ and make the necessary corrections.
In line 82, the abbreviation PTSD should be defined at first use.
The work proposes a novel methodology combining a CNN architecture with Bidirectional Long Short-Term Memory (BiLSTM) networks to analyze and predict the emotional impact of musical audio. For this, the authors make use of two commonly used databases comprising over 9,000 music clips.
The authors have shown, using widely adopted metrics such as RMSE, R², and MAPE, that the model outperforms others in correctly predicting the arousal and valence dimensions of the annotated music emotion databases.
Although the paper may be technically strong (neural network architectures are not my core area of expertise), the title is quite misleading. The paper does not shed any light on how the accurate identification of emotion would help in psychological therapy.
The authors would benefit from elaborating on how they propose this innovative regression model would help in psychological therapy. Otherwise, they should drop that portion from the title and pitch the paper as a new technological advancement built on a novel deep learning architecture, which I feel is the paper's strength.
The article content is well within the Aims and Scope of the journal. The two neural network architectures, CNN and BiLSTM, have been combined here to form an overlap model, which achieves higher classification accuracy in MER than the two architectures individually. All technical points related to the applied architecture have been rigorously reported, such as loss curves and accuracy, and a detailed description of the tools applied has been provided, along with the respective code and algorithms. The novelty of the paper lies in the description and applicability of this overlap model, and the relevant technical details have been elaborately addressed and explained.
Emotion classification with neural networks has become a very popular research domain in recent times, and numerous works have been reported so far. Given this, it is hard to add an innovative approach to the field. The novelty of this work, however, lies in the fact that a CNN-BiLSTM overlap model has been proposed and applied, which successfully shows higher classification accuracy on the large dataset of 9,000 clips.
However, despite what the title suggests, the findings do not discuss or explore any points within the domain of psychological therapy. My main interest in the work was the application of the proposed CNN-BiLSTM overlap model to assessing the therapeutic effects of music, which has not been addressed here. All the points covered in the ‘Conclusion’ section discuss the technicalities of the model, not its applicability to music intervention therapy. This deviates from the focal idea with which the work started, and the title becomes somewhat misleading as a result. Leveraging MER in the discipline of scientific music therapy is indeed a potent area of exploration and would have added ample novelty to the paper beyond its technical soundness.
The authors should either change the title to reflect the points reported in the paper, or incorporate valid studies and discussions regarding the application of the CNN-BiLSTM overlap model to psychological therapy, which is itself a vast field of research with its own challenges and complications. How well the proposed model fits into this domain requires rigorous study. Either such points should be reported in the paper, or, if the authors decide to focus only on the technical details and applied algorithms, the title should be changed accordingly.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.