Arabic speech emotion recognition (2015–2024): A systematic review of datasets, dialects, and classification methods
Abstract
Background. Arabic Speech Emotion Recognition (SER) is increasingly important for Human–Computer Interaction (HCI) applications such as mental health monitoring, adaptive learning systems, and smart environments. Progress in this field is constrained by the linguistic diversity of Arabic and by the limited availability of well-documented emotional speech datasets. These limitations hinder the development of generalizable SER models and restrict cross-study comparability.
Methodology. This review systematically examines Arabic SER research published between 2015 and 2024. A PRISMA-guided process was used to identify 83 eligible studies across major academic databases. We analyse 24 emotional speech datasets in terms of dialectal coverage, emotional categories, speaker demographics, and elicitation method (acted, semi-natural, and natural). We also review the classification approaches used in the field, including Classical Machine Learning, Deep Learning, and Transformer-based Self-Supervised Learning, and evaluate how dataset characteristics influence reported outcomes.
Results. The review reveals substantial variability in dataset design, annotation practices, and evaluation protocols. Most datasets are acted and dominated by a small set of emotions; spontaneous speech, nuanced affective states, and dialects such as the Levantine varieties remain underrepresented. Speaker metadata is inconsistently reported, and many datasets are not publicly accessible, which restricts reproducibility. Recent modelling trends show a transition from handcrafted-feature approaches to Deep Learning and Self-Supervised Learning, yet the lack of standardized benchmarks prevents meaningful comparison across studies.
Conclusions. Arabic SER research has advanced in methodological diversity and modelling capabilities, but structural limitations in dataset availability, dialect representation, and evaluation standards continue to impede progress. Developing dialect-inclusive, openly available emotional speech corpora with transparent metadata, balanced emotion coverage, and unified benchmarking protocols is essential for supporting robust, generalizable SER systems.