Cross-modal emotion recognition with causality inference in human conversations
Abstract
Emotion recognition plays an important role in a wide range of application domains. Although previous studies have made progress in this area, they often fall short of fully understanding emotions and of inferring their underlying causes. To address these limitations, we propose an emotion recognition framework that integrates visual, audio, and textual modalities within a unified architecture. The framework incorporates an adaptive cross-modal attention module to capture inter-modal interactions; this module dynamically adjusts the contribution of each modality based on contextual relevance, enhancing recognition accuracy. Additionally, an emotion causality inference module employs a fine-tuned, trainable LLaMA2-Chat (7B) model to jointly process image and text data, enabling precise identification of the causes behind detected emotions. Furthermore, a real-time emotion feedback module delivers instantaneous assessments of emotional states during conversations, supporting timely and context-aware interventions. Experimental results on four datasets, SEMAINE, AESI, ECF, and MER-2024, demonstrate that our method achieves improvements in F1 score over baseline methods.
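The abstract does not specify how the adaptive cross-modal attention module is implemented. The following is a minimal PyTorch sketch of one plausible form, assuming each modality is encoded as a sequence of d-dimensional features; the class name, gating design, and shapes are illustrative assumptions rather than the authors' actual implementation.

```python
# Illustrative sketch only: dynamic modality weighting via cross-attention
# plus a relevance gate; names and design choices are assumptions.
import torch
import torch.nn as nn

class AdaptiveCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # A single shared cross-attention layer keeps the sketch compact;
        # per-modality-pair layers would also be possible.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gating network that scores each modality's contextual relevance.
        self.gate = nn.Linear(dim, 1)

    def forward(self, visual, audio, text):
        # visual / audio / text: (batch, seq_len, dim) feature sequences.
        modalities = [visual, audio, text]
        fused = []
        for i, query in enumerate(modalities):
            # Attend from one modality to the concatenation of the others.
            others = torch.cat([m for j, m in enumerate(modalities) if j != i], dim=1)
            attended, _ = self.cross_attn(query, others, others)
            fused.append(attended.mean(dim=1))            # (batch, dim) per-modality summary

        stacked = torch.stack(fused, dim=1)               # (batch, 3, dim)
        # Dynamic modality weights derived from contextual relevance scores.
        weights = torch.softmax(self.gate(stacked), dim=1)  # (batch, 3, 1)
        return (weights * stacked).sum(dim=1)             # (batch, dim) fused representation


if __name__ == "__main__":
    b, t, d = 2, 10, 256
    block = AdaptiveCrossModalAttention(dim=d)
    out = block(torch.randn(b, t, d), torch.randn(b, t, d), torch.randn(b, t, d))
    print(out.shape)  # torch.Size([2, 256])
```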