Emoji-aware multimodal representations for multi-label emotion analysis in Indonesian social media
Abstract
Text Emotion Analysis (TEA) has received growing attention, yet most studies rely solely on textual features and adopt single-label classification, oversimplifying the complexity of human emotions. Research on low-resource languages such as Indonesian remains limited, and few datasets incorporate paralinguistic cues such as emojis despite their strong affective value. To address these gaps, this study proposes an emoji-aware multimodal framework for multi-label emotion classification of Indonesian social media texts. A novel dataset was constructed using a semi-automated annotation pipeline that integrates distant supervision with expert refinement, combining textual and emoji signals. Three fusion strategies were implemented: early (concatenation and tokenizer expansion), intermediate (cross-attention), and late (average and weighted). All were compared against a text-only IndoBERT baseline. Experimental results show that every multimodal fusion strategy consistently outperforms the baseline, confirming the value of emojis as paralinguistic features. The tokenizer-expansion-based early fusion achieved the highest F1-Micro (0.8559), while the cross-attention-based intermediate fusion attained the best F1-Macro (0.7409), indicating the strongest overall performance and the most balanced per-label performance, respectively. This study advances TEA in low-resource languages by (i) introducing a new Indonesian emoji-aware multi-label dataset, (ii) providing comparative insights into multimodal fusion strategies, and (iii) demonstrating the benefits of integrating emojis for affective text understanding.
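As a concrete illustration of one of the strategies summarized above, the following minimal PyTorch sketch shows a weighted late-fusion classifier. It assumes pooled text and emoji representations are already available (e.g., a 768-dimensional IndoBERT output and a 300-dimensional emoji embedding) and uses a hypothetical six-label emotion set; these dimensions and the learnable-weight parameterization are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class WeightedLateFusion(nn.Module):
    # Scores each modality independently, then blends the per-label
    # logits with a single learnable weight (the "weighted" late-fusion
    # variant). Dimensions and label count are placeholder assumptions.
    def __init__(self, text_dim=768, emoji_dim=300, num_labels=6):
        super().__init__()
        self.text_head = nn.Linear(text_dim, num_labels)
        self.emoji_head = nn.Linear(emoji_dim, num_labels)
        self.alpha = nn.Parameter(torch.tensor(0.0))  # sigmoid(0) = 0.5

    def forward(self, text_repr, emoji_repr):
        w = torch.sigmoid(self.alpha)  # keep the blend weight in (0, 1)
        logits = (w * self.text_head(text_repr)
                  + (1.0 - w) * self.emoji_head(emoji_repr))
        return torch.sigmoid(logits)  # independent per-label probabilities

# Stand-in pooled representations for a batch of two posts.
model = WeightedLateFusion()
probs = model(torch.randn(2, 768), torch.randn(2, 300))
print(probs.shape)  # torch.Size([2, 6])

Each output unit passes through its own sigmoid rather than a shared softmax, so a single post can express several emotions at once, which is the multi-label setting the abstract targets.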