Comparative analysis of multilingual transformer models for Urdu sentiment analysis
Abstract
Urdu is spoken by over 169 million people worldwide, concentrated mainly in Pakistan and India, and the widespread adoption of the internet has fueled a rapid surge in user-generated Urdu content, including opinions and reviews posted on social media and other online platforms. Despite this vast volume, language resources and tools for processing and analyzing Urdu text remain scarce, as most natural language processing (NLP) tools and models are developed primarily for English and other widely spoken languages. This research addresses that gap by systematically evaluating the performance of multilingual transformer models for sentiment analysis of Urdu text, drawing on the Urdu Corpus for Sentiment Analysis (UCSA). The study compares mBERT, DistilBERT-multilingual-cased, and XLM-RoBERTa-base using test accuracy, F1-score, precision, and recall. The findings indicate that XLM-RoBERTa-base outperforms its counterparts, achieving the highest test accuracy of 0.795, an F1-score of 0.795, a precision of 0.799, and a recall of 0.795. These results suggest that XLM-RoBERTa-base is a highly effective and robust model for sentiment classification in Urdu.
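To make the evaluation setup concrete, the sketch below shows one plausible way to score a checkpoint on held-out Urdu reviews with the metrics reported above, using Hugging Face Transformers and scikit-learn. The model name `xlm-roberta-base` is the real Hub identifier; the two-example dataset, the binary label scheme, and the `average="weighted"` metric averaging are illustrative assumptions, since the paper's preprocessing, splits, and averaging mode are not specified in the abstract. A real run would first fine-tune the classification head on the UCSA training split.

```python
# Minimal evaluation sketch, assuming a fine-tuned binary sentiment model.
# Here the checkpoint is the untuned base model, so outputs are effectively
# random; it only demonstrates the metric computation pipeline.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

MODEL_NAME = "xlm-roberta-base"  # compared against mBERT and DistilBERT-multilingual-cased

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

# Hypothetical held-out examples; in the study these would come from UCSA.
texts = ["یہ فلم بہت اچھی تھی", "مجھے یہ پروڈکٹ بالکل پسند نہیں آئی"]
labels = [1, 0]  # assumed label scheme: 1 = positive, 0 = negative

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    preds = model(**batch).logits.argmax(dim=-1).tolist()

# Weighted averaging is an assumption; the abstract does not state the mode.
acc = accuracy_score(labels, preds)
prec, rec, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

The same loop applies unchanged to the other two models by swapping `MODEL_NAME`, which is what makes a like-for-like comparison across the three checkpoints straightforward.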