The shift towards auto-labeling in NLP: A literature review on LLM-based methods
Abstract
Manual data annotation is a challenging and labor-intensive task for subjective NLP applications in areas such as healthcare, cyberbullying detection, and sentiment analysis, where contextual understanding is required. Annotated data is essential for training supervised Machine Learning (ML) models. With the growing demand for high-quality labeled datasets, Large Language Models (LLMs) are increasingly used for automatic data annotation (auto-labeling). This review explores the use of LLMs for auto-labeling in Natural Language Processing (NLP) tasks, proposes a taxonomy categorizing the roles LLMs play in annotation pipelines, highlights pressing challenges, and offers practical insights and recommendations for future research. We analyzed 34 peer-reviewed articles published between 2019 and 2025 that apply LLMs to auto-labeling tasks such as sentiment analysis and stance detection. The review examines model types, training and prompting strategies, and compares LLM-based auto-labeling with traditional human labeling. We introduce a taxonomy that identifies four key LLM roles in auto-labeling: DirectLabeler, PseudoLabeler, AssistantLabeler, and DataGenerator. Across fine-tuning and prompting techniques, the findings show that LLMs generally approach, but do not surpass, human-level accuracy, while offering significant speed and cost advantages. Closed-source models dominate in frequency of use, yet open-source alternatives such as BERT remain important. Zero-shot and few-shot prompting are common because of their convenience, though fine-tuning yields better domain-specific results. Key challenges include prompt sensitivity, limited domain generalization, and ethical considerations. The proposed taxonomy offers a structured framework for understanding and developing LLM-based auto-labeling methods, with scope for expansion as the field evolves. These findings highlight the potential of LLMs to automate and enhance data labeling across NLP tasks. Based on the identified strengths and limitations, our recommendations emphasize domain adaptation, fine-tuning on diverse datasets, iterative prompt engineering, and privacy-conscious use of open-source models. Our findings suggest that, with continued research and careful deployment, LLMs are poised to improve annotation scalability and efficiency in NLP and beyond.