A conditional random field based approach for high-accuracy part-of-speech tagging using language-independent features
Abstract
Part-of-speech (POS) tagging is the process of assigning a tag or label to each word of a text according to its grammatical category. It exposes the grammatical structure of a text and plays an important role in many natural language processing tasks such as syntax understanding, semantic analysis, text processing, information retrieval, machine translation, and named entity recognition. Because POS tagging labels every word and depends on word order and context, it is a sequence labeling task. Urdu text processing faces several challenges, including resource scarcity, morphological richness, free word order, absence of capitalization, agglutinative nature, spelling variations, and multipurpose usage of words, which raise the demand for machine learning based automatic POS tagging systems for Urdu. Therefore, a conditional random field (CRF) based supervised POS classifier has been developed for 33 Urdu POS categories using language-independent features of Urdu text. It is trained and evaluated on the Urdu news dataset MM-POST, which contains 119,276 tokens from seven domains: Entertainment, Finance, General, Health, Politics, Science and Sports. An analysis of the proposed approach is presented, showing its advantage over other Urdu POS tagging research in following a simpler strategy that employs fewer word-level features, namely a context window together with the word length. The effective use of these features for the POS tagging of Urdu text resulted in state-of-the-art performance of the CRF model, with an overall classification accuracy of 96.1%.
Cite this as
2024. A conditional random field based approach for high-accuracy part-of-speech tagging using language-independent features. PeerJ Computer Science 10:e2577 https://doi.org/10.7717/peerj-cs.2577
Introduction
Part-of-speech (POS) tagging is the process of assigning tags or labels to each word of a sentence based on its grammatical category. These labels can be used for associating each word to its corresponding grammatical category (i.e., POS) in a given text (Warjri et al., 2021). POS tagging is an essential and often a prerequisite step in many natural language processing (NLP) applications including text analysis, syntax understanding, semantic analysis, information retrieval, machine translation and named entity recognition.
Due to its pivotal role in many NLP tasks, much attention has been given in the recent past to achieving accurate and efficient POS tagging in many different languages of the world. POS tagging has been widely covered and much advancement achieved for most Western languages. However, for resource-scarce languages like Urdu little work has been done particularly in the application of contemporary machine learning (ML) and deep learning approaches for Urdu POS tagging.
Urdu is the national language of Pakistan and is widely spoken across the world. POS tagging is a sequence labeling task, as it involves sequential order, context dependency, and the labeling of each word within the text. Many different approaches, including supervised learning, unsupervised learning, hybrids of rule-based and machine learning, and deep learning based approaches, have been followed for Urdu POS tagging. Supervised learning techniques require labeled data for training. Among the traditional supervised techniques, hidden Markov models (HMM), maximum entropy models (MaxEnt), and support vector machines (SVM) are popularly used for Urdu POS tagging. In this work, an Urdu POS classifier, also called a tagger, is developed for classification and prediction of POS tags of Urdu news text using the Mushtaq & Muzammil POS Tagged (MM-POST) dataset (Ali & Khan, 2024a), available at Ali & Khan (2024b). A simple, small and language-independent feature set has been selected for training a conditional random field (CRF) model to learn the patterns and predict Urdu POS tags in real Urdu text. The CRF is a probabilistic graphical model popular for segmentation and labelling tasks because of its low computational cost and its low requirement for extensive feature engineering.
Challenges of Urdu POS tagging
The challenges that generally make POS tagging of the low-resource Urdu language difficult include resource scarcity, morphological richness, absence of capitalization, free word order, agglutinative nature, spelling variations, and words serving multiple grammatical functions (Malik & Sarwar, 2015; Shah et al., 2016). Because of the complex morphological and syntactic structure of Urdu, word categorization is ambiguous and POS tagging requires careful handling. The scarcity of large, high-quality annotated corpora, the need for sophisticated feature sets, and the heavy computation required for learning and predicting POS tags add to the difficulty. Conditional random fields (CRF), a sophisticated supervised machine learning algorithm, are widely used for POS tagging in Western languages and have also been successfully applied to Urdu POS tagging. In this work, a CRF-based POS classifier has been developed for 33 Urdu POS categories, utilizing language-independent features and trained on the MM-POST Urdu news dataset, demonstrating its effectiveness in handling Urdu’s linguistic complexities.
Motivation for study
POS tag provides linguistic information about how a word can be used in a phrase, sentence or document. It helps in identifying grammatical context of the text that is highly effective in text prediction and generation. POS tagging is an essential part of many state-of-the-art NLP applications like text analysis, syntax understanding, semantic analysis, information retrieval, machine translation, text to speech systems, question answering, sentiment analysis and named entity recognition.
In the past, various rule-based approaches have resulted in encouraging performance for Urdu POS tagging. However, rule-based systems are difficult to develop and are less portable to other domains. Notable performance has been reported by various researchers using ML and deep learning approaches for Urdu POS tagging but limited application of various ML and deep learning techniques is found for Urdu POS tagging due to the existing challenges. Therefore, exploring the area of Urdu POS tagging can lead to feasible, efficient and automatic solutions.
A tagset of 12 Urdu POS categories with 32 subcategories was presented, and 100,000 words of the Urdu Digest Corpus were POS tagged through the decision-tree-based Tree Tagger using the Center for Language Engineering (CLE) tagset, achieving an accuracy of 96.8% (Ahmad et al., 2014). In our work, these main categories and subcategories of the CLE tagset have been used for annotating the training dataset, and they serve as the labels learned and predicted by the ML-based Urdu POS tagging system.
Different lexical word-level features, lexical character-level features, n-grams and word embeddings have been used as features in the literature for Urdu POS tagging. Adeeba, Hussain & Akram (2016) inferred that lexical features are more effective than structural features, and that the size of the training data affects accuracy, as larger data improves performance. In our previous work, an Urdu POS tagged dataset, the Mushtaq & Muzammil POS Tagged (MM-POST) dataset, comprising 119,276 words or tokens, was developed (Ali & Khan, 2024a). Annotated data and a good feature set are the key requirements for any machine learning classification system (Khan et al., 2019a). Therefore, an effective, simple, small and language-independent set of word-level features has been selected for devising an ML classification system that automatically identifies and predicts POS tags of Urdu news text using the MM-POST dataset.
Motivation for choice of method used
Rule-based, machine learning based, and deep learning based approaches have been used by researchers for POS tagging of the post-positional and morphologically rich Urdu language. Modern machine learning and deep learning techniques are useful due to their portability across domains. Supervised ML approaches require large labelled data for training models, but they induce rules automatically and in a shorter time than rule-based and deep learning approaches. The CRF, one of the more advanced supervised ML algorithms, is widely used for POS tagging of Western languages. A few researchers have also effectively demonstrated the use of the CRF technique for POS tagging of Urdu text. Its ability to exploit the correlation between adjacent POS tags and to capture sequential dependencies in little computation time makes the CRF a suitable choice for POS tagging.
The CRF has been preferred over other ML techniques and has been reported by the research community of Urdu POS tagging to achieve state-of-the-art performance. Therefore, CRF technique has been selected for implementation of our Urdu POS tagging system. CRF becomes the better choice in comparison to Neural Networks, Transformers or other deep learning based models for tasks like Urdu POS tagging that involves training data scarcity, availability of less computational resources, sequence labelling and handling rich morphology among other challenges. However, if these challenges of Urdu language are handled, the neural networks, Transformers and other deep learning models can offer many potential advantages like automatic feature extraction, contextualized embeddings, handling long range dependencies, providing transfer learning and better generalization etc.
Lexical features have been reported as more effective than structural features. Lexical word-level features are easy to determine and do not require linguistic knowledge or heavy computation. In the literature, language-dependent, language-independent and mixed feature sets have been used. Structural and language-dependent features are more complex, numerous and large. They are difficult to determine and require huge computational resources. In contrast, our work uses simple, few and easy-to-find word-level language-independent features. The feature set includes context word window features along with the Word Length feature of each token for training and testing a supervised machine learning model on the moderately sized Mushtaq & Muzammil POS Tagged (MM-POST) dataset (Ali & Khan, 2024a). The 33 POS categories of the CLE tagset have been used as labels.
The selected small set of five language-independent features has been used to train and test the CRF model for Urdu POS tagging on the MM-POST annotated dataset. For every current word or token, the feature set comprises the token itself, one immediately preceding lexical token (the previous token) and two successive lexical tokens (the next and second next tokens), which together serve as the context window, plus the Word Length of the current token. These features are used for learning and predicting the POS tag of every word or token of the dataset.
Our work differs from that of other researchers in the number and complexity of the features used to train the machine learning model. We used simple and few features (five in number) for training and testing the model. These features are language independent, i.e., selecting and understanding them does not require linguistic knowledge. This gives the selected feature set a wider scope for experimentation with many different tasks and techniques. Similarly, using fewer features to learn the pattern of Urdu POS tags keeps prediction within a modest computation time, which allows the approach to be extended to much larger datasets in the future. The Urdu news dataset MM-POST has been used for training and testing the model because news text is formal and rich in occurrences of different POS tags compared to other genres. Informal text has inconsistent syntax, less accurate wording and much more noise. In contrast, news text has consistent word structure, less noise and a rich variety of POS occurrences, which makes it a preferred choice for achieving better model performance in POS tagging of the resource-scarce Urdu language. To tackle POS tagging of informal text effectively, targeted experiments must adapt the techniques used for formal text to the distinct characteristics of informal language. This involves refining preprocessing methods and feature extraction strategies to improve the accuracy and robustness of models applied to informal datasets.
The 33 different grammatical categories or POS tags selected from the POS tagset of the Center for Language Engineering (CLE), in the creation and labeling of the MM-POST dataset, have been used as POS labels for the model’s training and prediction. Instead of relying on the requirements of large data, computational resources and sophisticated techniques that are difficult to interpret, we have built an efficient Urdu POS classifier that can effectively predict the POS label of Urdu text for the training dataset as well as the unseen validation data. Our approach benefits from the use of simple language independent features like context word window and word length utilizing medium sized dataset making the proposed approach extendable to other resource-scarce languages.
The article is organized as follows. ‘Related Work’ provides a survey of the related work regarding Urdu POS tagging. ‘Research Methodology’ discusses the research methodology adopted for this work and the evaluation of results is presented in ‘Evaluation of Results’. Finally, the ‘Conclusion and Future Work’ section concludes the article and provides directions for future work.
Research methodology
The research methodology for Urdu POS tagging through a supervised CRF-based learning approach involves the use of a dataset for training, testing, and evaluation of the model, the selection of a features-set, and experimentation for evaluation of the proposed approach as explained below.
Urdu POS tagging through CRF
The conditional random field (CRF) is a probabilistic graphical model suitable for segmentation and sequence labeling tasks. The CRF is characterized by its simplicity, its low computation time and its low requirement for extensive feature engineering, thereby minimizing the workload of human experts (Khan et al., 2019a). It is an advanced supervised ML algorithm that can capture sequential dependencies among data points, and it is used for Urdu POS tagging as one of the more advanced techniques.
When utilizing CRF for POS tagging, the tokens are represented as an observation sequence $X=(x_1,x_2,\ldots,x_n)$ and labeled with a tag sequence $Y=(y_1,y_2,\ldots,y_n)$. The CRF model aims to identify the tag sequence Y that maximizes the conditional probability of Y given the sequence X, which is mathematically expressed (Khan et al., 2019b) as shown in Eq. (1).
$$P(Y|X)=\frac{1}{Z(X)}\exp\left(\sum_{i=1}^{n}\sum_{j=1}^{m}\lambda_j f_j(y_i,y_{i-1},X,i)\right) \tag{1}$$
P(Y|X) is the conditional probability of the output sequence Y given the input sequence X.
Z(X) is the normalization factor or partition function.
λj represents the parameters or weights associated with feature functions fj.
fj are feature functions capturing dependencies between neighboring variables in the sequence.
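Eq. (1) can be made concrete with a small brute-force sketch. The two labels, the placeholder tokens, the two feature functions and their weights below are illustrative assumptions, not values from this work; real CRF implementations compute Z(X) with dynamic programming rather than by enumerating every tag sequence.

```python
import itertools
import math

# Toy example: every name and number here is an illustrative assumption.
LABELS = ["NN", "VBF"]
X = ["tok1", "tok2", "tok3"]  # placeholder observation sequence


def f_first_is_noun(y_i, y_prev, X, i):
    """Feature function f_1: fires when the first token is tagged NN."""
    return 1.0 if i == 0 and y_i == "NN" else 0.0


def f_noun_then_verb(y_i, y_prev, X, i):
    """Feature function f_2: fires on an NN -> VBF label transition."""
    return 1.0 if y_prev == "NN" and y_i == "VBF" else 0.0


FEATURES = [f_first_is_noun, f_noun_then_verb]
WEIGHTS = [1.5, 0.8]  # lambda_1, lambda_2


def score(Y, X):
    """Double sum inside exp(...) of Eq. (1)."""
    total = 0.0
    for i, y_i in enumerate(Y):
        y_prev = Y[i - 1] if i > 0 else None
        for lam, f in zip(WEIGHTS, FEATURES):
            total += lam * f(y_i, y_prev, X, i)
    return total


def partition(X):
    """Z(X): sum of exp(score) over every possible tag sequence."""
    candidates = itertools.product(LABELS, repeat=len(X))
    return sum(math.exp(score(Y, X)) for Y in candidates)


def conditional_probability(Y, X):
    """P(Y|X) as defined in Eq. (1)."""
    return math.exp(score(Y, X)) / partition(X)


all_sequences = list(itertools.product(LABELS, repeat=len(X)))
probabilities = {Y: conditional_probability(Y, X) for Y in all_sequences}
```

Because Z(X) normalizes over all candidate sequences, the values in `probabilities` sum to one, and tagging amounts to picking the sequence with the largest value.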
The application of the CRF model has been demonstrated for the Urdu part-of-speech tagging in the research community.
In this research work, the CRF model has been trained and tested on the MM-POST (Mushtaq & Muzammil POS Tagged) dataset, using the word level language-independent features of the current word/token as context window together with the Word Length of the current word.
The dataset
The MM-POST dataset has been used for training and testing of the CRF model for Urdu POS tagging. The dataset contains POS-labeled data from seven different news domains of the Urdu language including Entertainment, Finance, General, Health, Politics, Science, and Sports with 119,276 total tokens for 2,871 sentences (Ali & Khan, 2024a) as shown in Table 4. The number and percentage shares of POS tags of different news domains in the MM-POST dataset are graphically shown in Fig. 1.
Domain | Sentences | Tokens |
---|---|---|
Entertainment | 459 | 19,792 |
Finance | 351 | 13,377 |
General | 389 | 15,035 |
Health | 430 | 16,084 |
Politics | 579 | 27,409 |
Science | 388 | 16,727 |
Sports | 275 | 10,852 |
Total | 2,871 | 119,276 |
Figure 1: Domain-wise distribution of POS tags in the MM-POST dataset.
Tokenization was already done in our previously developed MM-POST dataset, so the tokenized lexical words with their corresponding POS tags are readily available for use. The tokenization of well-structured news data resulted in well-edited, consistent and useful tokens that can be used for training and testing machine learning models for sophisticated Urdu NLP tasks. Our proposed CRF model for Urdu POS tagging was trained and tested using the tokenized data of the MM-POST. The necessary preprocessing for normalization of the dataset was also already done by removing extra spaces and unnecessary characters from individual words and manually correcting inconsistent or incorrect words, which removes the need for separate tokenization and preprocessing of the data. The dataset has been used in its original token order to determine the contextual window and word length of every lexical word in the corpus. The word-level contextual window of every current lexical word or token comprises the immediately preceding token, the first successive token and the second successive token.
For tagging the different grammatical categories in the MM-POST dataset, 33 POS tags of the CLE tagset, as provided in Table 2, have been used. The CLE tagset contains 12 main linguistic categories and 35 subcategories, and these 35 subcategories form the set of POS tags. However, two of them, common particle (POS tag: PRT) and common symbol (POS tag: SYM), do not occur in the news articles of the MM-POST dataset, which therefore has occurrences of 33 POS tags. The POS tag-wise frequency distribution of the MM-POST dataset is given in Table 5.
POS tag | Frequency | POS tag | Frequency | POS tag | Frequency |
---|---|---|---|---|---|
NN | 32,678 | AUXA | 2,325 | OD | 488 |
PSP | 21,975 | VBI | 2,041 | VALA | 470 |
VBF | 10,200 | CD | 1,930 | SCK | 367 |
NNP | 8,622 | Q | 1,560 | PRS | 356 |
PU | 7,052 | RB | 1,080 | PRD | 111 |
JJ | 6,122 | NEG | 1,071 | PRF | 85 |
PRP | 4,804 | PRR | 857 | INJ | 44 |
AUXT | 4,370 | SCP | 635 | FR | 42 |
CC | 3,042 | APNA | 629 | PRE | 38 |
SC | 2,620 | AUXP | 551 | QM | 20 |
PDM | 2,534 | AUXM | 544 | FF | 13 |
Total | 119,276 | | | | |
The actual instances of text for Urdu POS tags occurring in the MM-POST dataset are given as illustrative examples in Fig. 2.
Figure 2: Illustrative examples of POS tags in the MM-POST dataset.
Features
The input features used for training and testing of the model are described as follows:
Feature description:
1. Word: the current word/token
2. PrevWord: previous word of the current word/token
3. NextWord: next word of the current word/token
4. Next2Word: second next word of the current word/token
5. WordLength: length of the current word/token
The “Word”, i.e., the current word or token, and its context window comprising “PrevWord” (previous word), “NextWord” (next word) and “Next2Word” (second next word) are used together with the “WordLength” of the current word as input features for learning the pattern and predicting the POS tag of every word/token of the corpus.
The “Word” feature is the key feature: it contains all the tokens, both words and non-word tokens, for which the model has to learn the pattern and predict the corresponding POS tag. “PrevWord” (Previous Word) is the token immediately preceding the current token. “NextWord” (Next Word) is the token immediately after the current token, and “Next2Word” is the second token after it. Similarly, “WordLength” (Word Length) is the length of the current token in number of characters.
The use of “Word”, “Previous Word”, “Next Word” and “Next2 Word” provides an effective utilization of these features as the context-window for every current word/token, in the prediction of POS tag for every current “Word” or token. One previous adjacent token (i.e., the previous lexical word or token) and two next adjacent tokens (i.e., next and second next lexical word or token) of the current word/token are considered to serve as the context window together with the Word Length of the current word/token in characterizing the target POS tag for every current word/token. The format of the training file is shown in Fig. 3.
Figure 3: CRF POS tagging training-file example.
The names of the input features i.e., “Word”, “PrevWord”, “NextWord”, and “Next2Word” contain the term “word” to keep them descriptive about the data they contain i.e., the lexical words or tokens, although other tokens like numbers, punctuation, and special characters also exist. For example, the “Word” feature indicates the current word for which the POS tag is to be predicted but the individual values of this feature may also include other tokens like numbers and punctuation in addition to the Urdu textual words.
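The five features can be derived from a token list with a few lines of code. The sketch below is a hypothetical helper, not the authors' implementation: the function names, the empty-string padding at sentence boundaries and the short example sentence are assumptions, since the text does not state how tokens at the start or end of a sequence are handled.

```python
def token_features(tokens, i, pad=""):
    """Return the five input features for tokens[i].

    `pad` fills context positions that fall outside the token list;
    this boundary handling is an assumption.
    """
    def at(k):
        return tokens[k] if 0 <= k < len(tokens) else pad

    return {
        "Word": tokens[i],
        "PrevWord": at(i - 1),
        "NextWord": at(i + 1),
        "Next2Word": at(i + 2),
        "WordLength": len(tokens[i]),  # length in characters
    }


def sentence_features(tokens):
    """Build one feature row per token, in the original token order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Illustrative five-token Urdu sentence (not taken from MM-POST).
tokens = ["یہ", "ایک", "کتاب", "ہے", "۔"]
rows = sentence_features(tokens)
```

Each entry of `rows` corresponds to one row of the training file, to which the POS label is added as the target column.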
Different experiments were made with the inclusion and exclusion of various words in both preceding and succeeding positions of the context window. Increasing or decreasing the size of the context window, or selecting words other than the presently selected features, did not improve the results and even degraded performance.
Experiment
The CRF model with Python-CRFsuite has been used for learning the pattern of the input features in predicting the POS tags in the MM-POST Urdu corpus. Python-CRFsuite provides an interface to the CRFsuite which is a CRF library implemented in C++. It allows the use of CRF model functionality within the Python scripts. CRF is well suited for sequence labeling tasks like POS tagging that involve sequential nature, context dependency, and labeling of each word. The input variables used are “Word”, “PrevWord”, “NextWord”, “Next2Word” and “WordLength” for learning and prediction of the labels of the target variable “POS” i.e., the Part of Speech tag.
The MM-POST dataset was split into 80% Training portion and 20% Testing. The part of the dataset considered for the training portion incorporated 95,420 tokens and the testing portion contained 23,855 tokens. The data of the dataset is already in a tokenized format of words/tokens, therefore there is no need for tokenization of text. The dataset was imported from a Microsoft Excel file. Each column of the file represents a feature and each row of the file describes a record to be input to the model. The five columns on the left side are input features whereas the rightmost sixth column is the target variable of the training file. However, in the testing phase, the target variable is not included in the input of the model and the model performs prediction of labels based on the learning achieved during its training.
The CRF model has been trained with “lbfgs” as the optimization algorithm. The maximum number of iterations is kept at 100, and all possible label transitions are included. The setting of 100 iterations was chosen after empirical testing of convergence speed and memory efficiency: at 100 iterations the model showed a balance between accuracy and overfitting, whereas higher values only added unnecessary computational overhead. The model is trained on the training data to learn the patterns and relationships among the input features for predicting the 33 POS tags that exist as labels in the “POS” target variable.
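The training configuration described above can be sketched with python-crfsuite as follows. This is a minimal sketch under stated assumptions rather than the authors' script: the helper names, the encoding of each feature as a "name=value" string and the treatment of the data as a single token sequence are hypothetical, while `Trainer`, `set_params`, `append`, `train`, `Tagger.open` and `Tagger.tag` are the library's own calls.

```python
# Settings stated in the text: L-BFGS optimizer, at most 100 iterations,
# and all possible label transitions included.
CRF_ALGORITHM = "lbfgs"
CRF_PARAMS = {
    "max_iterations": 100,
    "feature.possible_transitions": True,
}


def to_crfsuite_item(row):
    """Encode one feature dict as "name=value" strings (assumed encoding)."""
    return [f"{name}={value}" for name, value in row.items()]


def train_crf(feature_rows, tags, model_path):
    """Train on one sequence of feature rows and save the model to disk."""
    import pycrfsuite  # third-party package: python-crfsuite

    trainer = pycrfsuite.Trainer(algorithm=CRF_ALGORITHM, verbose=False)
    trainer.set_params(CRF_PARAMS)
    trainer.append([to_crfsuite_item(r) for r in feature_rows], tags)
    trainer.train(model_path)


def predict_tags(feature_rows, model_path):
    """Load a saved model and predict one POS tag per feature row."""
    import pycrfsuite  # third-party package: python-crfsuite

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    return tagger.tag([to_crfsuite_item(r) for r in feature_rows])
```

The library import is placed inside the functions so that the configuration can be inspected without python-crfsuite installed; in a full script it would normally sit at the top of the file.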
Evaluation of results
Performance metrics
The performance of the CRF model has been measured using precision, recall, F1-measure, and accuracy. Table 6 shows the values of these evaluation metrics together with the support values for all target labels, i.e., the Urdu POS tags. Since the dataset is split into 80% training and 20% testing portions, the support value of a POS tag is the number of samples of that tag (class label) in the testing portion. For frequent labels this is close to 20% of the label's total instances, while for rare labels the share can deviate (e.g., FF has 7 of its 13 instances in the testing portion). For example, the total frequency of proper noun (NNP) in the dataset is 8,622, of which 6,900 tokens are included in the training data and 1,722 tokens in the testing data. Thus, the support value for the class label NNP is 1,722. The overall support for the dataset of 119,276 tokens is 23,855.
POS | Precision | Recall | F1-score | Support |
---|---|---|---|---|
APNA | 1 | 0.99 | 1 | 120 |
AUXA | 0.95 | 0.95 | 0.95 | 459 |
AUXM | 0.99 | 0.94 | 0.96 | 113 |
AUXP | 0.93 | 0.94 | 0.93 | 95 |
AUXT | 0.98 | 0.97 | 0.98 | 868 |
CC | 0.98 | 0.99 | 0.99 | 643 |
CD | 0.97 | 0.97 | 0.97 | 359 |
FF | 1 | 0 | 0 | 7 |
FR | 0.88 | 0.7 | 0.78 | 10 |
INJ | 1 | 0 | 0 | 8 |
JJ | 0.97 | 0.89 | 0.93 | 1,241 |
NEG | 1 | 0.99 | 0.99 | 226 |
NN | 0.92 | 0.98 | 0.95 | 6,550 |
NNP | 0.94 | 0.86 | 0.9 | 1,722 |
OD | 1 | 0.89 | 0.94 | 105 |
PDM | 0.98 | 0.98 | 0.98 | 485 |
PRD | 1 | 0.76 | 0.86 | 21 |
PRE | 1 | 0.71 | 0.83 | 7 |
PRF | 1 | 1 | 1 | 23 |
PRP | 0.99 | 0.97 | 0.98 | 984 |
PRR | 0.97 | 0.98 | 0.97 | 174 |
PRS | 1 | 0.98 | 0.99 | 81 |
PSP | 0.99 | 1 | 0.99 | 4,320 |
PU | 1 | 1 | 1 | 1,444 |
Q | 1 | 0.98 | 0.99 | 298 |
QM | 1 | 0.33 | 0.5 | 3 |
RB | 0.96 | 0.91 | 0.93 | 211 |
SC | 1 | 0.99 | 0.99 | 504 |
SCK | 0.9 | 0.93 | 0.92 | 71 |
SCP | 0.97 | 0.85 | 0.91 | 123 |
VALA | 0.99 | 1 | 1 | 109 |
VBF | 0.94 | 0.93 | 0.94 | 2,073 |
VBI | 0.98 | 0.92 | 0.95 | 398 |
Accuracy | | | 0.96 | 23,855 |
Macro avg | 0.98 | 0.86 | 0.88 | 23,855 |
Weighted avg | 0.96 | 0.96 | 0.96 | 23,855 |
A brief description and formulas of the evaluation metrics are given as below:
Precision
Precision measures how correctly the model tags the words: for a given POS tag, it is the ratio of the words correctly assigned that tag to all the words the model assigned that tag. It is particularly helpful in understanding the assignment of POS tags to frequent words, or when a large number of words have been tagged incorrectly. High precision means few false positive predictions.
Recall
Recall reflects how many of the instances of a particular POS class/tag the model captures. It is the ratio of the words correctly predicted with a tag to all the words that actually carry that tag. High recall means few false negatives and reflects the model’s ability to correctly predict most instances of a class.
F1-score
F1-score combines precision and recall, offering a unified metric for performance. F1 is the key metric when both the false positives and false negatives are important, or POS tags are unevenly distributed. High F1-score indicates that the model is accurately predicting tags for the words and identifying all instances of a POS class/tag.
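In terms of the per-tag counts of true positives (TP), false positives (FP) and false negatives (FN), the three per-class metrics described above take their standard forms. The notation below is the conventional one and is given here as an aid to the reader:

$$\text{Precision}=\frac{TP}{TP+FP},\qquad \text{Recall}=\frac{TP}{TP+FN},\qquad F_1=\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$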
Accuracy
Accuracy reflects the model’s overall ability to correctly tag or assign labels to words across all POS categories. It is a good metric for knowing the percentage correct prediction of a model but only in balanced datasets because for unbalanced data, the accuracy can be less informative.
$$\text{Accuracy}=\frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{5}$$
Results analysis
The bottom rows of Table 6 show the values for accuracy, macro average, and weighted average. Accuracy means the overall aggregated number of correct predictions of POS tags per total number of predictions of POS tags. The overall accuracy achieved for the CRF model is reported as 96.1%.
The macro average metrics evaluate the model performance across all classes, treating them equally: they are the simple average of the results of all classes, giving equal weight to each class regardless of its size in the actual data. The macro average values for precision, recall, and F1-score of our CRF model are 0.98, 0.86, and 0.88 respectively. The weighted average metrics, in contrast, reflect the class distribution by giving more importance to larger classes. The disparity between the high macro precision (0.98) and the lower macro recall (0.86) indicates that the model has been successful in reducing false positives: when it predicts a tag, the prediction is correct most of the time, and only in very few cases is a word assigned an incorrect tag. The recall of 0.86 means that the model does not avoid false negatives in some cases, i.e., for certain POS classes it fails to recognize the correct tag of the words. The lower recall can be improved by training the model on a sufficiently large number of instances of the rare class labels or POS tags. The patterns of the currently misclassified instances would then be learned properly and the performance further improved.
The weighted average is obtained by weighting the result of every class by its support value and dividing the sum of the weighted values by the total support. The weighted average values of our CRF model for precision, recall, and F1-measure are each 0.96, equal to the overall accuracy. This means that when larger classes are given more influence, recall and F1-score improve substantially, as can be seen in the rise of the F1-score from 0.88 (macro average) to 0.96 (weighted average). The higher weighted averages for recall and F1-score compared with the macro averages highlight the need for a sufficiently large number of occurrences of all label classes; increasing the number of samples of the low-frequency classes should improve the effectiveness of the model.
Out of the total 33 POS tags, 26 (i.e., 79% of labels) have an F1-score of more than 0.90, which is encouraging regarding the efficiency of the model. Four POS tags (NNP, PRD, PRE and FR) have an F1-score between 0.78 and 0.90, one POS tag (QM) has an F1-score of 0.50, and two (FF and INJ) have a zero F1-score. The zero F1-score of FF and INJ is due to their low support values of only seven and eight respectively in the testing portion. The “FF” tag has been confused six times with “NN” and once with “PRP”, as shown in the Confusion Matrix of Fig. 4. The “INJ” tag has been confused four times with “NN”, two times with “NNP”, once with “CD” and once with “PU”. Thus, the rare occurrence of both “FF” and “INJ” left the model without the contextual understanding required for tagging them. Similarly, the QM POS tag has a support value of only three.
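The difference between the two averages can be illustrated with three rows of Table 6. The sketch below uses only the F1-scores and support values of NN, PU and FF, so its results describe this three-tag subset and not the full 33-tag table; the function names are hypothetical.

```python
def macro_average(scores):
    """Simple mean: every class counts equally."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean weighted by support: larger classes count more."""
    weighted_sum = sum(s * n for s, n in zip(scores, supports))
    return weighted_sum / sum(supports)


# F1-score and support of NN, PU and FF as listed in Table 6.
f1_scores = [0.95, 1.00, 0.00]
supports = [6550, 1444, 7]

macro_f1 = macro_average(f1_scores)
weighted_f1 = weighted_average(f1_scores, supports)
```

For this subset the macro average is 0.65 while the weighted average is about 0.96, because the zero score of FF counts for one third of the macro average but for only 7 of 8,001 samples in the weighted one.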
Figure 4: Confusion matrix—CRF Urdu POS tagging.
Most of the POS tags have been correctly predicted to their corresponding true tags as demonstrated in the Confusion Matrix of Fig. 4. The Confusion Matrix reflects that the “NN” tag has been confused the most with “NNP” & “VBF”. The “NNP” has been wrongly predicted the most as “NN” and in a few cases as “JJ”. The “VBF” has been confused the most with “NN” & “AUXA”, the “PSP” is confused in a few cases with “NN” & “VBF” and the label “JJ” is confused the most with “NN”, “CD” & “VBF”.
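A confusion matrix of this kind can be tallied directly from the gold and predicted tag lists. The sketch below is illustrative only: the short tag lists are made up to show the mechanics and do not reproduce the counts of Fig. 4, and the helper names are hypothetical.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (true tag, predicted tag) pairs."""
    return Counter(zip(gold, predicted))


def most_confused(counts, tag, k=2):
    """Return the k tags most often predicted in place of `tag`."""
    errors = {p: n for (g, p), n in counts.items() if g == tag and p != tag}
    return sorted(errors.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up example data, not taken from the MM-POST test set.
gold = ["NN", "NN", "NNP", "NNP", "NNP", "JJ", "FF"]
predicted = ["NN", "NNP", "NN", "NN", "NNP", "NN", "NN"]

counts = confusion_counts(gold, predicted)
nnp_errors = most_confused(counts, "NNP")
```

Entries with identical true and predicted tags form the diagonal of the matrix, and all other entries are the confusions discussed above.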
The numbers of correct predictions, for both positive and negative cases, are high for most of the labels, as is evident from the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) reported for each POS tag of the dataset in Table 7.
POS | TP | FP | TN | FN | Support |
---|---|---|---|---|---|
APNA | 119 | – | 23,735 | 1 | 120 |
AUXA | 434 | 22 | 23,374 | 25 | 459 |
AUXM | 106 | 1 | 23,741 | 7 | 113 |
AUXP | 89 | 7 | 23,753 | 6 | 95 |
AUXT | 842 | 14 | 22,973 | 26 | 868 |
CC | 639 | 13 | 23,199 | 4 | 643 |
CD | 347 | 12 | 23,484 | 12 | 359 |
FF | – | – | 23,848 | 7 | 7 |
FR | 7 | 1 | 23,844 | 3 | 10 |
INJ | – | – | 23,847 | 8 | 8 |
JJ | 1,104 | 29 | 22,585 | 137 | 1,241 |
NEG | 223 | – | 23,629 | 3 | 226 |
NN | 6,420 | 524 | 16,781 | 130 | 6,550 |
NNP | 1,484 | 93 | 22,040 | 238 | 1,722 |
OD | 93 | – | 23,750 | 12 | 105 |
PDM | 474 | 12 | 23,358 | 11 | 485 |
PRD | 16 | – | 23,834 | 5 | 21 |
PRE | 5 | – | 23,848 | 2 | 7 |
PRF | 23 | – | 23,832 | – | 23 |
PRP | 954 | 8 | 22,863 | 30 | 984 |
PRR | 170 | 5 | 23,676 | 4 | 174 |
PRS | 79 | – | 23,774 | 2 | 81 |
PSP | 4,299 | 29 | 19,506 | 21 | 4,320 |
PU | 1,441 | 2 | 22,409 | 3 | 1,444 |
Q | 293 | – | 23,557 | 5 | 298 |
QM | 1 | – | 23,852 | 2 | 3 |
RB | 191 | 7 | 23,637 | 20 | 211 |
SC | 497 | 2 | 23,349 | 7 | 504 |
SCK | 66 | 7 | 23,777 | 5 | 71 |
SCP | 105 | 3 | 23,729 | 18 | 123 |
VALA | 109 | 1 | 23,745 | – | 109 |
VBF | 1,935 | 126 | 21,656 | 138 | 2,073 |
VBI | 365 | 7 | 23,450 | 33 | 398 |
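The counts in Table 7 are sufficient to recompute the headline metrics. The following is a minimal sketch, not the authors' evaluation code. It assumes that “–” entries stand for zero and that a score whose denominator is zero is reported as 0. The names `TABLE_7` and `prf` are introduced here for illustration.

```python
# (tag, TP, FP, FN) copied from Table 7; "–" entries are read as 0 (assumption).
TABLE_7 = [
    ("APNA", 119, 0, 1), ("AUXA", 434, 22, 25), ("AUXM", 106, 1, 7),
    ("AUXP", 89, 7, 6), ("AUXT", 842, 14, 26), ("CC", 639, 13, 4),
    ("CD", 347, 12, 12), ("FF", 0, 0, 7), ("FR", 7, 1, 3),
    ("INJ", 0, 0, 8), ("JJ", 1104, 29, 137), ("NEG", 223, 0, 3),
    ("NN", 6420, 524, 130), ("NNP", 1484, 93, 238), ("OD", 93, 0, 12),
    ("PDM", 474, 12, 11), ("PRD", 16, 0, 5), ("PRE", 5, 0, 2),
    ("PRF", 23, 0, 0), ("PRP", 954, 8, 30), ("PRR", 170, 5, 4),
    ("PRS", 79, 0, 2), ("PSP", 4299, 29, 21), ("PU", 1441, 2, 3),
    ("Q", 293, 0, 5), ("QM", 1, 0, 2), ("RB", 191, 7, 20),
    ("SC", 497, 2, 7), ("SCK", 66, 7, 5), ("SCP", 105, 3, 18),
    ("VALA", 109, 1, 0), ("VBF", 1935, 126, 138), ("VBI", 365, 7, 33),
]


def prf(tp, fp, fn):
    """Return (precision, recall, F1); undefined ratios are set to 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Support of a tag is the number of tokens that truly carry it: TP + FN.
support = {tag: tp + fn for tag, tp, fp, fn in TABLE_7}
f1_by_tag = {tag: prf(tp, fp, fn)[2] for tag, tp, fp, fn in TABLE_7}

n_tokens = sum(support.values())
accuracy = sum(tp for _, tp, _, _ in TABLE_7) / n_tokens
macro_f1 = sum(f1_by_tag.values()) / len(f1_by_tag)
weighted_f1 = sum(f1_by_tag[t] * support[t] for t in support) / n_tokens

print(f"tokens={n_tokens} accuracy={accuracy:.3f} "
      f"macro F1={macro_f1:.2f} weighted F1={weighted_f1:.2f}")
```

Run on the table as printed, the sketch gives an accuracy of 0.961, a macro-average F1 of 0.88 and a weighted-average F1 of 0.96, in line with the values reported in the text.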
Generalization on external data
To evaluate the generalization capability of our trained model, an unseen news article of 1,161 tokens that is not part of the MM-POST dataset was used as external validation data. The data was converted into the format expected by the trained model, the saved model was loaded, and the POS labels were predicted with it. Of the 1,161 tokens, the model labelled 1,134 (97.7%) correctly and 27 (2.3%) incorrectly. The overall accuracy of 97.7% on the external validation data is highly encouraging. The average accuracy per individual POS tag is 83%, which is lower than the average achieved by the model on the training dataset (i.e., 88%). This is due to the far smaller number of instances of individual labels in the external validation data than in the training dataset, in addition to the model's failure to correctly predict unseen distinctive instances.
Interestingly, incorrect predictions occurred for only 10 of the 28 POS tags that appear in the external validation data, whereas every instance of the remaining 18 tags was predicted correctly. Six of these 10 tags account for 23 of the incorrect predictions and the remaining four have one incorrect prediction each. The label-wise accuracy achieved by the trained model on the external validation data is given in Table 8.
POS tag | Correct prediction | Incorrect prediction | Total | Accuracy (%) |
---|---|---|---|---|
NN | 325 | 5 | 330 | 98.5 |
PSP | 202 | – | 202 | 100 |
VBF | 104 | 2 | 106 | 98.1 |
PU | 62 | – | 62 | 100 |
PRP | 61 | 4 | 65 | 93.8 |
JJ | 57 | 7 | 64 | 89.1 |
AUXT | 49 | 1 | 50 | 98.0 |
PDM | 36 | 1 | 37 | 97.3 |
NNP | 33 | 2 | 35 | 94.3 |
CC | 30 | – | 30 | 100 |
SC | 26 | 1 | 27 | 96.3 |
VBI | 26 | 3 | 29 | 89.7 |
AUXA | 18 | – | 18 | 100 |
NEG | 16 | – | 16 | 100 |
PRR | 14 | – | 14 | 100 |
Q | 11 | – | 11 | 100 |
AUXP | 10 | – | 10 | 100 |
CD | 10 | – | 10 | 100 |
RB | 9 | – | 9 | 100 |
SCK | 9 | – | 9 | 100 |
PRF | 5 | – | 5 | 100 |
OD | 5 | 1 | 6 | 83.3 |
VALA | 5 | – | 5 | 100 |
PRD | 4 | – | 4 | 100 |
PRS | 3 | – | 3 | 100 |
APNA | 2 | – | 2 | 100 |
SCP | 1 | – | 1 | 100 |
AUXM | 1 | – | 1 | 100 |
Total | 1,134 | 27 | 1,161 | 97.7 |
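The difference between the overall accuracy and the per-tag average can be checked directly from the counts in Table 8. The sketch below is illustrative only. The text does not state how the 83% per-tag average was computed; one reading that reproduces it, treated here purely as an assumption, is to average over all 33 tags of the tagset and count the five tags that do not occur in the article as zero. The names `TABLE_8` and `TAGSET_SIZE` are introduced for this example.

```python
# tag -> (correct, incorrect) copied from Table 8; a missing "incorrect" is 0.
TABLE_8 = {
    "NN": (325, 5), "PSP": (202, 0), "VBF": (104, 2), "PU": (62, 0),
    "PRP": (61, 4), "JJ": (57, 7), "AUXT": (49, 1), "PDM": (36, 1),
    "NNP": (33, 2), "CC": (30, 0), "SC": (26, 1), "VBI": (26, 3),
    "AUXA": (18, 0), "NEG": (16, 0), "PRR": (14, 0), "Q": (11, 0),
    "AUXP": (10, 0), "CD": (10, 0), "RB": (9, 0), "SCK": (9, 0),
    "PRF": (5, 0), "OD": (5, 1), "VALA": (5, 0), "PRD": (4, 0),
    "PRS": (3, 0), "APNA": (2, 0), "SCP": (1, 0), "AUXM": (1, 0),
}
TAGSET_SIZE = 33  # size of the tagset used for MM-POST

correct = sum(c for c, _ in TABLE_8.values())
total = sum(c + w for c, w in TABLE_8.values())

# Overall (token-level) accuracy: every token counts equally.
overall_accuracy = correct / total

# Per-tag accuracies, then two ways of averaging them.
per_tag = {tag: c / (c + w) for tag, (c, w) in TABLE_8.items()}
mean_over_present_tags = sum(per_tag.values()) / len(per_tag)
# Assumption: tags absent from the article contribute an accuracy of 0.
mean_over_full_tagset = sum(per_tag.values()) / TAGSET_SIZE

print(f"overall={overall_accuracy:.3f} "
      f"present-tag mean={mean_over_present_tags:.3f} "
      f"full-tagset mean={mean_over_full_tagset:.3f}")
```

Under that assumption the full-tagset mean is about 0.83, while the mean over the 28 tags that actually occur is about 0.98. On this reading, the lower per-tag figure reflects the absence of some tags from the small external sample more than mistakes on the tags that are present.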
Thus the strong performance of the trained CRF model on both the training dataset and the unseen external validation data demonstrates the model's ability to generalize to out-of-sample data and indicates that it scales to unseen data.
Implementation of proposed CRF model using Urdu universal dependency treebank
The proposed model was also implemented and evaluated using POS-tagged data from the Urdu Universal Dependency Treebank (UDTB). The UDTB was developed at IIIT Hyderabad, India, by automatic conversion from the Urdu Dependency Treebank (Bhat et al., 2017). The data, containing 14,604 Urdu words tagged with 17 POS tags and downloaded from Bhat & Zeman (2024), was used for training and testing our proposed CRF-based supervised POS classifier. For ease of comparison, compound Urdu words were broken into single words and a few POS tags of the UDTB data were renamed to their counterparts in the CLE Tagset used for tagging the MM-POST dataset.
The model achieved an accuracy of 89.6% on the UDTB dataset, compared with 96.1% on the MM-POST dataset. The UDTB dataset has roughly eight times fewer POS-tagged tokens than the MM-POST dataset, yet the accuracy of the model dropped by only about 6.5 percentage points. This indicates the scalability and generalizability of our CRF-based model for Urdu POS tagging on a smaller dataset that was generated from different genres and annotated with a tagset different from the CLE tagset that we primarily used. The evaluation metrics given in Table 9 show the POS tag-wise results of the CRF model on the POS-tagged Urdu UDTB data.
POS | Precision | Recall | F1-score | Support |
---|---|---|---|---|
AUXT | 0.33 | 0.50 | 0.40 | 2 |
CC | 0.98 | 0.99 | 0.98 | 136 |
JJ | 0.81 | 0.79 | 0.80 | 261 |
NEG | 1.00 | 1.00 | 1.00 | 10 |
NN | 0.84 | 0.93 | 0.88 | 785 |
NNP | 0.82 | 0.73 | 0.78 | 278 |
PDM | 0.95 | 0.80 | 0.87 | 50 |
PRP | 0.92 | 0.86 | 0.89 | 102 |
PSP | 0.98 | 0.98 | 0.98 | 586 |
Q | 0.94 | 0.82 | 0.88 | 102 |
RB | 1.00 | 0.23 | 0.38 | 13 |
RP | 1.00 | 0.84 | 0.91 | 37 |
SYM | 1.00 | 1.00 | 1.00 | 111 |
VAUX | 0.91 | 0.90 | 0.91 | 186 |
VBF | 1.00 | 0.50 | 0.67 | 4 |
VM | 0.91 | 0.88 | 0.90 | 249 |
Punct | 1.00 | 1.00 | 1.00 | 12 |
Accuracy | | | 0.90 | 2,924 |
Macro avg | 0.91 | 0.81 | 0.84 | 2,924 |
Weighted avg | 0.90 | 0.90 | 0.89 | 2,924 |
SVM implementation and comparison with CRF-based Urdu POS tagging
The SVM model has been widely used in the research community for Urdu POS tagging. We therefore implemented our approach with an SVM model, using the same feature set (Word, PrevWord, NextWord, Next2Word, WL) and the MM-POST dataset. The SVM model achieved an accuracy of only 68%, far below the 96% achieved by the CRF. The results given in Table 10 show that only three of the 33 POS tags (i.e., AUXT, PSP and PU) have an F1-score of 0.80 or above; for all other tags the SVM tags far less reliably. The analysis indicates that the SVM could not properly learn and classify most of the POS tags because it did not successfully model the context-window information of the lexical words and their corresponding POS tags. The CRF-based model is thus the better-performing and more suitable choice for the sequence labelling task of Urdu POS tagging.
POS tag | Precision | Recall | F1-score | Support |
---|---|---|---|---|
APNA | 0.47 | 0.49 | 0.48 | 122 |
AUXA | 0.66 | 0.71 | 0.69 | 444 |
AUXM | 0.69 | 0.73 | 0.71 | 105 |
AUXP | 0.75 | 0.82 | 0.78 | 104 |
AUXT | 0.76 | 0.85 | 0.80 | 860 |
CC | 0.59 | 0.66 | 0.62 | 655 |
CD | 0.49 | 0.57 | 0.53 | 389 |
FF | 1.00 | 0.50 | 0.67 | 2 |
FR | 0.45 | 0.56 | 0.50 | 9 |
INJ | 0.33 | 0.27 | 0.30 | 11 |
JJ | 0.42 | 0.34 | 0.38 | 1,229 |
NEG | 0.50 | 0.59 | 0.54 | 215 |
NN | 0.67 | 0.72 | 0.69 | 6,621 |
NNP | 0.72 | 0.66 | 0.69 | 1,722 |
OD | 0.39 | 0.46 | 0.42 | 93 |
PDM | 0.45 | 0.40 | 0.42 | 498 |
PRD | 0.24 | 0.31 | 0.27 | 16 |
PRE | 0.00 | 0.00 | 0.00 | 4 |
PRF | 0.56 | 0.50 | 0.53 | 18 |
PRP | 0.55 | 0.46 | 0.50 | 972 |
PRR | 0.53 | 0.57 | 0.55 | 147 |
PRS | 0.27 | 0.22 | 0.24 | 78 |
PSP | 0.78 | 0.84 | 0.81 | 4,361 |
PU | 0.99 | 0.99 | 0.99 | 1,450 |
Q | 0.43 | 0.34 | 0.38 | 307 |
QM | 0.00 | 0.00 | 0.00 | 5 |
RB | 0.40 | 0.29 | 0.34 | 218 |
SC | 0.71 | 0.75 | 0.73 | 515 |
SCK | 0.58 | 0.53 | 0.56 | 75 |
SCP | 0.46 | 0.32 | 0.38 | 128 |
VALA | 0.43 | 0.45 | 0.44 | 88 |
VBF | 0.66 | 0.53 | 0.59 | 2,019 |
VBI | 0.62 | 0.53 | 0.57 | 375 |
Accuracy | | | 0.68 | 23,855 |
Macro avg | 0.53 | 0.51 | 0.52 | 23,855 |
Weighted avg | 0.67 | 0.68 | 0.68 | 23,855 |
Comparison with benchmark approaches
Our CRF implementation achieved an accuracy of 96.1%, which is higher than the CRF accuracies of 88.74% reported by Khan et al. (2019b), 83.52% on the CLE dataset and 88.4% on the Bushra Jawaid dataset reported by Khan et al. (2019a), and 95.8% on the Bushra Jawaid dataset reported by Nasim, Abidi & Haider (2020). However, the accuracy achieved by Nasim, Abidi & Haider (2020) using the CRF together with BiLSTM (i.e., 96.3%) is marginally higher than that of our approach, by 0.2 percentage points.
The benchmark approaches used large feature sets, which makes them more complex to understand and computationally less efficient, as detailed below:
Khan et al. (2019b) used language-dependent (i.e., POS tag of the previous word and suffix of the current word) and language-independent features (i.e., context words window). They used ten unigram templates for feature set generation. Their features set included “Previous Lexical Word”, “Current Lexical Word”, “Next Lexical Word”, “Current Lexical Word + Previous Lexical Word”, “Current Lexical Word + Next Lexical Word”, “Current Lexical Word + N-1 and N-2 Previous Words”, “Current Lexical Word + N+1 and N+2 Next Words”, “Part of Speech tag of Previous Lexical Word”, “Suffix of Current Lexical Word” and “Length of Current Lexical Word”.
Khan et al. (2019a) used context word features including (1) the token (the current word); (2) the word to the left of the current word; (3) the word to the right of the current word; (4) joint use of the current word and the word to its left; (5) joint use of the current word and the word to its right; (6) joint use of the current word and the N-1 and N-2 left words; and (7) joint use of the current word and the N+1 and N+2 right words.
Nasim, Abidi & Haider (2020) utilized the features including Word, Length, Is_First, Is_Last, Suffix, Prev_Word_1, Prev_Word_2 and Next_Word of the current word.
Thus the use of fewer and simpler language-independent features of Urdu text, combined with its strong performance, allows our CRF-based Urdu POS tagging approach to surpass the previous benchmark CRF approaches. Figure 5 compares the performance of our work with that of other researchers' approaches for Urdu POS tagging. Our approach outperforms seven of the nine compared approaches, whereas two achieve slightly higher accuracy.
Figure 5: Comparison of our CRF POS tagging approach with other approaches.
In contrast to previous works, our Urdu POS tagging approach benefits from a small feature set comprising only five features: four are lexical word-based and one is the word length. These few word-based features and the word length are simple to determine and are language-independent, which gives our Urdu POS tagging approach the potential for scaling, generalization, and adaptability.
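How small this feature set is can be seen in a minimal sketch of the extraction step. This is an illustration under stated assumptions, not the authors' code. The feature names Word, PrevWord, NextWord, Next2Word and WL follow the text. The padding markers `<BOS>` and `<EOS>`, the function names, and the choice to store the word length as an integer are hypothetical.

```python
from typing import Dict, List, Union

BOS = "<BOS>"  # hypothetical marker used when there is no previous word
EOS = "<EOS>"  # hypothetical marker used when there is no following word

Feature = Dict[str, Union[str, int]]


def token_features(sentence: List[str], i: int) -> Feature:
    """Build the five features for the token at position i."""
    n = len(sentence)
    word = sentence[i]
    return {
        "Word": word,
        "PrevWord": sentence[i - 1] if i > 0 else BOS,
        "NextWord": sentence[i + 1] if i + 1 < n else EOS,
        "Next2Word": sentence[i + 2] if i + 2 < n else EOS,
        "WL": len(word),  # word length in characters (code points)
    }


def sentence_features(sentence: List[str]) -> List[Feature]:
    """Return one feature dictionary per token of the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Illustrative four-token Urdu sentence ("This is a book").
example = ["یہ", "ایک", "کتاب", "ہے"]
features = sentence_features(example)
for token, feats in zip(example, features):
    print(token, feats)
```

None of the five features requires a lexicon, a stemmer, or previously predicted tags, so the same function can be applied unchanged to tokenized text in another language. One feature dictionary per token is also the input format that CRF toolkits such as sklearn-crfsuite accept, although the text does not say which toolkit was used.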
Conclusion and future work
A CRF-based automatic POS classifier for Urdu news text, developed using the MM-POST dataset, was presented. The model achieved state-of-the-art performance, attaining an overall accuracy of 96.1% and macro-average precision, recall, and F1-score of 98%, 86%, and 88%, respectively, using the training dataset. The trained model generalized well, achieving an even higher overall accuracy on external data than on the training dataset: 97.7% for the prediction of POS tags on external validation data that is not part of the training dataset. However, the average prediction accuracy per individual POS tag on the external validation data was 83%, compared with 86% on the training data. It can be improved further by increasing the amount of annotated training data so that the model learns further distinctive instances and variations of Urdu POS.
The input features “Word”, “Previous Word”, “Next Word” and “Second Next Word” of the current word/token, used as a context-word window, served well together with the “Word Length” feature of the current word/token in the classification and prediction of Urdu POS tags. Using lexical words as the context window of the current word helped the model learn and predict Urdu POS tags effectively.
The CRF model proved effective in the multi-class classification and prediction of Urdu POS tags for a dataset with 33 POS tags. The model achieved excellent performance for most of the POS tags, particularly for those having a sufficient number of occurrences in the dataset.
However, the performance can be further improved, and the weighted and macro-average F1-scores can be raised from 0.96 and 0.88, respectively, if the size of the POS-tagged corpus is increased to incorporate a sufficiently large number of instances, particularly for the less frequent POS tags such as “FF”, “INJ” and “QM”, which have support values of only seven, eight and three, respectively. The CRF model would then be able to predict these POS tags with high accuracy using the selected feature set.
Our approach for Urdu POS tagging has the potential for expansion to other languages, particularly morphologically rich, free-word-order languages such as Hindi, Punjabi, Pashto and Arabic. Experiments can be carried out in other languages with this easy-to-determine and computationally efficient feature set using CRF, other machine learning or deep learning models, and modern ensemble or transformer-based models. The ease and effectiveness of selecting and processing the feature set offers the opportunity for customization and further sophistication for other natural languages. Our work also opens avenues of research for applying the proposed approach to other Urdu NLP tasks, including named entity recognition, sentiment analysis, machine translation and text-to-speech systems.
In our future work, the POS tags generated through the presented approach will be employed as one of the features for various prediction tasks, such as named entity recognition of Urdu text, using ML and deep learning techniques.
Supplemental Information
Additional Information and Declarations
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
Mushtaq Ali conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.
Muzammil Khan conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.
Yasser Alharbi conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The raw data and code are available in the Supplemental Files.
The Mushtaq and Muzammil Part of Speech Tagged (MM-POST) dataset is available at GitHub and Zenodo:
- https://github.com/Mushtaq-Ali/MM-POST-dataset.
- Mushtaq Ali. (2024). Mushtaq-Ali/MM-POST-dataset: MM-POST dataset v1.0.0 (POSTagging). Zenodo. https://doi.org/10.5281/zenodo.14165184.
The POS tagged data from Urdu Universal Dependency Treebank (UDTB) is available at GitHub: https://github.com/UniversalDependencies/UD_Urdu-UDTB/blob/master/README.md.
Funding
The authors received no funding for this work.