A conditional random field based approach for high-accuracy part-of-speech tagging using language-independent features

PeerJ Computer Science


Introduction

Challenges of Urdu POS tagging

Motivation for study

Motivation for choice of method used

Research methodology

Urdu POS tagging through CRF


The dataset

Features

  No.  Feature      Description
  1.   Word         the current word/token
  2.   PrevWord     previous word of the current word/token
  3.   NextWord     next word of the current word/token
  4.   Next2Word    second next word of the current word/token
  5.   WordLength   length of the current word/token
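The five features listed above can be sketched as a small extraction routine. This is a minimal illustration, not the authors' implementation: the function names `token_features` and `sentence_features` and the boundary markers `<BOS>` and `<EOS>` are assumptions introduced here, since the table does not say how sentence edges are handled. The output is one dictionary per token, a shape that CRF toolkits commonly accept.

```python
from typing import Dict, List, Union

FeatureValue = Union[str, int]

# Hypothetical padding markers for positions outside the sentence;
# the article's table does not specify how boundaries are treated.
BOS = "<BOS>"
EOS = "<EOS>"


def token_features(sentence: List[str], i: int) -> Dict[str, FeatureValue]:
    """Build the five table features for the token at position i."""
    word = sentence[i]
    n = len(sentence)
    return {
        "Word": word,                                          # current token
        "PrevWord": sentence[i - 1] if i > 0 else BOS,         # token before it
        "NextWord": sentence[i + 1] if i + 1 < n else EOS,     # token after it
        "Next2Word": sentence[i + 2] if i + 2 < n else EOS,    # second token after it
        "WordLength": len(word),                               # length in characters
    }


def sentence_features(sentence: List[str]) -> List[Dict[str, FeatureValue]]:
    """Return one feature dictionary per token of a tokenized sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]
```

None of the five features depends on a lexicon, stemmer, or script-specific rule, which is what makes the set language-independent: the same routine applies unchanged to any tokenized text.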

Experiment

Evaluation of results

Performance metrics

Precision

Recall

F1-score

Accuracy
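The four metrics named in this section follow their standard definitions. The sketch below computes them from flat lists of gold and predicted tags; it assumes per-tag scoring and token-level accuracy, and the function names `per_tag_scores` and `accuracy` are introduced here for illustration. Whether the article reports per-tag, macro-averaged, or weighted figures is not shown in this section.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_tag_scores(gold: List[str], pred: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, F1) for every tag seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for tag g
        else:
            fp[p] += 1   # tag p was predicted but wrong
            fn[g] += 1   # tag g was missed
    scores = {}
    for tag in set(gold) | set(pred):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Undefined ratios (a tag that is never predicted, or never appears in the gold data) are reported as 0.0 here; that convention is a choice made for this sketch.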

Results analysis

Generalization on external data

Implementation of proposed CRF model using Urdu universal dependency treebank

SVM implementation and comparison with CRF-based Urdu POS tagging

Comparison with benchmark approaches

Conclusion and future work

Supplemental Information

CRF Implementation Code.

DOI: 10.7717/peerj-cs.2577/supp-1

CRF model POSTagging.

DOI: 10.7717/peerj-cs.2577/supp-2

URDU POS Tagged Dataset.

DOI: 10.7717/peerj-cs.2577/supp-5

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Mushtaq Ali conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Muzammil Khan conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Yasser Alharbi conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The raw data and code are available in the Supplemental Files.

The Mushtaq and Muzammil Part of Speech Tagged (MM-POST) dataset is available at GitHub and Zenodo:

- https://github.com/Mushtaq-Ali/MM-POST-dataset.

- Mushtaq Ali. (2024). Mushtaq-Ali/MM-POST-dataset: MM-POST dataset v1.0.0 (POSTagging). Zenodo. https://doi.org/10.5281/zenodo.14165184.

The POS tagged data from Urdu Universal Dependency Treebank (UDTB) is available at GitHub: https://github.com/UniversalDependencies/UD_Urdu-UDTB/blob/master/README.md.

Funding

The authors received no funding for this work.
