Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT

PeerJ Computer Science

Introduction

  • We developed a text corpus of over 15 million words and used it to pre-train the first monolingual BERT model and static word embeddings for the low-resource Pashto language.

  • We developed a benchmark Pashto Offensive Language Dataset (POLD).

  • We developed an NLP model for automatic detection of Pashto offensive language.

Offensive language

Hate speech

Profanity and vulgarity

Aggression and cyberbullying

Literature review

Data acquisition and dataset development

Pashto text corpus

POLD dataset

Tweets collection

Pre-processing

Manual annotation

Dataset summary

Methods

Deep learning methods

Word embeddings

Neural networks

Training the neural networks

Transfer learning methods

Pre-trained multilingual models

Pashto BERT: pre-training from scratch

Fine-tuning

Experimental results and evaluation

Evaluation metrics

Comparison of all the models

Comparison of the static word embeddings

Error analysis

Conclusion

Supplemental Information

Dataset with translations

DOI: 10.7717/peerj-cs.1617/supp-3

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Ijazul Haq conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, and approved the final draft.

Weidong Qiu conceived and designed the experiments, analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Jie Guo analyzed the data, authored or reviewed drafts of the article, and approved the final draft.

Peng Tang performed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The data is available at Zenodo: Haq, Ijazul. (2023). Pashto Offensive Language Dataset [Data set]. Zenodo. https://zenodo.org/record/8195797.

Funding

The authors received no funding for this work.
