Pashto poetry generation: deep learning with pre-trained transformers for low-resource languages

PeerJ Computer Science


Introduction

Methodology

Development of dataset

Data pre-processing

Proposed methods

bigscience/bloomz-560m

  1. Tokenizer: The tokenizer is responsible for breaking down text into individual tokens, which can be words, subwords, or other units.

  2. Embedding and hidden states: The token ids are embedded and passed through the stack of transformer layers, which converts them into a sequence of hidden states. Each hidden state represents the model’s understanding of a token in its context.

  3. Decoder: bloomz-560m is a decoder-only model, so the same transformer stack generates text autoregressively, producing one token at a time conditioned on the tokens generated so far.

  4. Attention: Attention is a mechanism that allows the model to focus on the most relevant parts of the input sequence when generating each output token. A minimal loading and generation sketch follows this list.
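This pipeline can be reproduced with the Hugging Face transformers library. The following is a minimal sketch using the publicly released bigscience/bloomz-560m checkpoint; the prompt text and generation settings (sampling, temperature, token budget) are illustrative assumptions rather than the configuration used in the paper.

```python
# Minimal sketch: load bigscience/bloomz-560m and generate a continuation.
# The prompt text and sampling settings are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenizer: split the prompt into subword tokens and map them to ids.
prompt = "Write a short poem about spring."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Decoder with self-attention: produce one token at a time, each step
# attending over the prompt and the tokens generated so far.
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```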

MBZUAI/LaMini-Cerebras-590M

  1. Input: The input to the model is a natural language instruction.

  2. Tokenizer: The tokenizer breaks down the instruction into individual tokens, which can be words, subwords, or other units.

  3. Transformer decoder: The Transformer decoder takes the tokens from the tokenizer and generates a sequence of hidden states. Each hidden state represents the model’s understanding of the token and its context.

  4. Output: The output of the model is the generated response to the instruction; a minimal generation sketch follows this list.
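As a concrete illustration of this input-to-output flow, the following sketch uses the Hugging Face text-generation pipeline with the public MBZUAI/LaMini-Cerebras-590M checkpoint. The instruction/response prompt wrapper and the English placeholder instruction are assumptions made for illustration; the paper's instructions are in Pashto.

```python
# Minimal sketch: instruction-following generation with LaMini-Cerebras-590M.
# The prompt template and example instruction are assumptions, not taken
# from the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="MBZUAI/LaMini-Cerebras-590M")

# Input: a natural-language instruction (placeholder English text here).
instruction = "Write a short couplet about spring."
prompt = f"### Instruction:\n{instruction}\n\n### Response:"

# Tokenizer + Transformer decoder: the pipeline tokenizes the prompt and the
# decoder-only model generates the response token by token.
result = generator(prompt, max_new_tokens=64, do_sample=True)
print(result[0]["generated_text"])
```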

Results and discussions

Quantitative evaluation

Qualitative evaluation

Conclusions and future work

Supplemental Information

A Jupyter Notebook designed for fine-tuning the BLOOM 560M language model.

It covers the steps for loading and preprocessing the dataset, setting up the BLOOM tokenizer and model, and executing the fine-tuning process.

DOI: 10.7717/peerj-cs.2163/supp-2
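For readers without access to the notebook, the following is a minimal sketch of a comparable workflow, assuming an instruction-style JSON file and the Hugging Face Trainer API; the file name, prompt concatenation, and hyperparameters are illustrative assumptions, not the notebook's exact configuration.

```python
# Minimal fine-tuning sketch for bigscience/bloomz-560m on an instruction-
# style dataset. File name, template, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the instruction/input/output entries (hypothetical file name).
dataset = load_dataset("json", data_files="pashto_poetry.json")["train"]

def to_text(example):
    # Concatenate the Pashto instruction and the poetic output into one
    # training string; the plain newline separator is an assumption.
    return {"text": example["instruction"] + "\n" + example["output"]}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

texts = dataset.map(to_text)
tokenized = texts.map(tokenize, remove_columns=texts.column_names)

# Causal-LM collator: labels are the input ids (shifted inside the model).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="bloomz-560m-pashto-poetry",  # hypothetical output path
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```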

A fine-tuning dataset containing instructional entries.

Each entry includes an instruction in Pashto, an empty input, and a poetic output.

DOI: 10.7717/peerj-cs.2163/supp-3
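As a reference for the structure of a single entry (not its actual contents), the following sketch uses placeholder values; the exact field names are assumed from the instruction/input/output description above, and the real instructions and outputs are in Pashto.

```python
# Structural sketch of one dataset entry; values and field names are
# placeholders, not actual dataset content.
example_entry = {
    "instruction": "<Pashto instruction, e.g. a request for a poem on a given theme>",
    "input": "",  # the input field is left empty in this dataset
    "output": "<Pashto poetic text the model should learn to generate>",
}
```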

Additional Information and Declarations

Competing Interests

Khursheed Aurangzeb is an Academic Editor for PeerJ.

Author Contributions

Imran Ullah conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Khalil Ullah conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Hamad Khan conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Khursheed Aurangzeb conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Muhammad Shahid Anwar conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Ikram Syed conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The dataset and the code are available in the Supplemental Files.

Funding

This research is funded by the Researchers Supporting Project Number (RSPD2024R947), King Saud University, Riyadh, Saudi Arabia. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
