Deep learning-based approach for Arabic open domain question answering

PeerJ Computer Science


Introduction

Model overview

Dense passage retriever

Dense passage retriever methodology

Retriever evaluation

AraELECTRA passage reader

AraELECTRA

Reader evaluation

Dataset

Arabic reading comprehension dataset

TyDiQA

End-to-end system: Arabic OpenQA

Experiments and results

Fine-tuning the multilingual dense passage retriever on Arabic datasets

  • We used the 01-09-2021 dump of Arabic Wikipedia (Wikimedia Foundation, 2021) as our knowledge source for answering factoid questions. Only the plain text was extracted; all other data, such as lists and figures, were removed with the WikiExtractor tool (Attardi, 2015). After removing internal disambiguation, list, index, and outline pages, we extracted 3.2 million pages containing 2,130,180 articles. Due to memory limitations, we used only 491,253 of these articles.

  • We used Elasticsearch (elastic, 2021) to store the document text and other metadata. We pre-processed the text by removing empty lines, extra whitespace, and long headers and footers. We then split the files into small documents of around 100 words and stored them in the Elasticsearch document store. The vector embeddings of the text were indexed with Elasticsearch indexing, which was then searched to retrieve answers.
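The cleaning and splitting step above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation; the function names and the exact cleaning rules are assumptions, and the real system stores the resulting passages in Elasticsearch rather than returning them as a list.

```python
# Hypothetical sketch: clean raw article text and split it into
# passages of roughly 100 words before indexing.

def clean_text(text: str) -> str:
    """Drop empty lines and surrounding whitespace."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

def split_into_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split cleaned text into chunks of about `words_per_passage` words."""
    words = clean_text(text).split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

article = "word " * 250  # a dummy 250-word article
passages = split_into_passages(article)
print(len(passages))     # 3 passages: 100 + 100 + 50 words
```

Word-based chunking like this keeps passages short enough for the passage encoder's 256-token limit mentioned below, at the cost of occasionally splitting mid-sentence.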

  • We initialized our DPR to search the DocumentStore and return the top 20 passages most relevant to the query. Initializing and training the DPR retriever took the following arguments:

  1. document_store: A DocumentStore object from which documents can be retrieved.

  2. query_embedding_model: A question encoder checkpoint. We used the mDPR (Voidful, 2021) checkpoint from Hugging Face Transformers.

  3. passage_embedding_model: A passage encoder checkpoint. We also used the mDPR (Voidful, 2021) checkpoint from Hugging Face Transformers.

  4. max_seq_len_query: The maximum number of tokens for the query, set to 64.

  5. max_seq_len_passage: The maximum number of tokens for the passage, set to 256.

  6. batch_size: The number of queries and passages to encode per batch, set to 4.

  7. similarity_function: During training, the dot_product function is used to compute the similarity between the query and passage embeddings.

  8. query: The question text.

  9. filters: A dictionary whose keys indicate a metadata field and whose values are lists of acceptable values for that field.

  10. top_k: The number of passages to retrieve per question.

  11. index: The name of the DocumentStore index from which documents are retrieved.
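The scoring step behind these arguments can be illustrated with a minimal sketch: score every passage embedding against the query embedding with the dot product (the similarity_function above) and keep the top_k results. The toy vectors and function names here are illustrative; in the real system the embeddings come from the mDPR query and passage encoders, and the search runs inside Elasticsearch rather than in Python.

```python
# Minimal sketch of dot-product retrieval over passage embeddings.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_emb, passage_embs, top_k=20):
    """Return (index, score) pairs for the top_k most similar passages."""
    scores = [(i, dot(query_emb, emb)) for i, emb in enumerate(passage_embs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

query = [0.1, 0.9, 0.0]
passages = [
    [0.1, 0.8, 0.1],   # close to the query
    [0.9, 0.0, 0.1],   # unrelated
    [0.0, 1.0, 0.0],   # closest
]
print(retrieve_top_k(query, passages, top_k=2))  # passage 2 first, then 0
```

Unlike cosine similarity, the dot product is not length-normalized, which is why DPR-style models are trained with the same similarity function that is later used at retrieval time.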

Retriever results

Fine-tuning AraELECTRA for the reading comprehension task

  • Replace emojis

  • Remove HTML markup, except in the TyDiQA-GoldP dataset

  • Replace email addresses

  • Remove diacritics and tatweel

  • Insert whitespace before and after all non-Arabic digits, English digits, and Arabic and English alphabet letters

  • Insert whitespace between words and numbers or numbers and words
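A few of the steps above can be sketched with regular expressions. This is a hedged sketch: the exact patterns used by the authors are not given in the text, so the character ranges and the `normalize` function here are illustrative assumptions covering only the diacritics/tatweel removal and the digit-spacing rules.

```python
import re

# Arabic diacritics (fathatan U+064B .. sukun U+0652) and tatweel (kashida).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"

def normalize(text: str) -> str:
    """Apply a subset of the listed pre-processing steps (illustrative)."""
    text = DIACRITICS.sub("", text)             # remove diacritics
    text = text.replace(TATWEEL, "")            # remove tatweel (kashida)
    text = re.sub(r"(\D)(\d)", r"\1 \2", text)  # space between word and number
    text = re.sub(r"(\d)(\D)", r"\1 \2", text)  # space between number and word
    return re.sub(r"\s+", " ", text).strip()

print(normalize("عامـــ2021م"))  # tatweel removed, digits separated from letters
```

Normalizing the training and evaluation text the same way matters here because AraELECTRA's vocabulary was built on text without diacritics or tatweel, so unnormalized input would fragment into rare subword tokens.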

Final results

Conclusions

Supplemental Information

ARCD+TyDiQA dataset.

DOI: 10.7717/peerj-cs.952/supp-2

Code for combining retriever with reader.

DOI: 10.7717/peerj-cs.952/supp-3

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Kholoud Alsubhi conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Amani Jamal analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.

Areej Alhothali analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The TyDiQA dataset is available at GitHub: https://github.com/google-research-datasets/tydiqa.

The ARCD dataset is available at https://huggingface.co/datasets/arcd.

Funding

The authors received no funding for this work.
