Digitising a machine-tractable version of Kamus Dewan with TEI-P5

Lian Tze Lim; Ruoh Tau Chiew; Enya Kong Tang; Rusli Abdul Ghani; Naimah Yusof

doi:10.7287/peerj.preprints.2205v1

Lian Tze Lim¹, Ruoh Tau Chiew², Enya Kong Tang¹, Rusli Abdul Ghani³, Naimah Yusof³

1 (not affiliated), Penang, Malaysia

2 The Name Technology, Cyberjaya, Malaysia

3 Dewan Bahasa dan Pustaka, Kuala Lumpur, Malaysia

Published: 2016-07-01
Accepted: 2016-07-01

Subject Areas: Computational Linguistics, Natural Language and Speech
Keywords: Machine-tractable dictionaries, TEI, Language resources, Bahasa Malaysia

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Lim LT, Chiew RT, Tang EK, Abdul Ghani R, Yusof N. 2016. Digitising a machine-tractable version of Kamus Dewan with TEI-P5. PeerJ Preprints 4:e2205v1 https://doi.org/10.7287/peerj.preprints.2205v1

Abstract

Kamus Dewan is the authoritative dictionary of Bahasa Malaysia and contains a wealth of linguistic and cultural information about the language. It is currently available in print and as a searchable online dictionary. However, the online dictionary lacks advanced search capabilities that target specific fields within each headword and lemma entry. For computers to target and extract this information efficiently, the macro- and micro-structures of Kamus Dewan entries must first be annotated, or marked up, explicitly. We describe how the TEI-P5 guidelines have been applied to make Kamus Dewan more machine-tractable. We also give examples of how the resulting machine-tractable data can be used for linguistic research and analysis, as well as for producing other language resources.
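To illustrate what field-targeted search over marked-up entries might look like, the following is a minimal sketch rather than the encoding actually used for Kamus Dewan. It assumes entries are encoded with the standard TEI-P5 dictionary elements (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `def`, `cit`); the two sample entries, their wording, the part-of-speech labels and the helper function names are all hypothetical and are not taken from the dictionary.

```python
import xml.etree.ElementTree as ET

# TEI-P5 namespace; the "tei" prefix is only a local alias for the queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample entries, NOT taken from Kamus Dewan.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <entry xml:id="makan">
    <form type="lemma"><orth>makan</orth></form>
    <gramGrp><pos>verb</pos></gramGrp>
    <sense n="1">
      <def>memasukkan sesuatu ke dalam mulut lalu menelannya</def>
      <cit type="example"><quote>makan nasi</quote></cit>
    </sense>
  </entry>
  <entry xml:id="rumah">
    <form type="lemma"><orth>rumah</orth></form>
    <gramGrp><pos>noun</pos></gramGrp>
    <sense n="1">
      <def>bangunan untuk tempat tinggal</def>
    </sense>
  </entry>
</body>
""".strip()


def lemma_of(entry):
    """Return the orthographic form of an entry's lemma, or None if absent."""
    return entry.findtext("tei:form[@type='lemma']/tei:orth", namespaces=TEI_NS)


def search_definitions(root, keyword):
    """Return lemmas whose definitions (and only definitions) contain keyword."""
    hits = []
    for entry in root.findall("tei:entry", TEI_NS):
        for definition in entry.findall("tei:sense/tei:def", TEI_NS):
            if keyword in (definition.text or ""):
                hits.append(lemma_of(entry))
                break  # one hit per entry is enough
    return hits


def lemmas_with_pos(root, pos):
    """Return lemmas whose part-of-speech field equals pos."""
    return [
        lemma_of(entry)
        for entry in root.findall("tei:entry", TEI_NS)
        if entry.findtext("tei:gramGrp/tei:pos", namespaces=TEI_NS) == pos
    ]


root = ET.fromstring(SAMPLE)
print(search_definitions(root, "mulut"))
print(lemmas_with_pos(root, "noun"))
```

Because each field has its own element, a query can be restricted to definitions without also matching headwords or example quotations, which a plain full-text search cannot guarantee.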

Author Comment

This is a preprint submission to PeerJ Preprints.