Digitising a machine-tractable version of Kamus Dewan with TEI-P5

(not affiliated), Penang, Malaysia
The Name Technology, Cyberjaya, Malaysia
Dewan Bahasa dan Pustaka, Kuala Lumpur, Malaysia
DOI
10.7287/peerj.preprints.2205v1
Subject Areas
Computational Linguistics, Natural Language and Speech
Keywords
Machine-tractable dictionaries, TEI, Language resources, Bahasa Malaysia
Copyright
© 2016 Lim et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Lim LT, Chiew RT, Tang EK, Abdul Ghani R, Yusof N. 2016. Digitising a machine-tractable version of Kamus Dewan with TEI-P5. PeerJ Preprints 4:e2205v1

Abstract

Kamus Dewan is the authoritative dictionary for Bahasa Malaysia and contains a wealth of linguistic and cultural information about the language. It is currently available in print and as a searchable online dictionary. However, the online dictionary lacks advanced search capabilities that target specific fields within each headword and lemma entry. For computers to target and extract this information efficiently, the macro- and micro-structures of Kamus Dewan entries must first be annotated, or marked up, explicitly. We describe how the TEI-P5 guidelines have been applied in this endeavour to make Kamus Dewan more machine-tractable. We also give examples of how the machine-tractable data from Kamus Dewan can be used for linguistic research and analysis, as well as for producing other language resources.
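
As a rough illustration of what field-level targeting could look like once entries carry explicit markup, the sketch below parses a single TEI-style dictionary entry and pulls out its headword, part of speech and definitions. The element names (entry, form, orth, gramGrp, pos, sense, def) come from the TEI-P5 dictionary module, but the sample entry, its placeholder definitions and the function name extract_fields are invented here for illustration; they are not taken from Kamus Dewan or from the authors' actual encoding scheme.

```python
import xml.etree.ElementTree as ET

# TEI-P5 namespace; the "tei" prefix is only a local alias used in the paths below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical entry for illustration only (not real Kamus Dewan content).
SAMPLE_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="makan">
  <form type="lemma"><orth>makan</orth></form>
  <gramGrp><pos>verb</pos></gramGrp>
  <sense n="1"><def>placeholder definition one</def></sense>
  <sense n="2"><def>placeholder definition two</def></sense>
</entry>
"""


def extract_fields(entry_xml):
    """Return the headword, part of speech and definitions of one entry."""
    entry = ET.fromstring(entry_xml.strip())
    return {
        "headword": entry.findtext("tei:form/tei:orth", namespaces=TEI_NS),
        "pos": entry.findtext("tei:gramGrp/tei:pos", namespaces=TEI_NS),
        "definitions": [
            d.text for d in entry.findall("tei:sense/tei:def", TEI_NS)
        ],
    }


fields = extract_fields(SAMPLE_ENTRY)
print(fields["headword"], fields["pos"], len(fields["definitions"]))
```

Because each field sits in its own element, a query such as "all verbs with more than one sense" reduces to a filter over the extracted dictionaries rather than a free-text search.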

Author Comment

This is a preprint submission to PeerJ Preprints.