Designing Universal Chemical Markup (UCM) through the reusable methodology based on analyzing existing related formats

Department of Inorganic Chemistry, University of Chemistry and Technology Prague, Prague, Czech Republic
Department of Software Engineering, Czech Technical University in Prague, Prague, Czech Republic
DOI
10.7287/peerj.preprints.1335v1
Subject Areas
Data Science, World Wide Web and Web Science, Software Engineering
Keywords
chemical formats analysis, reusable methodology, designing UCM, UCM concepts, utilizing XML benefits
Copyright
© 2015 Mokrý et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Mokrý J, Nič M. 2015. Designing Universal Chemical Markup (UCM) through the reusable methodology based on analyzing existing related formats. PeerJ PrePrints 3:e1335v1

Abstract

Background: To design concepts for a new general-purpose chemical format, we analyzed the strengths and weaknesses of current formats for common chemical data. The new format itself is discussed in more detail in the next article; here we describe our software tools and the two-stage analysis procedure that supplied the information needed for its development. The chemical formats analyzed in both stages were CDX, CDXML, CML, CTfile and XDfile. The following formats were included in the first stage only: CIF, InChI, NCBI ASN.1, NCBI XML, PDB, PDBx/mmCIF, PDBML, SMILES, SLN and Mol2.

Results: The two-stage analysis process, devised for both XML (Extensible Markup Language) and non-XML formats, enabled us to verify whether and how the potential advantages of XML are utilized in widely used general-purpose chemical formats. In the first stage we gathered information about the analyzed formats and selected those with the most general-purpose chemical functionality for the second stage. In the second stage we used our set of software quality requirements to assess the benefits and issues of the selected formats. In addition, a detailed analysis of the structure of the XML formats in the second stage helped us identify the concepts in those formats. Using these concepts, we arrived at a concise structure for a new chemical format, which is designed to provide precise built-in validation capabilities and aims to avoid the potential issues of the analyzed formats.

Conclusions: We believe our analysis methodology is potentially highly reusable and could be easily adapted even to domains outside chemistry. The methodology and software tools should need only a few changes, although the analyzed formats and the software quality requirements for a format will differ according to the given domain.

Author Comment

This is a preprint submission to PeerJ Computer Science.

Supplemental Information

Designing Universal Chemical Markup - Supplemental information

Supplemental information for the article "Designing Universal Chemical Markup (UCM) through the reusable methodology based on analyzing existing related formats" includes additional files 1 (Interactive references), 2 (Formats excluded from second stage) and 3 (UCM tree structure).

DOI: 10.7287/peerj.preprints.1335v1/supp-1