An automated identification and analysis of ontological terms in gastrointestinal diseases and nutrition-related literature provides useful insights

Human Nutrition, School of Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
Infrastructure and Environment Research Division, School of Engineering, University of Glasgow, Glasgow, United Kingdom
Department of Paediatric Gastroenterology, Royal Hospital for Children, Glasgow, United Kingdom
DOI
10.7287/peerj.preprints.26869v1
Subject Areas
Bioinformatics, Algorithms and Analysis of Algorithms, Data Mining and Machine Learning, Software Engineering, Visual Analytics
Keywords
ontology, inflammatory bowel disease, text mining, ecological statistics, human nutrition, ordination, gastrointestinal disease, Crohn's disease, Coeliac disease, Ulcerative Colitis
Copyright
© 2018 Koci et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Koci O, Logan M, Svolos V, Russell RK, Gerasimidis K, Ijaz UZ. 2018. An automated identification and analysis of ontological terms in gastrointestinal diseases and nutrition-related literature provides useful insights. PeerJ Preprints 6:e26869v1

Abstract

With the unprecedented growth of the biomedical literature, keeping up to date with new developments presents an immense challenge. Publications are often studied in isolation from the established literature, and their interpretation is subjective and prone to human bias. With ontology-driven annotation of biomedical data gaining popularity in recent years, and online databases offering metatags with rich textual information, it is now possible to text-mine ontological terms automatically and to complement the laborious manual management, interpretation, and analysis of the accumulated literature with downstream statistical analysis. In this paper, we formulate an automated workflow with which we identify ontological information, including nutrition-related terms, in PubMed abstracts (from 1991 to 2016) for the two main types of Inflammatory Bowel Disease, Crohn's Disease and Ulcerative Colitis, and for two other gastrointestinal diseases, namely Coeliac Disease and Irritable Bowel Syndrome. Our analysis reveals unique clustering patterns, as well as spatial and temporal trends, in the literature accumulated so far for these gastrointestinal diseases. Although automated interpretation cannot replace human judgement, the workflow shows promising results and can be a useful tool in systematic literature reviews. The workflow is available at https://github.com/KociOrges/pytag.

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Differential expression analysis of nutrition-related terms between disease conditions

Four tables, Table_S1A (1991-1998), Table_S1B (1999-2004), Table_S1C (2005-2010), and Table_S1D (2011-2016), give the differential expression analysis of nutrition-related terms between diseases using the Kruskal-Wallis test. Only terms with an adjusted p-value (Padj) < 0.05 are shown. Mean expression is the mean document-based normalised frequency of a given term in each disease group. The right half of each table shows a post hoc pairwise Dunn’s comparison indicating which groups differ significantly.

DOI: 10.7287/peerj.preprints.26869v1/supp-1

Differential expression analysis of nutrition-related terms between years

Four tables, Table_S2A (CCD), Table_S2B (CD), Table_S2C (IBS), and Table_S2D (UC), give the differential expression analysis of nutrition-related terms between years using the Kruskal-Wallis test. Only terms with an adjusted p-value (Padj) < 0.05 are shown. Mean expression is the mean document-based normalised frequency of a given term in each interval.

DOI: 10.7287/peerj.preprints.26869v1/supp-2
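
Both sets of supplemental tables rest on the same screening step: a Kruskal-Wallis test per term, followed by p-value adjustment and a Padj < 0.05 cut-off. The sketch below illustrates that step in plain Python and is not taken from the pytag workflow. The function names, the toy frequencies, and the choice of Benjamini-Hochberg as the adjustment method are all assumptions made for illustration; the page does not state which adjustment was used, and Dunn's post hoc comparison is not covered here.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1 - ties / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0

def chi2_sf(x, df):
    """Upper tail of the chi-square distribution (lower-gamma series)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed adjustment method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, i in enumerate(order):
        rank = m - pos  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical normalised frequencies of two terms in four disease groups.
terms = {
    "gluten": [[0.9, 0.8, 0.7], [0.1, 0.2, 0.15],
               [0.3, 0.25, 0.35], [0.05, 0.12, 0.08]],
    "fibre": [[0.2, 0.3, 0.25], [0.22, 0.28, 0.3],
              [0.27, 0.21, 0.26], [0.24, 0.29, 0.23]],
}
names = list(terms)
pvals = [chi2_sf(kruskal_h(terms[t]), len(terms[t]) - 1) for t in names]
padj = bh_adjust(pvals)
significant = [t for t, p in zip(names, padj) if p < 0.05]
print(significant)
```

The chi-square approximation to the H statistic is only rough for groups as small as those in this toy data; in practice an established statistics library would be a more robust choice than these hand-written helpers.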