TaxaSE: Exploiting evolutionary conservation within 16S rDNA sequences for enhanced taxonomic annotation

Hawkesbury Institute for the Environment, University of Western Sydney, Richmond, Australia
Warwick Medical School - Microbiology and Infection, University of Warwick, Warwick, United Kingdom
DOI
10.7287/peerj.preprints.2941v1
Subject Areas
Biodiversity, Bioinformatics, Computational Biology, Microbiology
Keywords
Taxonomic Annotation, Shannon Entropy, Pipeline, 16S rDNA, Microbial, Bacterial, Next Generation Sequencing, QIIME
Copyright
© 2017 Ijaz et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Ijaz AZ, Jeffries T, Quince C, Hamonts K, Singh B. 2017. TaxaSE: Exploiting evolutionary conservation within 16S rDNA sequences for enhanced taxonomic annotation. PeerJ Preprints 5:e2941v1

Abstract

Amplicon-based taxonomic analysis, which determines the presence of microbial taxa in different environments on the basis of marker gene annotations, often uses percentage identity as the main metric of sequence similarity against databases. These data are then used to study the distribution of biodiversity as well as the response of microbial communities to environmental conditions. However, the 16S rRNA gene displays varying degrees of sequence conservation along its length, and percentage identity does not fully utilize this information. Additionally, the prevalent use of Operational Taxonomic Units (OTUs) is not without its own issues and may reduce the annotation capability of the system. Hence, a novel approach to taxonomic annotation is needed. Here we introduce a new taxonomic annotation pipeline, TaxaSE, which uses Shannon entropy to quantify evolutionary conservation within 16S rDNA sequences for enhanced taxonomic annotation. Furthermore, the system can annotate individual sequences in order to improve fine-grained taxonomic annotation. We present both an in silico comparison of the new similarity metric with percentage identity and a comparison with the popular QIIME pipeline. The results demonstrate that the new similarity metric achieves better performance, especially at lower taxonomic levels. Furthermore, the pipeline is able to extract more fine-grained taxonomic annotations than QIIME. These results not only demonstrate the effectiveness of the new pipeline but also highlight the need to shift away from both percentage identity and OTU-based approaches for ecological projects.
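
To make the contrast in the abstract concrete, the following minimal sketch computes the standard per-column Shannon entropy of a small alignment alongside a plain percentage identity. It is an illustration only and is not the TaxaSE similarity metric, whose exact definition is not given in this abstract. The function names (`column_entropy`, `percent_identity`) and the toy alignment are hypothetical and chosen for this example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring non-ACGT symbols."""
    bases = [b for b in column.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    total = len(bases)
    counts = Counter(bases)
    # H = sum over observed bases of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Toy alignment (hypothetical data, not taken from the paper).
alignment = ["ACGTA", "ACGTC", "ACGAG", "ACGCT"]

# Entropy per column: low values mark conserved positions, high values variable ones.
entropies = [column_entropy("".join(col)) for col in zip(*alignment)]

# Percentage identity counts every mismatch equally, wherever it occurs.
identity = percent_identity(alignment[0], alignment[1])
```

In this toy case the first three columns are fully conserved (entropy 0 bits), while the last column contains all four bases (entropy 2 bits). Percentage identity alone cannot tell a mismatch at a conserved position from one at a variable position, which is the information the abstract says is left unused.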

Author Comment

This is the first version of the TaxaSE pipeline, developed primarily in Java 8 and UNIX bash scripts.
