Taxonomic identification of environmental DNA with informatic sequence classification trees.

School of Biological Sciences, Victoria University of Wellington, Wellington, New Zealand
Faculty of Science and Engineering, Curtin University of Technology, Perth, Western Australia, Australia
Department of Biological Sciences, Macquarie University, Sydney, New South Wales, Australia
DOI
10.7287/peerj.preprints.26812v1
Subject Areas
Biodiversity, Bioinformatics, Computational Biology, Marine Biology, Data Mining and Machine Learning
Keywords
eDNA, Metabarcoding, Machine Learning, Bioinformatics
Copyright
© 2018 Wilkinson et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Wilkinson SP, Davy SK, Bunce M, Stat M. 2018. Taxonomic identification of environmental DNA with informatic sequence classification trees. PeerJ Preprints 6:e26812v1

Abstract

High-throughput sequencing of environmental DNA (eDNA) offers a simple and cost-effective solution for marine biodiversity assessments. Yet several analytical challenges remain, including the incorporation of statistical inference into the assignment of taxonomic identities. We developed a probabilistic method for DNA barcode classification that can be used for both eDNA and traditional single-source sampling. The pipeline involves: (1) compiling a primer-specific database of barcode sequences, obtained from GenBank and other sequence repositories, to be used as training data; (2) generating a classification tree using an iterative learning algorithm that divisively sorts the training data into hierarchical clusters based on profile hidden Markov models; (3) assigning each query sequence to a cluster using a recursive series of model-comparison tests; and (4) identifying each query sequence taxonomically based on the lowest common taxonomic rank of the training sequences within its cluster. The method compares favorably with other DNA classification methods when tested on benchmark datasets, and offers the added features of classifying at higher taxonomic ranks and returning interpretable confidence values in the form of the Akaike weight statistic. The pipeline is available as an open source R package called ‘insect’ (informatic sequence classification trees).
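Steps (3) and (4) can be illustrated with a minimal Python sketch. It is not the ‘insect’ package or its API: every function name, the dictionary-based tree, the toy log-likelihoods that stand in for profile hidden Markov model scores, and the 0.9 acceptance threshold are assumptions made for illustration. The Akaike weights use the standard formula (AIC = 2k - 2 ln L, with weights proportional to exp(-ΔAIC/2)); the abstract does not state how the package itself computes or combines them.

```python
import math


def akaike_weights(log_likelihoods, n_params):
    """Standard Akaike weights for a set of competing models."""
    aics = [2 * k - 2 * ll for ll, k in zip(log_likelihoods, n_params)]
    best = min(aics)
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]


def lowest_common_taxon(lineages):
    """Deepest taxon shared by all root-to-tip lineages (None if none is shared)."""
    common = None
    for ranks in zip(*lineages):
        if any(r != ranks[0] for r in ranks):
            break
        common = ranks[0]
    return common


def classify(node, score, threshold=0.9):
    """Descend the tree while one child model is clearly preferred.

    `score(model)` returns (log_likelihood, n_params) for the query under
    that model. `threshold` is a hypothetical cut-off, not a package default.
    """
    path_weights = []
    while node["children"]:
        scores = [score(child["model"]) for child in node["children"]]
        weights = akaike_weights([s[0] for s in scores], [s[1] for s in scores])
        best = max(range(len(weights)), key=lambda i: weights[i])
        if weights[best] < threshold:
            break  # ambiguous split: stop and report a higher rank
        path_weights.append(weights[best])
        node = node["children"][best]
    return lowest_common_taxon(node["lineages"]), path_weights


def leaf(model, lineage):
    return {"model": model, "lineages": [lineage], "children": []}


def cluster(model, children):
    lineages = [lin for child in children for lin in child["lineages"]]
    return {"model": model, "lineages": lineages, "children": children}


# Toy training tree: two fish families with two genera each.
FISH = ["Animalia", "Chordata", "Actinopterygii"]
tree = cluster("root", [
    cluster("labrid", [
        leaf("thalassoma", FISH + ["Labridae", "Thalassoma"]),
        leaf("coris", FISH + ["Labridae", "Coris"]),
    ]),
    cluster("pomacentrid", [
        leaf("chromis", FISH + ["Pomacentridae", "Chromis"]),
        leaf("dascyllus", FISH + ["Pomacentridae", "Dascyllus"]),
    ]),
])

# Made-up (log-likelihood, parameter count) pairs for one query sequence.
TOY_SCORES = {
    "labrid": (-100.0, 10),
    "pomacentrid": (-110.0, 10),
    "thalassoma": (-50.0, 10),
    "coris": (-50.5, 10),
}

taxon, path_weights = classify(tree, TOY_SCORES.__getitem__)
```

In this toy run the family-level split is decisive, while the two genus models are nearly tied, so the query is reported at family rank rather than forced to a genus. This mirrors the abstract's point about classifying at higher taxonomic ranks when the evidence does not support a finer assignment.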

Author Comment

This abstract has been accepted for the WCMB.