Taxonomic identification of environmental DNA with informatic sequence classification trees.

Shaun P Wilkinson; Simon K Davy; Michael Bunce; Michael Stat

doi:10.7287/peerj.preprints.26812v1

Shaun P Wilkinson¹, Simon K Davy¹, Michael Bunce², Michael Stat³

¹ School of Biological Sciences, Victoria University of Wellington, Wellington, New Zealand

² Faculty of Science and Engineering, Curtin University of Technology, Perth, Western Australia, Australia

³ Department of Biological Sciences, Macquarie University, Sydney, New South Wales, Australia

Published: 2018-03-30
Accepted: 2018-03-30

Subject Areas: Biodiversity, Bioinformatics, Computational Biology, Marine Biology, Data Mining and Machine Learning
Keywords: eDNA, Metabarcoding, Machine Learning, Bioinformatics

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Wilkinson SP, Davy SK, Bunce M, Stat M. 2018. Taxonomic identification of environmental DNA with informatic sequence classification trees. PeerJ Preprints 6:e26812v1 https://doi.org/10.7287/peerj.preprints.26812v1

Abstract

High-throughput sequencing of environmental DNA (eDNA) offers a simple and cost-effective solution for marine biodiversity assessments. Yet several analytical challenges remain, including the incorporation of statistical inference into the assignment of taxonomic identities. We developed a probabilistic method for DNA barcode classification that can be used for both eDNA and traditional single-source sampling. The pipeline involves: (1) compiling a primer-specific database of barcode sequences to be used as training data (obtained from GenBank and other sequence repositories), (2) generating a classification tree using an iterative learning algorithm that divisively sorts the training data into hierarchical clusters based on profile hidden Markov models, (3) assigning each query sequence to a cluster using a recursive series of model-comparison tests, and (4) identifying each query sequence taxonomically based on the lowest common taxonomic rank of the training sequences within its cluster. This method compares favorably to other DNA classification methods when tested on benchmark datasets, and offers the added features of classifying at higher taxonomic ranks and returning interpretable confidence values in the form of the Akaike weight statistic. This bioinformatics pipeline is available as an open-source R package called ‘insect’ (informatic sequence classification trees).
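Steps (3) and (4) rest on two small computations that can be sketched in isolation: turning the log-likelihoods of competing models into Akaike weights, and finding the lowest taxonomic rank shared by all training sequences in a cluster. The Python sketch below is illustrative only and is not the ‘insect’ package's implementation. The function names, the per-model parameter counts, and the example log-likelihoods and lineages are all hypothetical; the only assumed formula is the standard Akaike weight, w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2), where Δ_i is the difference between a model's AIC and the lowest AIC.

```python
import math
from typing import List, Sequence


def akaike_weights(log_likelihoods: Sequence[float],
                   n_params: Sequence[int]) -> List[float]:
    """Return the Akaike weight of each candidate model.

    AIC_i = 2 * k_i - 2 * lnL_i, and each weight is the model's relative
    likelihood exp(-delta_i / 2) normalised to sum to 1.
    """
    aic = [2 * k - 2 * ll for ll, k in zip(log_likelihoods, n_params)]
    best = min(aic)
    relative = [math.exp(-0.5 * (a - best)) for a in aic]
    total = sum(relative)
    return [r / total for r in relative]


def lowest_common_taxon(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-tip prefix shared by every lineage."""
    common: List[str] = []
    for names in zip(*lineages):
        if len(set(names)) != 1:  # lineages diverge at this rank
            break
        common.append(names[0])
    return common


# Hypothetical example: a query scored against the models of two child
# clusters (made-up log-likelihoods, equal parameter counts assumed).
weights = akaike_weights([-100.0, -103.0], [10, 10])

# Hypothetical lineages of the training sequences in the chosen cluster.
cluster = [
    ["Animalia", "Chordata", "Actinopterygii"],
    ["Animalia", "Chordata", "Mammalia"],
]
shared = lowest_common_taxon(cluster)
print(weights, shared)
```

In this sketch the query would be passed to the child cluster with the higher weight, and the last element of `shared` would be reported as its identification. How the actual pipeline thresholds the weights and combines them across levels of the tree is not specified in the abstract.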

Author Comment

This is an abstract that has been accepted for the WCMB.