A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments

Nicholas A Bokulich; Jai Ram Rideout; Evguenia Kopylova; Evan Bolyen; Jessica Patnode; Zach Ellett; Daniel McDonald; Benjamin Wolfe; Corinne F Maurice; Rachel J Dutton; Peter J Turnbaugh; Rob Knight; J Gregory Caporaso

doi:10.7287/peerj.preprints.934v1

Nicholas A Bokulich¹, Jai Ram Rideout², Evguenia Kopylova³, Evan Bolyen², Jessica Patnode⁴, Zach Ellett⁵, Daniel McDonald⁶,⁷, Benjamin Wolfe⁸, Corinne F Maurice⁹, Rachel J Dutton⁸, Peter J Turnbaugh¹⁰, Rob Knight³, J Gregory Caporaso⁴

1 Department of Medicine, New York University Langone Medical Center, New York, NY, USA

2 Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, 86011

3 Department of Pediatrics, University of California, San Diego, San Diego, CA, USA

4 Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, United States

5 Department of Computer Science, Northern Arizona University, Flagstaff, AZ, USA

6 Department of Computer Science, University of Colorado at Boulder, Boulder, CO, USA

7 BioFrontiers Institute, University of Colorado at Boulder, Boulder, CO, United States

8 FAS Center for Systems Biology, Harvard University, Cambridge, MA, USA

9 Department of Microbiology and Immunology, Microbiome and Disease Tolerance Centre, McGill University, Montreal, QC, Canada

10 Department of Microbiology and Immunology, University of California San Francisco, San Francisco, CA, USA

Published: 2015-03-27
Accepted: 2015-03-27

Subject Areas: Bioinformatics, Computational Biology, Ecology, Microbiology, Taxonomy
Keywords: bioinformatics, microbiome, executable paper, qiime, rRNA, ITS, microbial ecology, fungal diversity, bacterial diversity

Copyright: © 2015 Bokulich et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Bokulich NA, Rideout JR, Kopylova E, Bolyen E, Patnode J, Ellett Z, McDonald D, Wolfe B, Maurice CF, Dutton RJ, Turnbaugh PJ, Knight R, Caporaso JG. 2015. A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments. PeerJ PrePrints 3:e934v1 https://doi.org/10.7287/peerj.preprints.934v1

Abstract

Background: Taxonomic classification of marker-gene (i.e., amplicon) sequences represents an important step for molecular identification of microorganisms.

Results: We present three advances in our ability to assign and interpret taxonomic classifications of short marker-gene sequences: two new methods for taxonomy assignment, which reduce runtime by up to two-fold and achieve high-precision genus-level assignments; an evaluation of classification methods that highlights differences in performance with different marker genes and at different levels of taxonomic resolution; and an extensible framework for evaluating and optimizing new classification methods, which we hope will serve as a model for standardized and reproducible evaluations of bioinformatics methods.

Conclusions: Our new methods are accessible in QIIME 1.9.0, and our evaluation framework will support ongoing optimization of classification methods to complement rapidly evolving short-amplicon sequencing and bioinformatics technologies. Static versions of all of the analysis notebooks generated with this framework, which contain all code and analysis results, can be viewed at http://bit.ly/srta-010.

Author Comment

This paper is being submitted to Microbiome for peer review.

Supplemental Information

Supplementary Figure 6

Taxonomy classifier configuration and mock community composition alter assignment accuracy at the family level.

DOI: 10.7287/peerj.preprints.934v1/supp-6

Supplementary Figure 7

Taxonomy classifier configuration and mock community composition alter assignment accuracy at the species level.

DOI: 10.7287/peerj.preprints.934v1/supp-7

Supplementary Figure 8

Taxonomy classifier selection critically shapes assignment accuracy of simulated communities. Violin plots illustrate the distribution of precision, recall, and F-measure values across all simulated communities and all parameter configurations for a given method for family-level (left), genus-level (middle), or species-level (right) taxonomy assignments. Heavy dashed lines indicate median values; fine dashed lines indicate quartiles.

DOI: 10.7287/peerj.preprints.934v1/supp-8
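
The precision, recall, and F-measure values plotted in Supplementary Figure 8 are not defined on this page. As a rough illustration only, the sketch below assumes the standard set-based definitions, comparing the taxa a classifier reports against the taxa known to be in a community. The function name `precision_recall_f` and the example genus lists are hypothetical and are not taken from the paper or from QIIME.

```python
def precision_recall_f(observed, expected):
    """Return (precision, recall, F-measure) for two collections of taxon labels.

    Assumes the standard set-based definitions; the paper's exact
    definitions may differ.
    """
    observed, expected = set(observed), set(expected)
    true_pos = len(observed & expected)  # taxa both reported and truly present

    # Fraction of reported taxa that are truly present.
    precision = true_pos / len(observed) if observed else 0.0
    # Fraction of truly present taxa that were reported.
    recall = true_pos / len(expected) if expected else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical genus-level example (not data from the study).
expected_genera = {"Lactobacillus", "Staphylococcus", "Bacillus", "Escherichia"}
observed_genera = {"Lactobacillus", "Staphylococcus", "Pseudomonas"}

p, r, f = precision_recall_f(observed_genera, expected_genera)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

In this toy case two of the three reported genera are correct and two of the four expected genera are recovered, so a method can score well on precision while still missing half of the community.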

Supplementary Figure 1. Mock community datasets analyzed in this study

Supplementary Figure 2. Mock community A composition

Supplementary Figure 3. Mock community B composition

Supplementary Figure 4. Mock community C composition

Supplementary Figure 5. Mock community D composition