A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments

Department of Medicine, New York University Langone Medical Center, New York, NY, USA
Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011, USA
Department of Pediatrics, University of California, San Diego, San Diego, CA, USA
Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, United States
Department of Computer Science, Northern Arizona University, Flagstaff, AZ, USA
Department of Computer Science, University of Colorado at Boulder, Boulder, CO, USA
BioFrontiers Institute, University of Colorado at Boulder, Boulder, CO, United States
FAS Center for Systems Biology, Harvard University, Cambridge, MA, USA
Department of Microbiology and Immunology, Microbiome and Disease Tolerance Centre, McGill University, Montreal, QC, Canada
Department of Microbiology and Immunology, University of California San Francisco, San Francisco, CA, USA
DOI
10.7287/peerj.preprints.934v1
Subject Areas
Bioinformatics, Computational Biology, Ecology, Microbiology, Taxonomy
Keywords
bioinformatics, microbiome, executable paper, qiime, rRNA, ITS, microbial ecology, fungal diversity, bacterial diversity
Copyright
© 2015 Bokulich et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Bokulich NA, Rideout JR, Kopylova E, Bolyen E, Patnode J, Ellett Z, McDonald D, Wolfe B, Maurice CF, Dutton RJ, Turnbaugh PJ, Knight R, Caporaso JG. 2015. A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments. PeerJ PrePrints 3:e934v1

Abstract

Background: Taxonomic classification of marker-gene (i.e., amplicon) sequences is an important step in the molecular identification of microorganisms.

Results: We present three advances in our ability to assign and interpret taxonomic classifications of short marker-gene sequences: two new methods for taxonomy assignment, which reduce runtime up to two-fold and achieve high-precision genus-level assignments; an evaluation of classification methods that highlights differences in performance across marker genes and across levels of taxonomic resolution; and an extensible framework for evaluating and optimizing new classification methods, which we hope will serve as a model for standardized and reproducible evaluations of bioinformatics methods.

Conclusions: Our new methods are accessible in QIIME 1.9.0, and our evaluation framework will support ongoing optimization of classification methods to complement rapidly evolving short-amplicon sequencing and bioinformatics technologies. Static versions of all of the analysis notebooks generated with this framework, which contain all code and analysis results, can be viewed at http://bit.ly/srta-010.

Author Comment

This paper is being submitted to Microbiome for peer review.

Supplemental Information

Supplementary Figure 1. Mock community datasets analyzed in this study

DOI: 10.7287/peerj.preprints.934v1/supp-1

Supplementary Figure 2. Mock community A composition

DOI: 10.7287/peerj.preprints.934v1/supp-2

Supplementary Figure 3. Mock community B composition

DOI: 10.7287/peerj.preprints.934v1/supp-3

Supplementary Figure 4. Mock community C composition

DOI: 10.7287/peerj.preprints.934v1/supp-4

Supplementary Figure 5. Mock community D composition

DOI: 10.7287/peerj.preprints.934v1/supp-5

Supplementary Figure 6

Taxonomy classifier configuration and mock community composition alter assignment accuracy at the family level.

DOI: 10.7287/peerj.preprints.934v1/supp-6

Supplementary Figure 7

Taxonomy classifier configuration and mock community composition alter assignment accuracy at the species level.

DOI: 10.7287/peerj.preprints.934v1/supp-7

Supplementary Figure 8

Taxonomy classifier selection critically shapes assignment accuracy of simulated communities. Violin plots illustrate the distribution of precision, recall, and F-measure values across all simulated communities and all parameter configurations for a given method for family-level (left), genus-level (middle), and species-level (right) taxonomy assignments. Heavy dashed lines indicate median values; fine dashed lines indicate quartiles.

DOI: 10.7287/peerj.preprints.934v1/supp-8
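
For readers unfamiliar with the metrics named in Supplementary Figure 8, the following is a minimal sketch of how precision, recall, and F-measure could be computed for a single taxonomy assignment result. It assumes the standard set-based definitions (taxa observed versus taxa expected in a community of known composition); the function name and the example genus names are hypothetical and are not taken from the evaluation framework itself.

```python
def precision_recall_f(observed, expected):
    """Set-based precision, recall, and F-measure for taxonomy assignments.

    observed: taxa reported by a classifier (hypothetical input)
    expected: taxa known to be present in the community
    """
    observed, expected = set(observed), set(expected)
    true_pos = len(observed & expected)    # taxa correctly reported
    false_pos = len(observed - expected)   # taxa reported but not present
    false_neg = len(expected - observed)   # taxa present but not reported

    precision = true_pos / (true_pos + false_pos) if observed else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    denom = precision + recall
    # F-measure is the harmonic mean of precision and recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Illustrative genus-level example (names chosen only for demonstration)
expected_genera = {"Lactobacillus", "Bacteroides", "Escherichia", "Staphylococcus"}
observed_genera = {"Lactobacillus", "Bacteroides", "Clostridium"}
p, r, f = precision_recall_f(observed_genera, expected_genera)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

In this example two of the three reported genera are expected (precision 2/3) and two of the four expected genera are recovered (recall 1/2), so a classifier configuration can score well on one metric while scoring poorly on the other.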