A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments
- Subject Areas
- Bioinformatics, Computational Biology, Ecology, Microbiology, Taxonomy
- Keywords
- bioinformatics, microbiome, executable paper, qiime, rRNA, ITS, microbial ecology, fungal diversity, bacterial diversity
- Copyright
- © 2015 Bokulich et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
- Cite this article
- 2015. A standardized, extensible framework for optimizing classification improves marker-gene taxonomic assignments. PeerJ PrePrints 3:e934v2 https://doi.org/10.7287/peerj.preprints.934v2
Abstract
Background: Taxonomic classification of marker-gene (i.e., amplicon) sequences represents an important step for molecular identification of microorganisms.
Results: We present three advances in our ability to assign and interpret taxonomic classifications of short marker-gene sequences: two new methods for taxonomy assignment, which reduce runtime by up to two-fold and achieve high-precision genus-level assignments; an evaluation of classification methods that highlights differences in performance with different marker genes and at different levels of taxonomic resolution; and an extensible framework for evaluating and optimizing new classification methods, which we hope will serve as a model for standardized and reproducible bioinformatics methods evaluations.
Conclusions: Our new methods are accessible in QIIME 1.9.0, and our evaluation framework will support ongoing optimization of classification methods to complement rapidly evolving short-amplicon sequencing and bioinformatics technologies. Static versions of all of the analysis notebooks generated with this framework, which contain all code and analysis results, can be viewed at http://bit.ly/srta-012.
Author Comment
This is a revision following our response to comments from reviewers at Microbiome.
Supplemental Information
Supplementary Figure 1. Mock community datasets analyzed in this study
Supplementary Figure 2. Mock community A composition
Supplementary Figure 3. Mock community B composition
Supplementary Figure 4. Mock community C composition
Supplementary Figure 5. Mock community D composition
Supplementary Figure 6. Taxonomy classifier configuration and mock community composition alter assignment accuracy at the family level.
Supplementary Figure 7. Taxonomy classifier configuration and mock community composition alter assignment accuracy at the species level.
Supplementary Figure 8. Taxonomy classifier selection critically shapes assignment accuracy of simulated communities. Violin plots illustrate the distribution of precision, recall, and F-measure values across all simulated communities and all parameter configurations for a given method for family-level (left), genus-level (middle), or species-level (right) taxonomy assignments. Heavy dashed lines indicate median values; fine dashed lines indicate quartiles.
Supplementary Figure 9. Taxonomic lineages represented in reference databases
Supplementary Figure 10. Evaluation of mothur taxonomy classifier. A, Distribution of F-measure scores across all partial-reference simulated communities and all parameter configurations for each method for species-level taxonomy assignments (right). Heavy dashed lines indicate median values; fine dashed lines indicate quartiles. SM = SortMeRNA. B, Confidence configuration and simulated community composition alter assignment accuracy at the species level. See Figure 4 for a full description of the analysis and comparison to other classifiers and configurations.