Consistent, comprehensive and computationally efficient OTU definitions

Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China
Department of Computer Science, University of Colorado Boulder, Boulder, CO, USA
Department of Molecular, Cellular, and Developmental Biology, University of Colorado at Boulder, Boulder, CO, USA
Department of Chemistry and Biochemistry, University of Colorado Boulder, Boulder, CO, USA
Graduate Program in Biophysical Sciences, University of Chicago, Chicago, IL, USA
Institute for Genomics and Systems Biology, Argonne National Laboratory, Lemont, IL, USA
Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, USA
BioFrontiers Institute, University of Colorado at Boulder, Boulder, CO, USA
Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
Department of Pathology and Laboratory Medicine, Brown University, Providence, RI, USA
Howard Hughes Medical Institute, Boulder, CO, USA
DOI
10.7287/peerj.preprints.411v1
Subject Areas
Bioinformatics, Ecology, Microbiology
Keywords
OTU picking, microbial ecology, microbiome, qiime, bioinformatics
Copyright
© 2014 Rideout et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Rideout JR, He Y, Navas-Molina JA, Walters WA, Ursell LK, Gibbons SM, Chase JH, McDonald D, Gonzalez A, Robbins-Pianka A, Clemente JC, Gilbert J, Huse SM, Zhou H, Knight R, Caporaso JG. 2014. Consistent, comprehensive and computationally efficient OTU definitions. PeerJ PrePrints 2:e411v1

Abstract

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm improves on de novo OTU picking, because clustering can be performed largely in parallel, which reduces runtime, and on closed-reference OTU picking, because all reads are clustered rather than only those that match a reference database sequence with high similarity. Because parts of the algorithm can be run in parallel, it makes open-reference OTU picking tractable on massive amplicon sequence datasets. We illustrate this by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed. Through comparisons on three well-studied datasets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “legacy” open-reference OTU picking, in which less of the process can be parallelized. We therefore recommend that subsampled open-reference OTU picking always be used in place of “legacy” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package. Finally, we compare parameter settings in QIIME’s OTU picking workflows and recommend settings for these free parameters.
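
The abstract does not spell out the individual steps, so the following is only a toy sketch of one plausible reading of subsampled open-reference OTU picking, not the QIIME implementation. It assumes a four-step structure: closed-reference assignment against the database, de novo clustering of a random subsample of the unassigned reads to create new reference centroids, closed-reference assignment of all unassigned reads against those centroids, and a final de novo pass over whatever remains. The function names, the `new.` and `cleanup.` OTU prefixes, the `fraction` parameter, and the position-wise `identity` score (a stand-in for a real sequence aligner) are all hypothetical.

```python
import random


def identity(a, b):
    """Toy similarity: fraction of matching positions (stand-in for a real aligner)."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def closed_reference(reads, refs, threshold):
    """Assign each read to its best-matching reference if the match meets the threshold.

    Each read is handled independently, so this step could be run in parallel.
    Returns (otus, failures).
    """
    otus, failures = {}, []
    for read_id, seq in reads:
        best = max(refs, key=lambda r: identity(seq, refs[r]), default=None)
        if best is not None and identity(seq, refs[best]) >= threshold:
            otus.setdefault(best, []).append(read_id)
        else:
            failures.append((read_id, seq))
    return otus, failures


def de_novo(reads, threshold, prefix):
    """Greedy centroid clustering: a read joins the first centroid it matches,
    otherwise it seeds a new cluster. This step is inherently serial."""
    centroids, otus = {}, {}
    for read_id, seq in reads:
        for otu_id, centroid in centroids.items():
            if identity(seq, centroid) >= threshold:
                otus[otu_id].append(read_id)
                break
        else:
            otu_id = f"{prefix}{len(centroids)}"
            centroids[otu_id] = seq
            otus[otu_id] = [read_id]
    return centroids, otus


def subsampled_open_reference(reads, refs, threshold=0.97, fraction=0.1, seed=0):
    """Hypothetical four-step sketch; parameter names and defaults are assumptions."""
    rng = random.Random(seed)
    # Step 1: closed-reference assignment against the reference database.
    otus, failures = closed_reference(reads, refs, threshold)
    # Step 2: de novo cluster only a random subsample of the failures;
    # the resulting centroids act as a new reference set.
    k = max(1, int(len(failures) * fraction)) if failures else 0
    new_refs, _ = de_novo(rng.sample(failures, k), threshold, "new.")
    # Step 3: closed-reference assignment of all failures against the new centroids.
    new_otus, still_failed = closed_reference(failures, new_refs, threshold)
    otus.update(new_otus)
    # Step 4: de novo cluster the reads that still have no OTU.
    _, cleanup_otus = de_novo(still_failed, threshold, "cleanup.")
    otus.update(cleanup_otus)
    return otus


if __name__ == "__main__":
    refs = {"ref0": "AAAA"}
    reads = [("r1", "AAAA"), ("r2", "AAAT"), ("r3", "CCCC"),
             ("r4", "CCCG"), ("r5", "GGGG")]
    for otu_id, members in subsampled_open_reference(reads, refs, threshold=0.75).items():
        print(otu_id, members)
```

In this sketch only the two de novo passes are serial, and they operate on the subsample and on the residual reads rather than on the full set of unassigned reads, which is the property the abstract credits for making large runs tractable.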

Supplemental Information

Supplementary Data

Description of where to obtain data for this study.

DOI: 10.7287/peerj.preprints.411v1/supp-1