Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity

Department of Biological Sciences, Northern Arizona University, Flagstaff, Arizona, United States
NAU Environmental Genetics and Genomics Laboratory, Northern Arizona University, Flagstaff, Arizona, United States
Department of Computer Science, University of Colorado at Boulder, Boulder, Colorado, United States
Department of Pediatrics, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado, United States
Merriam-Powell Center for Environmental Research, Northern Arizona University, Flagstaff, Arizona, United States
DOI
10.7287/peerj.preprints.2196v1
Subject Areas
Biodiversity, Bioinformatics, Ecology, Microbiology
Keywords
QIIME, mock communities, amplicon, 16S, OTU, quality filtering, optimization, Illumina, sequencing error, microbial ecology
Copyright
© 2016 Krohn et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Krohn A, Stevens B, Robbins-Pianka A, Belus M, Allan GJ, Gehring C. 2016. Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity. PeerJ Preprints 4:e2196v1

Abstract

Diversity of complex microbial communities can be rapidly assessed by community amplicon sequencing of marker genes (e.g., 16S), often yielding many thousands of DNA sequences per sample. However, analysis of community amplicon sequencing data requires multiple computational steps, each of which affects the final data set. Here we use mock communities to describe the effects of parameter adjustments for raw sequence quality filtering, picking operational taxonomic units (OTUs), taxonomic assignment, and OTU table filtering as implemented in QIIME 1.9.1. Based on this exploration, we demonstrate an optimized workflow, which we also apply to environmental samples. We found that quality filtering of raw data and filtering of OTU tables had large effects on observed OTU diversity. While all taxonomy assigners performed with similar accuracy, the appropriate similarity threshold for defining OTUs depended on the method used for OTU picking. Our “default” analysis in QIIME overestimated mock community diversity by at least a factor of ten, whereas the optimized analysis correctly characterized the taxonomic composition of the mock communities while still overestimating OTU diversity by about a factor of two. Though observed relative abundances of mock community member taxa were approximately correct, most taxa were still represented by multiple OTUs. Low-frequency OTUs conspecific to constituent mock community taxa were characterized by multiple substitution and indel errors and by a low-quality base call that caused the sequence to be truncated during quality filtering. Low-quality base calls were observed at “G” positions most of the time, and were also associated with a preceding “TTT” trinucleotide motif. Environmental diversity estimates were about 40% lower under the optimized workflow than under the default workflow (1533 versus 2508 OTUs). We attribute this reduction in observed diversity to the removal of erroneous sequences from the data set. Our results indicate that both strict quality filtering of raw sequencing data and careful filtering of raw OTU tables are important steps for accurate estimation of microbial community diversity.
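As an illustration only, the following Python sketch shows one way the sequence context of low-quality base calls could be tallied, and how the reported reduction in environmental OTU counts works out numerically. It is not the analysis code used in this study: the function name `low_quality_context`, the Phred cutoff of 20, and the toy read with its quality scores are all hypothetical. Only the OTU counts (2508 and 1533) are taken from the abstract.

```python
from collections import Counter

def low_quality_context(seq, quals, threshold=20, k=3):
    """Tally the called base and the preceding k-mer at each low-quality position.

    threshold (Phred cutoff) and k are illustrative assumptions, not values
    taken from the study.
    """
    bases = Counter()
    motifs = Counter()
    for i, q in enumerate(quals):
        if q < threshold:
            bases[seq[i]] += 1
            if i >= k:  # only count motifs when a full k-mer precedes the base
                motifs[seq[i - k:i]] += 1
    return bases, motifs

# Toy read with made-up Phred scores (hypothetical, not data from the study).
seq = "ACTTTGACGTTTGCA"
quals = [38, 37, 36, 35, 34, 12, 36, 37, 30, 33, 32, 31, 9, 35, 15]
bases, motifs = low_quality_context(seq, quals)

# Relative reduction in environmental OTUs reported in the abstract.
default_otus, optimized_otus = 2508, 1533
reduction_pct = 100 * (default_otus - optimized_otus) / default_otus

print(bases.most_common())
print(motifs.most_common())
print(f"{reduction_pct:.1f}% fewer OTUs")
```

On real data, the same tally would be run over every read in a FASTQ file rather than a single toy sequence, and the cutoff would be matched to the quality threshold actually used in filtering.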

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Supplemental tables and figures

DOI: 10.7287/peerj.preprints.2196v1/supp-1