Minimizing spurious features in 16S rRNA gene amplicon sequencing

Jing Wang; Qianpeng Zhang; Guojun Wu; Chenhong Zhang; Menghui Zhang; Liping Zhao

doi:10.7287/peerj.preprints.26872v1

State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Ministry of Education Key Laboratory of Systems Biomedicine, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China

Published: 2018-04-19
Accepted: 2018-04-19

Subject Areas: Bioinformatics, Microbiology, Molecular Biology
Keywords: 16S rRNA gene, sequencing error, abundance filtering

Copyright: © 2018 Wang et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Wang J, Zhang Q, Wu G, Zhang C, Zhang M, Zhao L. 2018. Minimizing spurious features in 16S rRNA gene amplicon sequencing. PeerJ Preprints 6:e26872v1 https://doi.org/10.7287/peerj.preprints.26872v1

Abstract

16S rRNA gene amplicon sequencing is a widely used high-throughput method for taxonomic inference in microbial communities. Many data analysis pipelines have been developed to reflect the real taxonomy more accurately and so better guide downstream identification, isolation and mechanistic studies. Although these pipelines adopt rigorous quality filtration steps, we found with well-designed mock and simulated data sets that they still yielded widely divergent numbers of spurious features, owing to “pseudo sequences” artificially generated during PCR and sequencing. These pseudo sequences were low in abundance and were determined to be unreliable by a weighted re-sampling test. To minimize their influence on taxonomic characterization, we propose a two-step approach: an abundance filtering (AF) step followed by an AF-based OTU picking and remapping (AOR) step. The approach efficiently decreases spurious OTUs, sequences or oligotyping features and improves Matthews correlation coefficient (MCC) values in OTU clustering. It can be easily integrated with popular 16S rRNA sequencing data analysis pipelines, making the number of OTUs and the alpha and beta diversities obtained from different pipelines more consistent with the real structure of microbial communities.
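
The two steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the position-by-position identity measure on equal-length sequences, the greedy clustering and the "abundance >= threshold" rule are all assumptions made for illustration. In practice the OTU picking would be done by an existing pipeline and identity by alignment.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy stand-in for alignment-based identity)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def abundance_filter(counts, threshold):
    """AF step: split unique sequences by abundance (assumed rule: keep if >= threshold)."""
    kept = {s: n for s, n in counts.items() if n >= threshold}
    held = {s: n for s, n in counts.items() if n < threshold}
    return kept, held


def greedy_otus(kept, similarity=0.97):
    """Toy OTU picking: the most abundant sequences seed OTUs first."""
    otus = {}
    for seq, n in sorted(kept.items(), key=lambda kv: (-kv[1], kv[0])):
        for centroid in otus:
            if identity(seq, centroid) >= similarity:
                otus[centroid] += n
                break
        else:
            # No existing centroid is close enough, so this sequence seeds a new OTU.
            otus[seq] = n
    return otus


def remap(otus, held, similarity=0.97):
    """Remapping step: assign held-back sequences to their closest OTU or discard them."""
    result = dict(otus)
    discarded = 0
    for seq, n in held.items():
        best = max(result, key=lambda c: identity(seq, c), default=None)
        if best is not None and identity(seq, best) >= similarity:
            result[best] += n
        else:
            discarded += n
    return result, discarded
```

In this sketch, low-abundance sequences never seed an OTU; they only add to the count of an OTU they resemble at 97% identity or more, which is how the remapping keeps most reads while limiting spurious OTUs.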

Author Comment

This is a preprint submission to PeerJ Preprints.

Supplemental Information

Figure S1 The OTUs obtained by AOR approach in Mock data

(a-c) The number of OTUs decreased to 22 at the thresholds; (d-f) the total ratio of sequences remapped back to OTUs remained >99%; (g-i) the MCC values increased to >0.95, indicating ideal OTU delineation quality. The alternative x axis at the bottom indicates how many sequences did not take part in the initial OTU delineation at each threshold level. After OTU delineation, qualified unique sequences were remapped to OTUs at a 97% similarity threshold. Dots indicate the original results of the corresponding OTU delineation methods.

DOI: 10.7287/peerj.preprints.26872v1/supp-1

Figure S2 Coefficient of variation (a-d) and the 99% confidence intervals of bootstrapped abundance (e-h) in (a, e) PWS, (b, f) Ultra, (c, g) River and (d, h) Water data

The coefficient of variation decreased quickly as sequence abundance increased. The distribution of bootstrapped abundance included zero when the abundance was very low. Dashed vertical lines show the abundance thresholds for OTU delineation.

DOI: 10.7287/peerj.preprints.26872v1/supp-2
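
The quantities plotted in Figure S2 can be sketched as follows. This is a hedged illustration, not the supplemental bootstrapping code: it assumes that reads are resampled with replacement with weights proportional to the observed abundances, and that the 99% interval is taken from percentiles of the resampled counts. The function names and the default of 500 replicates are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def bootstrap_abundance(counts, n_boot=500, seed=0):
    """Resample all reads with replacement, weighted by observed abundance."""
    rng = random.Random(seed)
    ids = list(counts)
    weights = [counts[i] for i in ids]
    total = sum(weights)
    draws = {i: [] for i in ids}
    for _ in range(n_boot):
        sample = Counter(rng.choices(ids, weights=weights, k=total))
        for i in ids:
            draws[i].append(sample[i])  # a Counter yields 0 for sequences not drawn
    return draws


def summarize(draws, level=0.99):
    """Coefficient of variation and percentile interval for each sequence."""
    summary = {}
    for seq_id, values in draws.items():
        ordered = sorted(values)
        n = len(ordered)
        lower = ordered[int((1 - level) / 2 * n)]
        upper = ordered[min(n - 1, int((1 + level) / 2 * n))]
        m = mean(ordered)
        cv = pstdev(ordered) / m if m else float("inf")
        summary[seq_id] = {"cv": cv, "ci": (lower, upper)}
    return summary
```

Under these assumptions, a sequence whose interval includes zero could be absent from a resampled data set, so one possible threshold is the lowest abundance whose lower bound stays above zero.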

Figure S3 The OTUs obtained by AOR approach in (a) PWS, (b) Ultra, (c) River and (d) Water data sets

The vertical dashed lines indicate the thresholds set by bootstrap resampling. Different pipelines obtained similar numbers of OTUs at these thresholds. Dots indicate the original results of the corresponding OTU delineation methods.

DOI: 10.7287/peerj.preprints.26872v1/supp-3

Figure S4 The MCC value in (a) PWS, (b) Ultra, (c) River and (d) Water data sets increased along with the threshold

After OTU delineation, all “qualified sequences” were remapped to OTUs with 97% similarity. Dots indicate the original results of corresponding OTU delineation methods.

DOI: 10.7287/peerj.preprints.26872v1/supp-4
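
For readers unfamiliar with MCC as an OTU quality measure, the sketch below shows one common pairwise formulation. This listing does not state which formulation the authors used, so the definition here is an assumption: a pair of sequences is a true positive when the two are similar (for example at least 97% identical) and share an OTU, a false positive when they share an OTU without being similar, and so on. The function name and the `similar` callback are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pairwise_mcc(assignments, similar):
    """MCC over all sequence pairs.

    assignments: dict mapping sequence id -> OTU label
    similar:     function (id_a, id_b) -> True if the pair is within the threshold
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(assignments), 2):
        same_otu = assignments[a] == assignments[b]
        close = similar(a, b)
        if same_otu and close:
            tp += 1
        elif same_otu:
            fp += 1
        elif close:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A value of 1 means that every similar pair shares an OTU and no dissimilar pair does.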

Figure S5 AOR resulted in fewer OTUs but comparable alpha diversity in PWS (a-d), Ultra (e-h), River (i-l) and Water (m-p) data

(a, e, i, m) Number of OTUs, (b, f, j, n) Chao1 indices, (c, g, k, o) Simpson indices and (d, h, l, p) Shannon indices per sample were calculated. Multiple comparisons were performed using the Wilcoxon test; p values were adjusted by the FDR method.

DOI: 10.7287/peerj.preprints.26872v1/supp-5
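
The three per-sample indices named above can be computed from a vector of OTU counts as in the sketch below. Several variants of each index exist and the listing does not say which were used, so the choices here are assumptions: the bias-corrected Chao1 estimator, the Simpson index as 1 minus the sum of squared proportions, and the Shannon index with the natural logarithm.

```python
from math import log


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for n in counts if n > 0)
    singletons = sum(1 for n in counts if n == 1)
    doubletons = sum(1 for n in counts if n == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def simpson(counts):
    """Simpson index as 1 - sum(p_i^2) (the Gini-Simpson form, assumed here)."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)


def shannon(counts):
    """Shannon index with the natural logarithm (assumed base)."""
    total = sum(counts)
    return -sum((n / total) * log(n / total) for n in counts if n > 0)
```

Because Chao1 depends on singletons and doubletons, it is the index most directly affected when low-abundance sequences are filtered out.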

Figure S6 AOR resulted in more consistent beta diversity among methods in (a) PWS, (b) Ultra, (c) River and (d) Water data

Mantel r statistics were obtained by comparing the beta diversity distance matrices between each pair of analysis methods, using the original results (red) and the results with the AOR approach incorporated (blue).

DOI: 10.7287/peerj.preprints.26872v1/supp-6
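
As a reminder of what the Mantel r statistic measures, the sketch below computes it as the Pearson correlation between the off-diagonal entries of two symmetric distance matrices. The function names are hypothetical, and the permutation test that usually accompanies the statistic is omitted.

```python
from math import sqrt


def upper_triangle(matrix):
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def mantel_r(d1, d2):
    """Pearson correlation between the upper triangles of two distance matrices."""
    x, y = upper_triangle(d1), upper_triangle(d2)
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("matrices must have the same size, at least 3 x 3")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("distances must not all be identical")
    return cov / (sx * sy)
```

An r close to 1 between two pipelines means that they rank the between-sample distances in nearly the same way.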

Table S1 The construction of mock communities

DOI: 10.7287/peerj.preprints.26872v1/supp-7

Table S2 The 87 references used in simulated data

DOI: 10.7287/peerj.preprints.26872v1/supp-8

Table S3 The average error rates of the raw sequences reported by sequencing machine, QC sequences passing different quality control methods, the final qualified sequences for OTU delineation, and the qualified sequences pre-clustered with up to 1 differe

DOI: 10.7287/peerj.preprints.26872v1/supp-9

Table S4 The number of sequences that passed quality filtration using different methods

DOI: 10.7287/peerj.preprints.26872v1/supp-10

Table S5 The abundance threshold of unreliable sequences

DOI: 10.7287/peerj.preprints.26872v1/supp-11

Code for bootstrapping

DOI: 10.7287/peerj.preprints.26872v1/supp-12
