Estimating and comparing microbial diversity in the presence of sequencing errors
- Subject Areas
- Biodiversity, Ecology, Mathematical Biology, Microbiology, Statistics
- Keywords
- Extrapolation, Hill numbers, Microbial diversity, Rarefaction, Sample coverage, Standardization, Good–Turing frequency theory
- Copyright
- © 2015 Chiu et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
- Cite this article
- 2015. Estimating and comparing microbial diversity in the presence of sequencing errors. PeerJ PrePrints 3:e1353v1 https://doi.org/10.7287/peerj.preprints.1353v1
Abstract
Estimating and comparing microbial diversity is statistically challenging because sampling is limited and sequencing errors can affect low-frequency counts, producing spurious singletons. The inflated singleton count seriously affects statistical analysis and inferences about microbial diversity. Previous statistical approaches to sequencing errors generally require parametric assumptions about the sampling model or about the functional form of the frequency counts, and different parametric assumptions may lead to drastically different diversity estimates. We focus on nonparametric methods, which are valid under all such parametric assumptions and can be used to compare diversity across communities. We develop here for the first time a nonparametric estimator of the true singleton count to replace the spurious singleton count. Our estimator of the true singleton count is expressed in terms of the frequency counts of doubletons, tripletons and quadrupletons. To quantify microbial diversity, we adopt Hill numbers (effective numbers of taxa) under a nonparametric framework. Hill numbers, parameterized by an order q that determines the measures’ emphasis on rare or common species, include taxa richness (q=0), Shannon diversity (q=1), and Simpson diversity (q=2). Based on the estimated singleton count and the original non-singleton frequency counts, two statistical approaches are developed to compare microbial diversity across multiple communities: (1) a non-asymptotic approach based on standardizing sample size or sample completeness via seamless rarefaction and extrapolation sampling curves of Hill numbers; and (2) an asymptotic approach based on a continuous diversity (Hill number) profile, which depicts the estimated asymptotes of diversities as a function of order q.
Replacing the spurious singleton count with our estimated count largely removes the positive bias that spurious singletons introduce into diversity estimates under both approaches and allows fair comparisons across microbial communities, as illustrated by applying our method to sequencing data from viral metagenomes.
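As a rough illustration of the Hill numbers named in the abstract, the sketch below computes the empirical (plug-in) Hill number of order q from a vector of taxon counts, using the standard definition: the sum of p_i^q raised to the power 1/(1-q) for q ≠ 1, and the exponential of Shannon entropy in the limit q = 1. It is not the authors' estimator: it applies no singleton correction and estimates no asymptote. The function name `hill_number` and the example counts are assumptions made only for illustration.

```python
import math


def hill_number(counts, q):
    """Empirical (plug-in) Hill number of order q from raw taxon counts.

    Illustrative only: no singleton adjustment and no asymptotic
    estimation, unlike the estimators described in the abstract.
    """
    total = sum(counts)
    # Relative abundances of the observed taxa (zero counts are ignored).
    props = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        # q = 1 is defined as a limit: the exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


if __name__ == "__main__":
    # Hypothetical abundance counts with three singletons.
    example_counts = [50, 30, 10, 5, 2, 1, 1, 1]
    for order in (0, 1, 2):
        print(order, hill_number(example_counts, order))
```

With q = 0 every observed taxon counts equally, so each singleton adds a full unit to the richness value, whereas higher orders weight taxa by abundance and are far less sensitive to rare taxa. This is consistent with the abstract's concern that spurious singletons inflate diversity estimates.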
Author Comment
This is a submission to PeerJ for review.
Supplemental Information
R codes
R codes for obtaining estimators of Hill numbers
Simulation results
Simulation results based on six species abundance models
Diversity analyses for the data sets in Allen et al. (2013)
Orange cells: original data and the Chao1 estimate for the original data. Yellow cells: empirical taxa richness and estimated asymptotes of diversities for the adjusted data, i.e., data with the original singleton count being replaced by the estimated value computed from Equation (5) of the main text, and SE is obtained by a bootstrap method. Green cells: taxa richness estimate from CatchAll (2012).