Estimating intraspecific genetic diversity from community DNA metabarcoding data

Vasco Elbrecht; Ecaterina Edith Vamos; Dirk Steinke; Florian Leese

doi:10.7287/peerj.preprints.3269v4

Vasco Elbrecht^1,2, Ecaterina Edith Vamos^1, Dirk Steinke^2, Florian Leese^1,3

1 Aquatic Ecosystem Research, University of Duisburg-Essen, Essen, NRW, Germany

2 Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada

3 Centre for Water and Environmental Research (ZWU) Essen, University of Duisburg-Essen, Essen, NRW, Germany

Published: 2018-03-23
Accepted: 2018-03-23

Subject Areas: Biogeography, Bioinformatics, Molecular Biology, Freshwater Biology
Keywords: metabarcoding, high-throughput sequencing, population genetics, haplotyping, ecosystem assessment, exact sequence variant (ESV), CO1

Copyright: © 2018 Elbrecht et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Elbrecht V, Vamos EE, Steinke D, Leese F. 2018. Estimating intraspecific genetic diversity from community DNA metabarcoding data. PeerJ Preprints 6:e3269v4 https://doi.org/10.7287/peerj.preprints.3269v4

Abstract

Background. DNA metabarcoding is used to generate species composition data for entire communities. However, sequencing errors in high-throughput sequencing instruments are fairly common, so reads are usually clustered into operational taxonomic units (OTUs), and information on intraspecific diversity is lost in the process. While COI haplotype information is limited in resolution, it is nevertheless useful in a phylogeographic context, helping to formulate hypotheses on taxon dispersal.

Methods. This study combines sequence denoising strategies, normally applied in microbial research, with additional abundance-based filtering to extract haplotypes from freshwater macroinvertebrate metabarcoding data sets. This novel approach was added to the R package "JAMP" and can be applied to Cytochrome c oxidase subunit I (COI) amplicon datasets. We tested our haplotyping method by sequencing i) a single-species mock community composed of 31 individuals with different haplotypes spanning three orders of magnitude in biomass and ii) 18 monitoring samples each amplified with four different primer sets and two PCR replicates.

Results. We detected all 15 haplotypes of the single specimens in the mock community with relaxed filtering and denoising settings. However, up to 480 additional unexpected haplotypes remained in both replicates. Rigorous filtering removed most unexpected haplotypes, but could also discard expected haplotypes, mainly those from the small specimens. In the monitoring samples, the different primer sets detected 177 to 200 OTUs, with an average of 2.40 to 3.30 haplotypes per OTU. Population structures were consistent between replicates, and similar between primer pairs, depending on the primer length. A closer look at abundant taxa in the data set revealed various population genetic patterns: Taeniopteryx nebulosa and Hydropsyche pellucidula showed a north-south difference in haplotype distribution, while Oulimnius tuberculatus and Asellus aquaticus displayed no clear population pattern but differed in genetic diversity.

Discussion. We developed a strategy to infer intraspecific genetic diversity from bulk invertebrate monitoring samples using metabarcoding data. It needs to be stressed that at this point metabarcoding-informed haplotyping is not capable of capturing the full diversity present in such samples, due to variation in specimen size, primer bias and loss of sequence variants with low abundance. Nevertheless, intraspecific diversity was recovered for a high number of species, identifying potentially isolated populations and potential taxa for further, more detailed phylogeographic investigation. While we currently lack large-scale metabarcoding data sets to take full advantage of our new approach, metabarcoding-informed haplotyping holds great promise for biomonitoring efforts that seek information not only about biological diversity but also about the underlying genetic diversity.

Author Comment

Updated description of Table S1 and Scripts S1 (both were accidentally called Table S1).

Supplemental Information

Figure S1: Schematic overview of errors affecting metabarcoding data and clustering / denoising strategies to reduce them

Overview of the metabarcoding process, with key biases potentially affecting sequence accuracy (shown in red). In the bulk sample (A) several species with different biomass (indicated by circle size) and distinct haplotypes (indicated by colour) are present. After tissue homogenization and DNA extraction the COI marker is amplified using PCR (B), which can not only skew sequence abundance but also fail to amplify taxa due to primer bias (Elbrecht & Leese, 2015) or insufficient sequencing depth in the case of underrepresented / rare taxa (Elbrecht, Peinert & Leese, 2017). In the process of HTS (C) many new false sequence variants are generated due to sequencing errors (Schirmer et al., 2015), chimera formation (Edgar et al., 2011) and mixing of multiplexed samples (Esling, Lejzerowicz & Pawlowski, 2015; Schnell, Bohmann & Gilbert, 2015). The impact of these errors is usually reduced by strict quality filtering and clustering of similar sequences into operational taxonomic units (OTUs). Normally, only the most abundant sequence in an OTU is considered and used to identify the respective species, which in turn means that information on genetic diversity is lost (Callahan, McMurdie & Holmes, 2017) (D). Recently, alternative denoising strategies have been developed to remove sequences affected by errors from a dataset and retain the actual haplotype sequences present in a sample (Eren et al., 2015; Edgar & Flyvbjerg, 2015; Callahan et al., 2016; Amir et al., 2017). Figure based on Figure S1 in Callahan et al. (2016).

DOI: 10.7287/peerj.preprints.3269v4/supp-1

Figure S2: Overview of the haplotyping strategy used here and its implementation in the JAMP R package

Detailed bioinformatic processing of metabarcoding data to extract haplotype sequences using the JAMP R package. A) Metabarcoding raw data is processed and quality filtered. These steps are integrated in JAMP, but most other standard metabarcoding pipelines could be used as well. B) The processed and quality filtered samples from step A would usually be clustered into operational taxonomic units, but are here additionally filtered (retaining only reads of the expected amplicon length and discarding reads of low abundance) and then denoised. C) In denoising with usearch unoise3, the strictness of denoising is controlled by the alpha value (a low alpha leaves less noise, but discards more true haplotypes). D) The denoised reads (= haplotypes) are clustered into OTUs by similarity, and the abundance of each haplotype in each sample is exported in a table. E) The haplotype table is additionally filtered using different thresholds, to reduce the presence of low-abundance OTUs and haplotypes and increase data reliability. F) The final filtered haplotype table can be used for phylogeographic and population genetic analysis.

DOI: 10.7287/peerj.preprints.3269v4/supp-2
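
The abundance-based filtering of the haplotype table in step E can be illustrated with a minimal Python sketch. This is not the JAMP implementation (JAMP is an R package): the function name `filter_haplotype_table`, the table layout (haplotype name mapped to read counts per sample) and the threshold `min_rel_abundance` are hypothetical and chosen only for illustration. The sketch sets haplotype counts below a relative-abundance threshold to zero within each sample and then drops haplotypes that no longer occur in any sample.

```python
def filter_haplotype_table(table, min_rel_abundance=0.001):
    """Zero out low-abundance haplotypes per sample, then drop empty rows.

    table: dict mapping haplotype name -> list of read counts, one per sample.
    min_rel_abundance: hypothetical threshold (fraction of a sample's reads).
    """
    if not table:
        return {}
    n_samples = len(next(iter(table.values())))
    # Total number of reads in each sample (column sums).
    totals = [sum(counts[i] for counts in table.values()) for i in range(n_samples)]
    filtered = {}
    for haplotype, counts in table.items():
        kept = [
            c if totals[i] > 0 and c / totals[i] >= min_rel_abundance else 0
            for i, c in enumerate(counts)
        ]
        # Keep the haplotype only if it survives in at least one sample.
        if any(kept):
            filtered[haplotype] = kept
    return filtered


# Toy table with two samples; the values are made up for illustration.
toy_table = {"hap_A": [900, 500], "hap_B": [99, 500], "hap_C": [1, 0]}
toy_filtered = filter_haplotype_table(toy_table, min_rel_abundance=0.01)
```

In this toy example hap_C makes up only 0.1 % of the first sample and is absent from the second, so it is removed, while hap_A and hap_B are retained unchanged.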

Figure S3: Effect of different quality filtering (max ee) on reads of the single species mock sample

Effect of different expected error filtering thresholds on haplotype recovery (no denoising applied). All filtered reads are mapped against the expected haplotypes (black circles). Not all reads are shared between both replicates (indicated by A or B instead of a circle). The 15 expected haplotypes are shown in black, while unexpected ones are highlighted in gray or blue. Error bars show the standard deviation of relative read abundance between both replicates, for the respective haplotype.

DOI: 10.7287/peerj.preprints.3269v4/supp-3
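
For readers unfamiliar with expected error (max ee) filtering, the following Python sketch shows how the expected number of errors in a read is commonly computed from its Phred quality scores: each score Q corresponds to an error probability of 10^(-Q/10), and the probabilities are summed over the read. The function names and the Phred+33 encoding are assumptions made for illustration; the sketch is not taken from JAMP or usearch.

```python
def expected_errors(quality, offset=33):
    """Sum of per-base error probabilities for a FASTQ quality string.

    Assumes Phred+33 encoding by default: Q = ord(char) - offset,
    and the error probability of a base is 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(char) - offset) / 10) for char in quality)


def passes_max_ee(quality, max_ee, offset=33):
    """True if the read's expected errors do not exceed max_ee (hypothetical helper)."""
    return expected_errors(quality, offset) <= max_ee


# 'I' encodes Q40 (p = 0.0001), '5' encodes Q20 (p = 0.01), '+' encodes Q10 (p = 0.1).
ee_high_quality = expected_errors("IIII")
ee_mixed = expected_errors("I5+")
```

A stricter (lower) max ee threshold discards more reads, which is the trade-off explored in the figure.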

Figure S4: Effect of different alpha values in read denoising of the single-species mock sample

Haplotype recovery in the single-species mock sample when using different alpha values with Unoise3 (as integrated in the JAMP package). Not all reads are shared between both replicates (indicated by A or B instead of a circle). The 15 expected haplotypes are shown in black, while unexpected ones are highlighted in gray or blue. Error bars show the standard deviation of relative read abundance between both replicates, for the respective haplotype.

DOI: 10.7287/peerj.preprints.3269v4/supp-4
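
The role of the alpha value can be sketched with the abundance-skew rule from the published description of the UNOISE algorithm, where a sequence differing at d positions from a more abundant sequence is treated as noise if its abundance ratio (skew) is at most beta(d) = 1 / 2^(alpha * d + 1). The Python code below is a simplified illustration under that assumption. The function names are hypothetical, and the actual usearch implementation includes further steps (for example minimum abundance and chimera filtering) that are not reproduced here.

```python
def beta(d, alpha=2.0):
    """Maximum abundance skew at which a variant with d differences counts as noise."""
    return 1.0 / 2 ** (alpha * d + 1)


def looks_like_noise(abundance, centroid_abundance, d, alpha=2.0):
    """Simplified noise test: compare the abundance skew against beta(d)."""
    skew = abundance / centroid_abundance
    return skew <= beta(d, alpha)


# A variant with 10 reads, one mismatch away from a sequence with 100 reads.
noise_at_low_alpha = looks_like_noise(10, 100, d=1, alpha=2.0)
noise_at_high_alpha = looks_like_noise(10, 100, d=1, alpha=5.0)
```

With alpha = 2 the threshold beta(1) is 0.125, so the variant (skew 0.1) is classified as noise and removed. With alpha = 5 the threshold drops to about 0.016 and the variant is kept, which matches the behaviour that lower alpha values denoise more strictly at the cost of discarding more true haplotypes.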

Figure S5: Bar plots of haplotype distribution within each OTU

Bar plots showing the haplotype composition of all 199 OTUs obtained with the BF2+BR2 primer combination. The OTU number is indicated above each bar, with the four taxa shown in Figure 2 being highlighted. Haplotypes are shown in different colours, with white bars indicating the proportion of sites where the respective OTU was not detected. Most OTUs were only present at a few sample sites.

DOI: 10.7287/peerj.preprints.3269v4/supp-5

Figure S6: Detailed plots of four taxa from the denoised multi-species monitoring samples, showing haplotype maps & networks, similarity between replicates and sequence alignments

Detailed haplotype maps, networks and sequence alignments for all four primer combinations and replicates of selected taxa. a) Haplotype maps for both replicates for each of the four primer combinations. For A. aquaticus only the 10 most common haplotypes are shown in different colours (remaining ones in white). For each primer combination, the haplotypes in the map and network have the same corresponding colours. b) Haplotype networks for each primer pair. Each cross line represents one base pair difference between the respective haplotypes. Haplotypes present in just one replicate are indicated by A or B next to the network node. Dashed lines around a circle indicate novel haplotypes that were not available in the BOLD reference database. c) Quantification of similarity between both replicates, by plotting the abundance of individual haplotypes at each sampling point against each other. The red line indicates the best fit (with significance and adjusted R² value given in each plot). d) Sequence alignment of all haplotypes, with mismatching nucleotides between sequences highlighted (green = T, red = A, yellow = G and blue = C). See the following pages for example plots: page 2, Taeniopteryx nebulosa; page 3, Hydropsyche pellucidula; page 4, Oulimnius tuberculatus; page 5, Asellus aquaticus.

DOI: 10.7287/peerj.preprints.3269v4/supp-6

Table S1: Finland haplotype table (for all 4 different primer combinations)

DOI: 10.7287/peerj.preprints.3269v4/supp-7


Scripts S1: Metabarcoding and denoising pipeline, and additional scripts used to produce the figures
