Development and validation of DNA metabarcoding COI primers for aquatic invertebrates using the R package "PrimerMiner"
- Published
- Accepted
- Subject Areas
- Biodiversity, Bioinformatics, Ecology, Genetics, Molecular Biology
- Keywords
- Primer Development, DNA metabarcoding, ecosystem assessment, data mining, primer bias, in silico PCR
- Copyright
- © 2016 Elbrecht et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2016. Development and validation of DNA metabarcoding COI primers for aquatic invertebrates using the R package "PrimerMiner" PeerJ Preprints 4:e2044v2 https://doi.org/10.7287/peerj.preprints.2044v2
Abstract
1) DNA metabarcoding is a powerful tool to assess biodiversity by amplifying and sequencing a standardized gene marker region. However, typically used barcoding genes, such as the cytochrome c oxidase subunit I (COI) region for animals, are highly variable. Thus, different taxa in the communities under study are often not amplified equally well, and some might even remain undetected due to primer bias. To reduce these problems, optimized metabarcoding primers for the typical communities found in certain geographic regions and/or ecosystems are necessary.
2) We developed the R package PrimerMiner, which batch downloads DNA barcode gene sequences from the BOLD and NCBI databases for specified target taxonomic groups and then applies sequence clustering to reduce biases introduced by the differing numbers of available sequences per species. We downloaded COI data for the 15 most relevant freshwater invertebrate groups for stream ecosystem assessment and developed four primer sets with high base degeneracy based on these data. Primer performance was tested by sequencing ten mock community samples, each consisting of 52 freshwater invertebrate taxa. Additionally, we used PrimerMiner to evaluate the developed primers against other metabarcoding primers in silico.
3) The developed primers varied in amplification efficiency and in the number of detected taxa, yet all retrieved more taxa than standard Folmer barcoding primers. Additionally, the BF/BR primers amplified taxa very consistently, with the BF2+BR2 and BF2+BR1 primer combinations showing better amplification than a previously tested ribosomal marker (16S). Except for BF1+BR1, all BF/BR primer combinations detected all 42 insect taxa present in the mock community samples. In silico evaluation of the developed primers demonstrates their suitability for metabarcoding of non-aquatic insect samples.
4) With PrimerMiner we provide a useful tool to obtain relevant sequence data for targeted primer development and evaluation. Our sequence datasets generated with the newly developed metabarcoding primers demonstrate that the design of optimized primers with high base degeneracy is superior to classical markers and enables us to detect almost 100% of animal taxa present in a sample using the standard COI barcoding gene. Therefore, the PrimerMiner package and the developed primers are useful beyond biodiversity assessment in aquatic ecosystems.
Author Comment
Revised version: improved spelling and flow of the text; final primer in silico evaluation data, the scripts used and supplementary tables added. PrimerMiner v0.6, which includes primer evaluation tools, has also been released: https://github.com/VascoElbrecht/PrimerMiner/
Supplemental Information
Fig. S1: Overview of obtained spots per sample and number of sequences lost in bioinformatic processing
A: Number of PE reads obtained for each sample, and proportion of PhiX and COI sequences without matching tags. Numbers above bars give the proportion of reads as a percentage. B: Number of sequences excluded in each major bioinformatic processing step for each sample. The size of the amplified region (not the fragment size) is given below in boxes for each primer combination.
Fig. S2: Overview of the base composition of the COI Folmer region for the 15 most important freshwater groups
The plot of group base composition was generated with PrimerMiner and used to develop the BF / BR primers and to manually evaluate other common metabarcoding primers. The sequences for the Folmer binding regions (opaque colours) were downloaded in February 2015 and had 26 bp clipping applied, as many clusters were affected by sequences which still contained the primer sequences. Sequences from the Folmer region were downloaded in April 2015 and were not trimmed, as only the region amplified by the primers was used.
Fig. S3: Overview of used fusion sequencing primers
They contain standard Illumina flow cell binding and primer binding sites as well as inline tags to distinguish between multiplexed samples. They can be used to amplify the target COI barcoding region, and PCR products can be sequenced directly after PCR clean-up. We recommend using the parallel sequencing strategy outlined in Elbrecht & Leese 2015, which maximises sequence diversity for sequencing and doubles the number of samples that can be tagged (up to 288). See Figure S4 for ideal tagging combinations.
Fig. S4: Matrix of similarities between all possible primer combinations using 5 bp inline tags
For some primers, several tagging sequences are shown because of nucleotide degeneracy in the tag sequences. Primer combinations that are similar at 4 sites (orange background) should be avoided for tagging, as a single read error could lead to mistagging (blue squares). Because we use a parallel sequencing approach, combinations like BF22+BR11 should also be avoided, as both forward and reverse reads could occur together in sequencing read 1 or 2, possibly leading to mistagging. With the presented primer sets, a total of 276 samples can be securely tagged (excluding the problematic primer sequences in blue squares). Number of good tagging combinations for each primer set (tagging possibilities are doubled when using parallel sequencing, see Elbrecht & Leese 2015).
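The similarity check described for Fig. S4 can be illustrated with a minimal sketch. It assumes plain (non-degenerate) 5 bp tags and simply counts the positions at which two tags carry the same base; the tag names and sequences below are invented for illustration and are not the tags used in this study.

```python
from itertools import combinations


def shared_sites(tag_a: str, tag_b: str) -> int:
    """Count positions at which two equal-length tags carry the same base."""
    if len(tag_a) != len(tag_b):
        raise ValueError("tags must have the same length")
    return sum(a == b for a, b in zip(tag_a, tag_b))


def risky_pairs(tags: dict, max_shared: int = 3) -> list:
    """Return (name_a, name_b, shared) for tag pairs sharing more than max_shared sites."""
    flagged = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(tags.items(), 2):
        shared = shared_sites(seq_a, seq_b)
        if shared > max_shared:
            flagged.append((name_a, name_b, shared))
    return flagged


# Hypothetical 5 bp inline tags (assumption, not taken from the study).
example_tags = {"tagA": "ACGTA", "tagB": "ACGTC", "tagC": "TGCAT"}
print(risky_pairs(example_tags))
```

In this toy set, tagA and tagB agree at 4 of 5 sites, so a single read error could turn one into the other; tagC shares no site with either and would be a safe partner.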
Fig. S5: Overview of obtained products after one-step PCR and magnetic bead clean-up
Fragment concentrations were measured using the Qubit HS kit.
Fig. S6: Overview of “missing” base pairs at the primer 3' end for sequencing datasets from this study as well as Elbrecht & Leese 2015 and Elbrecht et al. 2016
After library demultiplexing, a random subset of 5,000 reads was extracted from each sample and the sequences were aligned in Geneious 8 using MAFFT. The primer sequence + 10 bp (5 bp for 16S) was extracted from each alignment, and the mean deviation from the expected primer length was plotted for each sample. The proportion of sequences with the expected length is given for each sample on the right of each plot. The error bars show the standard deviation (N=10).
Fig. S7: Length distribution and abundance of individual sequences assigned to each OTU for all primer combinations
The percentage value indicates the proportion of sequences which are at least 10 bp longer or shorter than the expected amplicon length.
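As a minimal sketch of how such a percentage could be computed, the function below counts sequences whose length differs from the expected amplicon length by at least 10 bp. The function name, the example lengths and the expected length of 421 bp are assumptions made for illustration only.

```python
def off_length_percentage(lengths, expected, tolerance=10):
    """Percentage of sequences at least `tolerance` bp longer or shorter than expected."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    off = sum(abs(length - expected) >= tolerance for length in lengths)
    return 100.0 * off / len(lengths)


# Hypothetical sequence lengths for one OTU (assumption, not study data).
example_lengths = [421, 421, 420, 409, 433, 421, 300, 421]
print(off_length_percentage(example_lengths, expected=421))
```

Here three of the eight example sequences (409, 433 and 300 bp) deviate by 10 bp or more, so the function reports 37.5.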
Fig. S8: Evaluation of primer combinations
Preliminary data: error penalties are subject to change, and the kind of mismatch is not yet taken into account. Primer pairs with a combined penalty score below 100 were considered working (= green), while pairs above that were considered to show no or only poor amplification (= red). If primer binding sites in the sequences contained terminal gaps, they were counted as missing data (= gray).
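The classification used in Fig. S8 can be sketched as follows. The threshold of 100 and the treatment of terminal gaps as missing data come from the caption; the scoring itself (a fixed cost per mismatch, scaled up towards the primer 3' end) is an assumption made for illustration and is not PrimerMiner's actual penalty scheme. Primer and binding site are both written 5' to 3' on the primer strand, and the short sequences are hypothetical.

```python
# IUPAC nucleotide ambiguity codes: the bases each symbol stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_penalty(primer, site, weight=30.0):
    """Assumed scoring: each mismatch costs `weight`, scaled by its position
    so that mismatches near the 3' end count most. Returns None when the
    binding site has a terminal gap (treated as missing data)."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    if site.startswith("-") or site.endswith("-"):
        return None
    n = len(primer)
    penalty = 0.0
    for i, (p, s) in enumerate(zip(primer.upper(), site.upper())):
        # Any site symbol other than a base allowed by the primer is a mismatch.
        if s not in IUPAC[p]:
            penalty += weight * (i + 1) / n
    return penalty


def classify_pair(fwd_penalty, rev_penalty, threshold=100.0):
    """Label a primer pair from the combined penalty of both primers."""
    if fwd_penalty is None or rev_penalty is None:
        return "missing"
    return "working" if fwd_penalty + rev_penalty < threshold else "poor"


# Hypothetical 5 bp primer with one degenerate position (H = A, C or T).
print(primer_penalty("GCHCC", "GCACC"))  # A is covered by H
print(primer_penalty("GCHCC", "GCGCC"))  # G is not covered by H
print(classify_pair(primer_penalty("GCHCC", "GCGCC"), 0.0))
```

A degenerate position matches any base in its set, which is why higher degeneracy tends to lower the penalty across diverse templates.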
Tab. S1: Primers evaluated in silico
Metabarcoding and barcoding primers from the literature.
Tab. S2: 15 taxa (and subgroups, e.g. excluding terrestrial Coleoptera) downloaded for development of the BF / BR primers with PrimerMiner
Tab. S3: OTU tables and taxonomic assignments
These tables were used to generate Figure 3
Scripts S1: Metabarcoding pipeline scripts
Scripts to analyse the raw sequence data (includes scripts for making figures etc.; requires Usearch and/or Vsearch)
Scripts S2: Scripts to extract haplotype data and haplotype sequences
For each replicate, 52 different taxa were extracted in bulk. Haplotypes were extracted for each of the 10 samples and assembled: see file 160609_Haplotypes.fasta in this folder.
Manuscript file as a Word document
Feel free to use this file with track changes if you would like to provide feedback on this manuscript. We appreciate your feedback and support! = )