Kelpie: generating full-length ‘amplicons’ from whole-metagenome datasets
- Subject Areas
- Bioinformatics, Ecology, Genomics, Microbiology
- Keywords
- metagenomes, amplicons, targeted assembly, community structure, in-silico PCR
- Copyright
- © 2018 Greenfield et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2018. Kelpie: generating full-length ‘amplicons’ from whole-metagenome datasets. PeerJ Preprints 6:e27376v1 https://doi.org/10.7287/peerj.preprints.27376v1
Abstract
Introduction. Whole-metagenome sequencing can be a rich source of information about the structure and function of entire microbial communities, but getting accurate and reliable results from these datasets can be challenging. Analysis of these datasets is founded on the mapping of sequencing reads onto known genomic regions from known organisms, but short reads will often map equally well to multiple regions, and to multiple reference organisms. Assembling metagenomic datasets prior to mapping can generate much longer and more precisely mappable sequences, but the presence of closely related organisms and highly conserved regions makes metagenomic assembly difficult, and some regions of particular interest can assemble poorly. One solution to these problems is to use specialised tools, such as Kelpie, that can accurately extract and assemble full-length sequences for defined genomic regions from whole-metagenome datasets.
Methods. Kelpie is a kMer-based tool that generates full-length amplicon-like sequences from whole-metagenome datasets. It takes a pair of primer sequences and a set of metagenomic reads, and uses a combination of kMer filtering, error correction and assembly techniques to construct sets of full-length inter-primer sequences.
Results. The effectiveness of Kelpie is demonstrated here through the extraction and assembly of full-length ribosomal marker gene regions, as this allows comparisons with conventional amplicon sequencing and published metagenomic benchmarks. The results show that the Kelpie-generated sequences and community profiles closely match those produced by amplicon sequencing, down to low abundance levels, and running Kelpie on the synthetic CAMI metagenomic benchmarking datasets shows similarly high levels of both precision and recall.
Conclusions. Kelpie can be thought of as being somewhat like an in-silico PCR tool, taking a primer pair and producing the resulting ‘amplicons’ from a whole-metagenome dataset. Marker regions from the 16S rRNA gene were used here as an example because this allowed the overall accuracy of Kelpie to be evaluated through comparisons with other datasets, approaches and benchmarks. Kelpie is not limited to this application though, and can be used to extract and assemble any genomic region present in a whole-metagenome dataset, as long as it is bounded by a pair of highly conserved primer sequences.
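The in-silico PCR analogy in the abstract can be illustrated with a minimal sketch. This is not Kelpie's actual algorithm: it omits read filtering and error correction, and simply walks a kMer table from the forward primer until the reverse complement of the reverse primer is reached. The function names, primers, insert sequence and parameter values (k, min_count, max_len) are all hypothetical and chosen only for this toy example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def count_kmers(reads, k):
    """Count kMers from both strands of every read."""
    counts = Counter()
    for read in reads:
        for strand in (read, revcomp(read)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    return counts

def in_silico_amplicons(reads, fwd_primer, rev_primer, k=8, min_count=1, max_len=100):
    """Hypothetical helper: return every kMer path that starts with
    fwd_primer and ends with the reverse complement of rev_primer.
    Assumes len(fwd_primer) >= k - 1 and error-free reads."""
    counts = count_kmers(reads, k)
    end = revcomp(rev_primer)
    found = set()
    stack = [fwd_primer]            # depth-first search over extensions
    while stack:
        seq = stack.pop()
        if seq.endswith(end):       # reached the far primer: one 'amplicon'
            found.add(seq)
            continue
        if len(seq) >= max_len:     # give up on over-long paths
            continue
        suffix = seq[-(k - 1):]
        for base in "ACGT":
            if counts[suffix + base] >= min_count:
                stack.append(seq + base)
    return sorted(found)

# Toy data (all sequences invented for illustration).
fwd_primer = "ACGTTGCAAGCT"
rev_primer = "TTGACCGGATCA"
insert = "GGATTCCATAGGCTTACAGTCCGA"
amplicon = fwd_primer + insert + revcomp(rev_primer)
template = "TTAGC" + amplicon + "CGTAA"

# Simulate error-free 20-base reads tiled along the template,
# with every second read taken from the opposite strand.
read_len = 20
reads = [template[i:i + read_len] for i in range(len(template) - read_len + 1)]
reads = [r if i % 2 == 0 else revcomp(r) for i, r in enumerate(reads)]

found = in_silico_amplicons(reads, fwd_primer, rev_primer)
print(amplicon in found)
```

The search keeps every path that links the two primers rather than greedily choosing one, so closely related variants sharing a primer pair would each be reported. With real data, sequencing errors would add spurious branches, which is presumably why a tool of this kind needs the filtering and error-correction steps that the sketch leaves out.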
Author Comment
This is a submission to PeerJ for review.
Supplemental Information
Comparisons between EBI and Kelpie community profiles
Excel spreadsheet containing EBI-generated profile for ERP008951, Kelpie profiles generated from the EBI-filtered '16S' reads, and comparisons at Class and Order levels.
Full CSM OTU table produced from both amplicon and Kelpie-generated sequences
Full OTU table produced from both amplicon and Kelpie-generated sequences for all three coal seam microbiome samples, ordered by total abundance. Red amplicon counts indicate that mapping the WGS reads for the sample back to the consensus sequence for the OTU showed that it had less than 90% kMer coverage. The green amplicon counts show where the WGS reads for a sample gave 100% coverage of the OTU sequence. This table is derived from the ‘AE’ tabs in the Excel spreadsheet ‘Kelpie - CSM.xlsx’, which is available as Supplemental Table S6.
Full CSM OTU table produced from amplicon sequences, Kelpie sequences and primer-bounded sequences extracted from full metaSPAdes assemblies and assemblies of the filtered 16S region reads
Full OTU table produced from amplicon, Kelpie-generated sequences and primer-bounded sequences extracted from full metaSPAdes assemblies for each WGS sample file, and from assemblies of the filtered 16S region reads, for all three coal seam microbiome samples, ordered by total abundance. Red amplicon counts indicate that mapping the WGS reads for the sample back to the consensus sequence for the OTU showed that it had less than 90% kMer coverage. The green amplicon counts show where the WGS reads for a sample gave 100% coverage of the OTU sequence. This table is derived from the ‘AESS’ tabs in the Excel spreadsheet ‘Kelpie - CSM.xlsx’, which is available as Supplemental Table S6.
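The red and green colour coding in the two OTU tables above depends on the kMer coverage of each OTU consensus sequence by the WGS reads. The source does not state exactly how this coverage was calculated, so the sketch below uses one plausible definition as an assumption: the fraction of kMers in the consensus sequence that occur, on either strand, somewhere in the reads. The function name, the value of k and the toy sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq, k):
    """All kMers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_coverage(reference, reads, k):
    """Assumed definition: fraction of reference kMers that are present
    in the reads on either strand."""
    seen = set()
    for read in reads:
        seen.update(kmers(read, k))
        seen.update(kmers(revcomp(read), k))
    ref_kmers = kmers(reference, k)
    covered = sum(1 for kmer in ref_kmers if kmer in seen)
    return covered / len(ref_kmers)

# Toy consensus sequence with nine 4-mers; the reads miss one of them (CTTA).
reference = "ATGGCCTTAGCA"
reads = ["ATGGCCTT", revcomp("TTAGCA")]

coverage = kmer_coverage(reference, reads, 4)
print(f"{coverage:.3f}")
```

Here eight of the nine reference kMers are found, so the coverage falls just below a 90% threshold; adding a read that spans the missing junction (for example "CCTTAG") would bring it to 100%.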
Organisms in CAMI Low Complexity dataset, with presence in assembled contigs and Kelpie extended amplicons
The #strain column is the stated number of strains present in the WGS reads, and the Abundance column is the total abundance for the specified organism (including all its strains). These tables are derived from the Excel spreadsheet ‘Kelpie - CAMI Low.xlsx’ which is available as Supplemental Table S7. Part (a) includes all organisms, and Part (b) excludes those organisms that were found not to have a 16S V4 region in their assembled contigs.
Organisms in CAMI Medium Complexity dataset, with presence in assembled contigs and Kelpie extended amplicons
The #strain column is the stated number of strains present in the WGS reads, and the Abundance column is the total abundance for the specified organism (including all its strains). These tables are derived from the Excel spreadsheet ‘Kelpie - CAMI Medium.xlsx’ which is available as Supplemental Table S8. Part (a) includes all organisms, and Part (b) excludes those organisms that were found not to have a 16S V4 region in their assembled contigs.
Kelpie - CSM
Source spreadsheet for the CSM-derived tables and figures.
Kelpie - CAMI Low
Source spreadsheet for the CAMI Low-derived tables and figures.
Kelpie - CAMI Medium
Source spreadsheet for the CAMI Medium-derived tables and figures.