Long-read viral metagenomics captures abundant and microdiverse viral populations and their niche-defining genomic islands

Bioinformatics and Genomics

Main article text

 

Introduction

Materials & Methods

Construction of the mock viral community

Construction of the Western English Channel viral metagenome

Library preparation, amplification and sequencing

Maximizing the benefits of long read and short read assemblies

Validation of error correction of long reads in viral metagenomic data

Analysis of Tig404—a contig closely related to Pelagiphage HTVC010P

Estimating relative abundance and viral clusters of WEC viruses in viral metagenomes

Results & Discussion

Assembly of VirION reads successfully captured mock viral community genomes and retained relative abundance information

Combining VirION reads with short read data improves viral metagenomic assembly in an environmental virome

Assembly of VirION reads captures important, microdiverse populations previously missed by short-read data

Tig404—an example of how VirION reads improve viral metagenomics

Conclusions

Supplemental Information

Supplementary Methods

Describes the quality control of sequence data and the methods used to evaluate (1) relative abundances of contigs; (2) whether long read assembly captured more microdiverse genomes; (3) recovery of genomic islands and their predicted functional composition at the population level; and (4) functional variance of viral genomic islands spanned by VirION reads.

DOI: 10.7717/peerj.6800/supp-1

Mock viral community member characteristics

Genomic characteristics of the six phages chosen for the mock viral community to develop and evaluate VirION protocols.

DOI: 10.7717/peerj.6800/supp-2

The numbers of phage genomes identified in this study using short, hybrid and error-corrected long read assembly of VirION reads, as identified by VirSorter (Roux et al., 2015)

For comparison, important viral metagenomic studies (see references) and viruses from ‘RefSeq’ are included. Prior to quantification of global relative abundances and (shared-protein) clustering, phage genomes were re-analysed using VirSorter to ensure uniformity of gene-calling, resulting in the classifications above. Note: VirSorter categories are as follows: 1 and 4: “most confident” predictions (viral and lysogen, respectively); 2 and 5: “likely” predictions (viral and lysogen, respectively).

DOI: 10.7717/peerj.6800/supp-3

Student's t-test results to identify significant differences between the number of circular viral contigs from short read only vs. hybrid assemblies

Student's t-test results to identify significant differences between the number of circular viral contigs (as identified by VirSorter; Roux et al., 2015) from short read only vs. hybrid assemblies with VirION reads, using metaSPAdes assemblies from triplicate random subsamples of short reads across different levels of sequencing depth. Significant differences are highlighted in bold.

DOI: 10.7717/peerj.6800/supp-4

Student's t-test results to identify significant differences between the number of viral contigs from short read only vs. hybrid assemblies with VirION reads

Student's t-test results to identify significant differences between the number of viral contigs (as identified by VirSorter; Roux et al., 2015) from short read only vs. hybrid assemblies with VirION reads, using metaSPAdes assemblies from triplicate random subsamples of short reads across different levels of sequencing coverage. Significant differences are highlighted in bold.

DOI: 10.7717/peerj.6800/supp-5
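The comparison described in the two t-test tables above can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: it assumes an equal-variance, two-tailed Student's t-test on triplicate counts at one sequencing depth, and the counts (`short_only`, `hybrid`) are hypothetical placeholders, not values from the study. The critical value 2.776 is the two-tailed 5% threshold for 4 degrees of freedom.

```python
import math
import statistics


def students_t(group_a, group_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom.

    Assumes both groups have at least two values and non-zero pooled variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(group_a) - statistics.mean(group_b)) / std_err
    return t_stat, df


# Hypothetical viral contig counts from triplicate subsamples at one depth.
short_only = [10, 12, 11]
hybrid = [18, 20, 19]

t_stat, df = students_t(hybrid, short_only)

# Two-tailed critical value of t at alpha = 0.05 for df = 4.
T_CRITICAL_DF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DF4
```

With three replicates per assembly type there are only 4 degrees of freedom, so a difference has to be large relative to the replicate-to-replicate spread before it is flagged as significant.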

Predicted genes located within 66 genomic islands spanned by VirION reads

For each spanning read, putative start and stop codons were estimated by hierarchical clustering and used as queries in a BLASTx alignment against the NR database. Genes with unknown function were removed and the remaining putatively classified genes were used to assess functional variance within viral genomic islands.

DOI: 10.7717/peerj.6800/supp-6

Fragment length of LASL-amplified VirION reads before and after sequencing

(A) Bioanalyzer (Agilent) electropherogram showing the fragment length distribution of linker-amplified mock viral community DNA produced from 20 ng template DNA sheared to ∼8 kbp. Amplicon length peaked at ∼5.4 kbp, demonstrating PCR preference for amplification of shorter DNA fragments; (B) Read length distribution of VirION mock viral community amplicons (as shown in ‘A’; red dashed lines indicate approximate length of sheared template DNA); mean read length was ∼4 kbp, likely due to preferential sequencing of shorter DNA fragments.

DOI: 10.7717/peerj.6800/supp-7

Evaluation of error correction of long-read assemblies using short read data

Error correction of overlap layout consensus-derived contigs with Pilon, using short read data, shows that the approximate limit on the number of insertions and deletions that can be fixed is reached at ∼9 Gbp of short read data (median coverage of ∼70). Analysis was performed against the full contig set from the overlap layout consensus assembly of VirION reads from the Western English Channel (n = 1,500).

DOI: 10.7717/peerj.6800/supp-8

Difference and 95% CI of median predicted protein length of different assembly types to evaluate the impact of sequencing error and error correction of VirION reads with short-read data

The median predicted protein length of 1,000 randomly selected proteins was calculated and compared to a similar treatment of proteins from a RefSeq v.8.4 Caudovirales database to measure effect size. This process was bootstrapped 1,000 times to provide 95% confidence intervals. The distributions on the graph represent distributions of differences in medians (Cumming, 2014). The median effect size (bold number) and the 95% CI boundaries (black line under each distribution, and numbers in brackets) are shown.

DOI: 10.7717/peerj.6800/supp-9
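The bootstrapped difference in medians described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name and parameters are hypothetical, proteins are assumed to be drawn with replacement, and the 95% bounds are taken as simple percentiles of the bootstrap distribution.

```python
import random
import statistics


def bootstrap_median_difference(sample_lengths, reference_lengths,
                                n_draw=1000, n_boot=1000, seed=0):
    """Bootstrap the difference in median protein length (sample - reference).

    Each iteration draws n_draw lengths at random (with replacement, an
    assumption) from each set, takes the two medians, and records their
    difference. Returns (median difference, lower 95% bound, upper 95% bound).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample = rng.choices(sample_lengths, k=n_draw)
        reference = rng.choices(reference_lengths, k=n_draw)
        diffs.append(statistics.median(sample) - statistics.median(reference))
    diffs.sort()
    # Percentile bounds: the 2.5th and 97.5th percentiles of the differences.
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return statistics.median(diffs), lower, upper
```

A negative median difference would indicate that predicted proteins from an assembly are shorter than the reference proteins, which is the pattern expected when uncorrected indels truncate open reading frames. An interval that excludes zero suggests the difference is unlikely to be a sampling artefact.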

Statistical significance of effects of assembly type on genomic island length and density

Effect size and bootstrapped 95% CIs of the median for the impact of different assembly types on (A) genomic island length and (B) genomic island density (kbp of genomic island per kbp of genome). Values in boxes represent the median difference between 1,000 bootstrapped medians (95% CI). Green boxes represent significant (p < 0.05) differences calculated with a Wilcoxon Rank Sum test.

DOI: 10.7717/peerj.6800/supp-10

Evaluation of genome-wide nucleotide diversity

The data point for long-read assembled contig tig404 (described in the main text) is highlighted; this virus belongs in the same viral cluster as pelagiphage HTVC010P, an abundant phage that fails to assemble in metagenomic datasets, potentially due to high microdiversity.

DOI: 10.7717/peerj.6800/supp-11

Alignment of the genome of HTVC010P with tig404 assembled using the VirION pipeline

Genomes were 89% identical at the nucleotide level in shared regions and both shared a conserved genomic island (green) bounded by structural proteins. Genome alignments were produced by Mauve (Darling et al., 2004) within the Geneious software (Kearse et al., 2012).

DOI: 10.7717/peerj.6800/supp-12

Top 50 most abundant viral contigs in a Western English Channel virome

Estimated relative abundances (number of recruited reads from the short-read dataset) of the Western English Channel viral contigs were calculated by competitive recruitment of short reads back to viral contigs derived from the VirION bioinformatics pipeline using FastViromeExplorer (Tithi et al., 2018). Sixty percent of the top 50 most abundant viruses are detected only in the error-corrected overlap layout consensus assemblies.

DOI: 10.7717/peerj.6800/supp-13

The longest complete viral genome from our study was 316 kbp in length

H_NODE_6 was the longest recovered virus captured by scaffolding of a De Bruijn Graph assembly using VirION reads (red). Alignment of short read only contigs (blue) against the complete genome shows that the full length is only captured by the scaffolding approach, whereas the short-read approach results in a breakage at ∼205 kbp (grey box). Coverage and Shannon Entropy are both shown as median values of a 200 bp sliding window, with 100 bp overlap.

DOI: 10.7717/peerj.6800/supp-14
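The sliding-window summary used for coverage and Shannon entropy in the figure above (200 bp windows with 100 bp overlap) can be sketched as below. This is a minimal sketch, not the authors' implementation: it assumes per-position entropy is computed from base counts in a read pileup, and the function names are hypothetical.

```python
import math
import statistics


def shannon_entropy(base_counts):
    """Shannon entropy (bits) of the base frequencies at one position."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in base_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def sliding_window_medians(values, window=200, overlap=100):
    """Median of per-position values in each window.

    Consecutive windows share `overlap` positions. Returns a list of
    (window start, median) pairs. A trailing partial window is dropped,
    except when the input is shorter than one window.
    """
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window")
    if not values:
        return []
    medians = []
    for start in range(0, max(len(values) - window, 0) + 1, step):
        medians.append((start, statistics.median(values[start:start + window])))
    return medians
```

Using the median rather than the mean within each window keeps a few extreme positions, such as a single high-coverage repeat, from dominating the plotted value.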

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Joanna Warwick-Dugdale performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

Natalie Solonenko performed the experiments.

Karen Moore performed the experiments, contributed reagents/materials/analysis tools.

Lauren Chittick performed the experiments.

Ann C. Gregory analyzed the data.

Michael J. Allen authored or reviewed drafts of the paper, approved the final draft.

Matthew B. Sullivan contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.

Ben Temperton conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

DNA Deposition

The following information was supplied regarding the deposition of DNA sequences:

Sequencing data and assemblies are available at the European Nucleotide Archive under the project accession number PRJEB27181.

Data Availability

The following information was supplied regarding data availability:

All code and analyses can be found in a GitHub repository: https://github.com/btemperton/long_read_viromics.

Funding

Major support was provided by a fellowship to Ben Temperton from the Bermuda Institute of Ocean Sciences as part of the BIOS-SCOPE program; the Royal Society and the Natural Environment Research Council (NERC) (NE/P008534/1 and NE/R010935/1 to Ben Temperton). Additional support was from a NERC Great Western Four+ (GW4+) Doctoral Training Partnership PhD to Joanna Warwick-Dugdale (NE/L002434/1) and the Gordon and Betty Moore Foundation (awards #3790 and 5488) to Matthew B. Sullivan. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

