Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products

UC Davis Genome Center, University of California, Davis, Davis, California, United States
Department of Molecular and Cellular Biology, University of California, Davis, Davis, California, United States
Department of Plant Sciences, University of California, Davis, Davis, California, United States
Department of Medical Microbiology and Immunology, University of California, Davis, Davis, California, United States
Department of Evolution and Ecology, University of California, Davis, Davis, California, United States
ithree institute, University of Technology Sydney, Sydney, NSW, Australia
DOI
10.7287/peerj.preprints.260v1
Subject Areas
Bioengineering, Bioinformatics, Computational Biology, Genomics, Microbiology
Keywords
Hi-C, microbial ecology, metagenomics, plasmids, synthetic microbial communities, Markov Clustering, metagenome assembly, strain differentiation, haplotype phasing, genome scaffolding
Copyright
© 2014 Beitel et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA, Darling AE. 2014. Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ PrePrints 2:e260v1

Abstract

Metagenomics is a valuable tool for the study of microbial communities but has been limited by the difficulty of “binning” the resulting sequences into groups corresponding to the individual species and strains that constitute the community. Moreover, there are presently no methods to track the flow of mobile DNA elements such as plasmids through communities or to determine which of these are co-localized within the same cell. We address these limitations by applying Hi-C, a technology originally designed for the study of three-dimensional genome structure in eukaryotes, to measure the cellular co-localization of DNA sequences. We leveraged Hi-C data generated from a synthetic metagenome sample to accurately cluster metagenome assembly contigs into groups that contain nearly complete genomes of each species. The Hi-C data also reliably associated plasmids with the chromosomes of their host and with each other. We further demonstrated that Hi-C data provide a long-range signal of strain-specific genotypes, indicating that such data may be useful for high-resolution genotyping of microbial populations. Our work demonstrates that Hi-C sequencing data provide valuable information for metagenome analyses that is not currently obtainable by other methods. This metagenomic Hi-C method could facilitate future studies of the fine-scale population structure of microbes, as well as studies of how antibiotic resistance plasmids (or other genetic elements) mobilize in microbial communities. The method is not limited to microbiology; the genetic architecture of other heterogeneous populations of cells could also be studied with this technique.

Supplemental Information

Figure 1. Hi-C insert distribution.

The distribution of genomic distances between Hi-C read pairs is shown for read pairs mapping to each chromosome. For each read pair the minimum path length on the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded. The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and plotted.
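The binning procedure in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input format (a list of mapped position pairs), and the choice to drop pairs separated by 2.5 Mb or more are hypothetical and not taken from the authors' code.

```python
def circular_distance(pos_a, pos_b, chrom_len):
    """Minimum path length between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % chrom_len
    return min(d, chrom_len - d)


def insert_distribution(pairs, chrom_len, min_sep=1000,
                        max_range=2_500_000, n_bins=100):
    """Bin circular separations of read pairs and normalize bins to sum to 1.

    pairs: iterable of (pos_a, pos_b) mapped positions on one chromosome.
    Pairs closer than min_sep are discarded, as in the caption. Pairs at or
    beyond max_range are also dropped (an assumption made for this sketch).
    """
    counts = [0] * n_bins
    bin_width = max_range / n_bins
    for pos_a, pos_b in pairs:
        d = circular_distance(pos_a, pos_b, chrom_len)
        if d < min_sep or d >= max_range:
            continue
        counts[int(d // bin_width)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]
```

Normalizing each chromosome separately is what makes the curves comparable across replicons with very different read counts.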

DOI: 10.7287/peerj.preprints.260v1/supp-1

Figure 2. Metagenomic Hi-C associations.

The log-scaled, normalized number of Hi-C read pairs associating each genomic replicon in the synthetic community is shown as a heat map (see color scale, blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.

DOI: 10.7287/peerj.preprints.260v1/supp-2

Figure 3. Contigs associated by Hi-C reads.

A graph is drawn with nodes depicting contigs and edges depicting associations between contigs, as indicated by aligned Hi-C read pairs; edge weight reflects the number of such read pairs. Nodes are colored to reflect the species to which they belong (see legend), and node size reflects contig size. Contigs shorter than 5 kb and edges with weights less than 5 were excluded. Contig associations were normalized for variation in contig size.
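The two exclusion thresholds in this caption amount to a simple graph filter, sketched below. The function name and the dictionary-based inputs are hypothetical. The caption does not say how the size normalization was computed or whether it precedes the filtering, so that step is left out here.

```python
def filter_contig_graph(contig_lengths, edge_weights,
                        min_len=5000, min_weight=5):
    """Keep only edges between sufficiently long contigs with enough support.

    contig_lengths: dict mapping contig name -> length in bp.
    edge_weights: dict mapping (contig_a, contig_b) -> Hi-C read pair count.
    Thresholds follow the caption (5 kb, weight 5); the data layout is an
    assumption made for this sketch.
    """
    kept = {name for name, length in contig_lengths.items()
            if length >= min_len}
    return {
        (a, b): w
        for (a, b), w in edge_weights.items()
        if a in kept and b in kept and w >= min_weight
    }
```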

DOI: 10.7287/peerj.preprints.260v1/supp-3

Figure 4. Hi-C contact maps for replicons of Lactobacillus brevis.

Contact maps show the number of Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, a, Spearman rank correlation) and plasmids (Lac1, b; Lac2, c) show enrichment for local associations (bright diagonal band). Interactions between Lac1 and Lac0 (d) and between Lac2 and Lac0 (e) are shown. All except Lac0 are log-scaled. Circularity of Lac0, indicated by the high number of contacts between the ends of the sequence, became apparent after transforming the data with the Spearman rank correlation (computed for each matrix element between the row and column sharing that element) in place of the log transformation (a). In all plots, pixels are sized to represent interactions between blocks sized at 1% of the interacting genomes. The number of HindIII restriction sites in each region of sequence is shown as a histogram on the left and top of each panel.
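The Spearman transformation described in this caption replaces each matrix element with the rank correlation between its row and its column. A minimal pure-Python sketch follows; the function names are hypothetical, ties receive average ranks, and a constant row or column is given a correlation of 0 by convention (both choices are assumptions, not details from the paper).

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman_transform(matrix):
    """Element (i, j) becomes the Spearman correlation of row i and column j."""
    n = len(matrix)
    row_ranks = [average_ranks(row) for row in matrix]
    col_ranks = [average_ranks([matrix[r][c] for r in range(n)])
                 for c in range(n)]
    return [[pearson(row_ranks[i], col_ranks[j]) for j in range(n)]
            for i in range(n)]
```

Because the transform compares whole contact profiles rather than single counts, two bins at opposite ends of a linearized circular sequence end up highly correlated, which is the circularity signal the caption refers to.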

DOI: 10.7287/peerj.preprints.260v1/supp-4

Figure 5. Relationship of distance to degree of separation in Hi-C and mate-pair variant graphs.

Path lengths between random pairs of SNP sites are shown for SNP graphs constructed from both Hi-C and mate-pair libraries of varying sizes (left; 5 kb, 10 kb, 20 kb, 40 kb), smoothed using locally weighted regression.

DOI: 10.7287/peerj.preprints.260v1/supp-5

Supplementary Figure 1. Illustration of the signal provided by Hi-C for metagenome binning

Two bacterial cells are illustrated, each containing a single circular chromosome. For one genomic region in each of the two species, examples of associations that are likely (green) or not likely (red) to be derived from Hi-C are illustrated.

DOI: 10.7287/peerj.preprints.260v1/supp-6

Supplementary Figure 2. Visualization of the impact of parameter choice on the quality of clustering solutions.

A small-multiples plot shows 5x5 combinations of contact minimum (top to bottom; 0, 3, 5, 7, 9) and contig size minimum (left to right; 1,000, 8,000, 15,000, 22,000, 29,000) thresholds. For each parameter combination, line plots show the quality (y-axis) of clustering solutions obtained for inflation values in the interval [1,2]. The quality of clustering solutions is measured in terms of their true-positive rate (red), false-positive rate (green), positive predictive value (blue), and negative predictive value (black).
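The four quality measures named in this caption can be computed from a confusion table over contig pairs. The sketch below assumes a pairwise definition (a pair is a true positive when both contigs come from the same species and were placed in the same cluster). That definition, the function name, and the input format are assumptions for illustration; the caption itself does not spell them out.

```python
from itertools import combinations


def pairwise_quality(truth, clusters):
    """Pairwise TPR, FPR, PPV and NPV of a clustering against known labels.

    truth: dict mapping contig -> true species label.
    clusters: dict mapping contig -> assigned cluster id (same keys as truth).
    """
    tp = fp = tn = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_species = truth[a] == truth[b]
        same_cluster = clusters[a] == clusters[b]
        if same_species and same_cluster:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_species:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined rates (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "tpr": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```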

DOI: 10.7287/peerj.preprints.260v1/supp-7

Supplementary Figure 3. Hi-C contact frequency within L. brevis genome.

Contact frequency is visualized as a heat map, after normalization and application of the Spearman rank correlation (matrix elements are the Spearman correlation of the row and column of which they are the intersection). Circularity is apparent in the elevated contact between either end of the reference assembly sequence.

DOI: 10.7287/peerj.preprints.260v1/supp-8

Supplementary Figure 4. Hi-C contact map for Lactobacillus brevis plasmid 1.

Contact maps show the number of Hi-C read pairs associating each region of the L. brevis plasmid 1. Contact values are Spearman rank correlation transformed following normalization. Pixels are sized to represent interactions between blocks sized at 1% of the interacting sequence. A minimal signal of circularity is apparent with enrichment for contact between the minimum and maximum positions within the reference assembly.

DOI: 10.7287/peerj.preprints.260v1/supp-9

Supplementary Figure 5. Hi-C contact map for Lactobacillus brevis plasmid 2.

Contact maps show the number of Hi-C read pairs associating each region of the L. brevis plasmid 2. Contact values are Spearman rank correlation transformed following normalization. Pixels are sized to represent interactions between blocks sized at 1% of the interacting sequence. A signal indicative of circularity is not apparent.

DOI: 10.7287/peerj.preprints.260v1/supp-10

Supplementary Figure 6. Hi-C contact frequency within P. pentosaceus genome.

Contact frequency is visualized as a heat map, after normalization and application of the Spearman rank correlation (matrix elements are the Spearman correlation of the row and column of which they are the intersection). Circularity is apparent in the elevated contact between either end of the reference assembly sequence.

DOI: 10.7287/peerj.preprints.260v1/supp-11

Supplementary Figure 7. Variant graph illustration.

Two examples of variant graphs (non-data illustration). Variant nodes (circles) are linked by edges (light grey lines) derived from read pair data with small and medium (Graph 1) or small, medium, and large (Graph 2) inserts. A path between two nodes (start, end) is illustrated; this path is shorter in the graph representing the dataset that includes larger-insert reads.
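The path-shortening effect in this illustration can be reproduced with a breadth-first search over a toy variant graph. The sketch below is not data from the study; the function names and the example links are hypothetical.

```python
from collections import defaultdict, deque


def build_variant_graph(links):
    """Undirected graph whose nodes are variant sites joined by read pairs."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degrees_of_separation(graph, start, end):
    """Number of edges on the shortest path, or None if unreachable."""
    if start == end:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == end:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Toy example: short-insert links only join adjacent variants, while a
# single long-insert link spans several of them.
short_links = [(1, 2), (2, 3), (3, 4), (4, 5)]
long_links = [(1, 4)]
short_only = build_variant_graph(short_links)
with_long = build_variant_graph(short_links + long_links)
```

In this toy case the path from variant 1 to variant 5 uses four edges with short links alone and two edges once the long link is added.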

DOI: 10.7287/peerj.preprints.260v1/supp-12

Table 1. Species alignment fractions.

The number of reads aligning to each replicon present in the synthetic microbial community is shown before and after filtering, along with the percent of the total constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon, species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.

DOI: 10.7287/peerj.preprints.260v1/supp-13

Table 2. Markov clustering of metagenome assembly contigs using Hi-C data.

A range of inflation parameters was applied, and the precision and recall for the resulting clusters were calculated as described in the text. An inflation parameter of 1.1 produced a near-perfect clustering of contigs by species.
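To show what the inflation parameter controls, here is a minimal textbook-style sketch of Markov clustering on a small weighted adjacency matrix. It is not the software used in the study: the function names, the added self-loops, the fixed iteration count, and the way clusters are read from the final matrix are all simplifying assumptions.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] for c in range(n)] for r in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, inflation):
    """Inflation step: elementwise power, then column renormalization."""
    return normalize_columns([[v ** inflation for v in row] for row in m])


def markov_cluster(adjacency, inflation=1.1, iterations=50):
    """Cluster nodes of a non-negative weighted graph, returned as index lists.

    Self-loops of weight 1 are added before normalization (an assumption of
    this sketch). Higher inflation tends to give finer clusters.
    """
    n = len(adjacency)
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row with non-negligible mass lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

In a contig-binning setting the adjacency matrix would hold Hi-C link counts between contigs, and each returned cluster would be a candidate genome bin.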

DOI: 10.7287/peerj.preprints.260v1/supp-14

Table 3. Variant graph statistics.

Connectivity statistics are shown for variant graphs constructed from various simulated mate-pair (# kb, MP) and Hi-C read datasets. Graphs constructed from all Hi-C data are compared to those constructed using only Hi-C read pairs with inserts over 1 kb. The Hi-C variant graphs are highly connected, in contrast to the mate-pair graphs, which have both lower connectedness and lower rates of variants occurring in the same connected components.

DOI: 10.7287/peerj.preprints.260v1/supp-15

Supplementary Table 1. SOAPdenovo assembly results.

Statistics are shown for three assemblies, including the simulated coverage and the number of contigs (and scaffolds) present in the assembly. Assembly quality is reflected in the count of misassembled contigs and scaffolds (“contig error” and “scaffold error”). The percent of the total reference sequence size constituted by each assembly is also shown.

DOI: 10.7287/peerj.preprints.260v1/supp-16

Supplementary Table 2. Species alignment fractions (expanded table).

The number of reads aligning to each replicon present in the synthetic microbial community is shown before and after alignment filtering, along with the percent of the total constituted by each species. The GC content and restriction site (R.S.) counts of each replicon, species, and strain are shown. Total and fractional raw alignment counts adjusted by R.S. counts are also shown, constituting our best approximation of relative abundances of synthetic community members. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
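One plausible reading of "adjusted by R.S. counts" is to divide each replicon's raw alignment count by its number of restriction sites and then renormalize to fractions. The sketch below takes that reading as an assumption; the function name, the inputs, and the exact adjustment are hypothetical and may differ from what the authors computed.

```python
def site_adjusted_fractions(read_counts, site_counts):
    """Reads per restriction site for each replicon, rescaled to sum to 1.

    read_counts: dict mapping replicon -> raw aligned read count.
    site_counts: dict mapping replicon -> number of restriction sites (> 0).
    Dividing by site count is an assumed form of the adjustment.
    """
    per_site = {name: read_counts[name] / site_counts[name]
                for name in read_counts}
    total = sum(per_site.values())
    if total == 0:
        return {name: 0.0 for name in per_site}
    return {name: value / total for name, value in per_site.items()}
```

The motivation for an adjustment of this kind is that replicons with more restriction sites can yield more ligation products per genome copy, so raw counts alone would overstate their abundance.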

DOI: 10.7287/peerj.preprints.260v1/supp-17

Supplementary Table 3. Raw metagenomic Hi-C association counts.

The number of Hi-C read pairs associating each genomic replicon in the mock community is shown without normalization.

DOI: 10.7287/peerj.preprints.260v1/supp-18

Supplementary Table 4. Normalized association counts.

Shown are the counts of Hi-C read pairs associating each pair of replicons included in the synthetic community, normalized as described in the methods.

DOI: 10.7287/peerj.preprints.260v1/supp-19