PCR-amplified metagenomes are quantitative but include a significant amount of duplicated reads
A. Comparison of depth of coverage between unamplified (TruSeq, x-axis) and PCR-amplified (Nextera XT or Accel-NGS 1S Plus, y-axis) libraries. The average depth of coverage was computed for each contig as the average read depth normalized by the total size of the library. The 1:1 equivalence is indicated with a black line, while a linear best fit is shown in blue. For clarity, only 1,000 contigs randomly selected from each sample are plotted. Contigs with no reads mapped in the PCR-amplified library were not included. To be able to directly compare the two plots, only samples for which both a Nextera XT and 1S Plus libraries were available are included (Table S1). The subpanels show the correlation coefficient (Pearson and Spearman) of a sample-by-sample correlation between depth of coverage in unamplified and PCR-amplified libraries, either for all contigs or only for contigs ≥ 10kb with a depth of coverage ≥ 10x. B. Percentage of duplicated reads (y-axis) as a function of the number of PCR cycles performed during library creation (x-axis). Underlying data are availabe in Table S1.
Insert size and GC content distribution for all vs high-depth regions
A & B. Distribution of insert size for all regions (green) or only regions with high depth of coverage (orange) across PCR-amplified libraries. In panel A, all insert sizes were centered around 500bp to enable a more direct comparison between libraries. Panel B shows the same data without this transformation (i.e. raw insert size). C & D. Distribution of GC % for all regions (green) or only regions with high depth of coverage (orange). For panel C, each library GC% was centered around 50%, while panel D shows the same data without this transformation.
Assembly size and estimated error rates for different assembly pipelines
Comparison of the output of different assembly pipelines applied to PCR-amplified libraries. Panels A & B show the cumulative length of all contigs (A) or contigs ≥ 10kb (B) across assembly pipelines (x-axis). Panel C displays the cumulative length of contigs ≥ 10kb relative to the largest value for each library, i.e. as a percentage of the “best” assembly for this library (“best” being defined as the largest cumulative length of contigs ≥ 10kb). Panel D displays the distribution of estimated error rates across the different assembly pipelines, for the 25 libraries for which error rates could be estimated (Table S2 & S3). Norm.: Normalization, Dedup.: Deduplication, Meta: metaSPAdes, SC: single-cell SPAdes.
Description of samples and libraries analyzed
The first tab lists information about individual samples including the list of all libraries generated for each sample, and the second tab includes information about each library.
Samples including both unamplified and PCR-amplified libraries
List of the 25 PCR-amplified for which an unamplified dataset was available, alongside specific metrics that could be calculated using the unamplified dataset as reference, i.e. correlation of average depth of coverage of contigs, and percentage of contigs from the unamplified assembly detected in the PCR-amplified library. A contig was considered as detected if ≥ 1 read(s) from the PCR-amplified library mapped to it.
Results from the different assembly pipelines tested
The first tab lists the different steps and tools tested. The second tab includes the results of de novo genome assembly with the different pipelines for each PCR-amplified library. For the 25 PCR-amplified libraries for which an unamplified reference was available, this second tab also includes estimates of assembly errors for each assembly pipeline obtained with QUAST.