NOT PEER-REVIEWED
"PeerJ Preprints" is a venue for early communication or feedback before peer review. Data may be preliminary.
A newer version of this Preprint is available: View the latest version

Supplemental Information

Figure S1 : Run Times for Variant Calling Pipelines for SRP019719

Run times for Illumina exome samples from SRP019719 (n=15, Table S3). Run times are displayed for the entire variant calling pipeline when using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” - blue). Run times are provided for the GATK UnifiedGenotyper (“UG”), GATK HaplotypeCaller (“HC”), and VarScan using 3 sets of parameters (see Methods). “VarScan-Cons” is the most conservative set of parameters. Variant calling was performed with varying amounts of concurrent usage, which appeared to affect run times most for the GATK HaplotypeCaller when analyzing data that were subject to quality score recalibration without indel realignment. We think this is due to a set of quality scores that required non-default parameters, and this trend should not be assumed to apply to most other datasets (for example, this was not an issue for the 1000 Genomes exome samples).

DOI: 10.7287/peerj.preprints.403v2/supp-1

Figure S2 : Number of Variants Called for GATK versus VarScan (SRP019719 – all chromosomes)

Number of SNPs (left) and indels (right) called for Illumina exome samples from SRP019719 (n=15). The number of variants called is displayed for four types of pre-processing: using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” – blue). Variant counts are provided for the GATK UnifiedGenotyper (“UG”), GATK HaplotypeCaller (“HC”), and VarScan using 3 sets of parameters (see Methods). UnifiedGenotyper and HaplotypeCaller variants are then divided into the set of all variants (“UG-all” and “HC-all”) and higher-quality variant calls where variants flagged as low quality have been removed (“UG-HQ” and “HC-HQ”). “VarScan-Cons” is the most conservative set of parameters for VarScan.

DOI: 10.7287/peerj.preprints.403v2/supp-2

Figure S3 : Distribution of ANNOVAR Annotations for SNP Variants (SRP019719)

Distribution of variant types defined in the ANNOVAR exome report for Illumina exome samples from SRP019719 (n=15). Variants are classified based upon population frequency and damaging prediction (see Methods). Low frequency variants (MAF < 0.01) are displayed in orange if they are predicted to be damaging and in green if they are not predicted to be damaging. Unknown frequency variants are displayed in red if they are predicted to be damaging and in blue if they are not predicted to be damaging. Although all samples should contain some unknown frequency variants, a high proportion of unknown frequency variants is expected to correlate with a high false positive rate (for example, ideally, there would be fewer unknown frequency variants than low frequency variants). Seven variant calling strategies were tested (GATK UnifiedGenotyper and HaplotypeCaller, with and without filtering low quality variants; VarScan with 3 sets of parameters, see Methods). “VarScan-Cons” is the most conservative set of parameters for VarScan. Each variant caller was also tested with 4 preprocessing conditions, corresponding to the colored boxes under the bar plot: variants called using both GATK indel realignment and quality score recalibration (purple), indel realignment only (red), quality score recalibration only (green), or neither (blue).

DOI: 10.7287/peerj.preprints.403v2/supp-3
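The four-way classification used in the ANNOVAR annotation figures can be illustrated with a minimal sketch. Only the MAF < 0.01 threshold and the color assignments come from the caption of Figure S3. The function name, the use of `None` for an unknown frequency, the boolean damaging flag, and the handling of more common variants as a separate unplotted group are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Optional

LOW_FREQ_MAF = 0.01  # threshold stated in the caption (MAF < 0.01)

# Colors as described in the caption of Figure S3.
CATEGORY_COLORS = {
    "low_frequency_damaging": "orange",
    "low_frequency_not_damaging": "green",
    "unknown_frequency_damaging": "red",
    "unknown_frequency_not_damaging": "blue",
}


def classify_variant(maf: Optional[float], damaging: bool) -> str:
    """Assign a variant to one of the four plotted categories.

    maf=None stands for "no population frequency available" (assumption).
    Variants with MAF >= 0.01 are labeled "common" and are assumed not to
    belong to any plotted category.
    """
    if maf is None:
        frequency = "unknown_frequency"
    elif maf < LOW_FREQ_MAF:
        frequency = "low_frequency"
    else:
        return "common"
    return f"{frequency}_{'damaging' if damaging else 'not_damaging'}"


def tally_categories(variants):
    """Count categories for an iterable of (maf, damaging) pairs."""
    return Counter(classify_variant(maf, damaging) for maf, damaging in variants)


# Hypothetical toy input: (MAF or None, predicted damaging?)
toy_variants = [(0.005, True), (0.002, False), (None, False), (None, True), (0.25, False)]
toy_counts = tally_categories(toy_variants)
```

A tally like `toy_counts` makes it easy to compare the number of unknown frequency variants against the number of low frequency variants, which the caption uses as an informal indicator of the false positive rate.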

Figure S4 : Distribution of ANNOVAR Annotations for Indel Variants in Coding Regions

A. Distribution of variant types defined in the ANNOVAR exome report for selected 1000 Genomes (1KG) exome samples (n=12). Variants are classified based upon population frequency and damaging prediction (see Methods). Unknown frequency variants are displayed in blue. There were no low frequency indels in these samples, and damaging predictions primarily apply to SNPs. Although all samples should contain some unknown frequency variants, a high proportion of unknown frequency variants would be expected to correlate with a high false positive rate. However, indel characterization has not been as comprehensive as SNP characterization, so this may be a less fair metric for indels. Seven variant calling strategies were tested (GATK UnifiedGenotyper and HaplotypeCaller, with and without filtering low quality variants; VarScan with 3 sets of parameters, see Methods). “VarScan-Cons” is the most conservative set of parameters for VarScan. Each variant caller was also tested with 4 preprocessing conditions, corresponding to the colored boxes under the bar plot: variants called using both GATK indel realignment and quality score recalibration (purple), indel realignment only (red), quality score recalibration only (green), or neither (blue). B. Same as A, but for Illumina exome samples from SRP019719 (n=15).

DOI: 10.7287/peerj.preprints.403v2/supp-4

Figure S5 : Targeted Exon Technical Replicate SNP Concordance (1KG – chr20)

A. Two subjects (NA18510 and NA18637) had two different targeted exon samples. Concordance between these two samples (per subject) is shown for all SNPs called on chromosome 20. Seven variant calling strategies were tested (GATK UnifiedGenotyper and HaplotypeCaller, with and without filtering low quality variants; VarScan with 3 sets of parameters, see Methods). “VarScan-Cons” is the most conservative set of parameters for VarScan. Each variant caller was also tested with 4 preprocessing conditions: variants called using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” - blue). B. Similar to A, except SNPs must occur within coding regions. C. Similar to B, except SNPs must specifically occur within a member of the targeted gene panel.

DOI: 10.7287/peerj.preprints.403v2/supp-5
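As a rough illustration of technical replicate concordance, the sketch below compares two call sets for the same subject. The exact concordance definition is given in the Methods and is not restated in this caption, so the formula used here (shared calls divided by the union of calls) is an assumption, as are the function name and the representation of a variant as a (chromosome, position, reference, alternate) tuple.

```python
def replicate_concordance(calls_a, calls_b):
    """Fraction of variants called in both replicates out of all variants
    called in either replicate (assumed definition, see lead-in).

    Returns None when neither replicate has any calls.
    """
    set_a, set_b = set(calls_a), set(calls_b)
    union = set_a | set_b
    if not union:
        return None
    return len(set_a & set_b) / len(union)


# Hypothetical chr20 SNP calls for two targeted exon samples of one subject.
replicate_1 = {("20", 100, "A", "G"), ("20", 200, "C", "T"), ("20", 300, "G", "A")}
replicate_2 = {("20", 100, "A", "G"), ("20", 200, "C", "T"), ("20", 400, "T", "C")}

toy_concordance = replicate_concordance(replicate_1, replicate_2)
```

Restricting the input sets to coding SNPs, or to SNPs within the targeted gene panel, would correspond to panels B and C under the same assumed definition.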

Figure S6 : Targeted Exon Technical Replicate Indel Concordance (1KG – chr20)

A. Two subjects (NA18510 and NA18637) had two different targeted exon samples. Concordance between these two samples (per subject) is shown for all indels called on chromosome 20. Seven variant calling strategies were tested (GATK UnifiedGenotyper and HaplotypeCaller, with and without filtering low quality variants; VarScan with 3 sets of parameters, see Methods). “VarScan-Cons” is the most conservative set of parameters for VarScan. Each variant caller was also tested with 4 preprocessing conditions: variants called using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” - blue). Unlike the SNP calls, coding and targeted variant concordance is not provided for indels because there are essentially no indels called within coding regions on chromosome 20.

DOI: 10.7287/peerj.preprints.403v2/supp-6

Figure S7 : Targeted Exon SNP Overlap (Coding Variants)

All coding variants on chromosome 20 were tabulated for all 1000 Genomes (1KG) targeted exon samples, keeping track of the samples in which the variants were called. Only coding variants on chromosome 20 were considered for the 1000 Genomes (1KG) samples, but all coding variants were considered for the SRP019719 samples. In order to simplify presentation of these results, we focused on the highest quality variant calls for each variant calling strategy: GATK UnifiedGenotyper with low-quality variants removed (UG-HQ, blue), GATK HaplotypeCaller with low-quality variants removed (HC-HQ, green), and VarScan using a custom set of conservative parameters (VarScan-Cons, red). Similarly, only variants subject to GATK indel realignment and quality score recalibration (“Full Pipeline”) are considered for this comparison. Unlike the SNP calls, coding concordance is not provided for indels because there are essentially no indels called within coding regions on chromosome 20. Similar to the 1KG exome results (Figure 6), all SNPs called by VarScan-Cons were also called with the GATK UnifiedGenotyper.

DOI: 10.7287/peerj.preprints.403v2/supp-7

Figure S8 : Distribution of ANNOVAR Annotations for Additional Variant Callers (Coding SNPs)

A. For each of the three datasets characterized in this study (1KG targeted exon, n=14; 1KG exome, n=12; SRP019719 exome, n=15), variants are classified based upon population frequency and damaging prediction (see Methods). Low frequency variants (MAF < 0.01) are displayed in orange if they are predicted to be damaging and in green if they are not predicted to be damaging. Unknown frequency variants are displayed in red if they are predicted to be damaging and in blue if they are not predicted to be damaging. This figure displays the results for the seven variant calling strategies shown in Figure 3 and Figure S3, but the results are only shown for 2 preprocessing conditions, corresponding to the colored boxes under the bar plot: variants called using both GATK indel realignment and quality score recalibration (purple) or neither step (blue). Additionally, an unfiltered set of samtools variants is displayed for those preprocessing conditions for all 3 datasets. An unfiltered set of freebayes variants is displayed for the targeted exon dataset but not for any exome datasets, because the proportion of unknown frequency variants is unacceptably high (we therefore decided that further freebayes comparisons in the exome datasets were not justified).

DOI: 10.7287/peerj.preprints.403v2/supp-8

Figure S9 : Variant Caller Overlap for Coding SNPs

All coding variants were tabulated for all exome samples (1KG targeted exon n=14, left; 1KG exome n=12, middle; SRP019719 n=15, right), keeping track of the samples in which the variants were called. Only coding variants on chromosome 20 were considered for the 1000 Genomes (1KG) samples, but all coding variants were considered for the SRP019719 samples. In order to simplify presentation of these results, we focused on the highest quality variant calls for each variant calling strategy: GATK UnifiedGenotyper with low-quality variants removed (UG-HQ, blue), GATK HaplotypeCaller with low-quality variants removed (HC-HQ, green), and VarScan using a custom set of conservative parameters (VarScan-Cons, red). An unfiltered set of samtools variants is also shown for comparison. Similarly, only variants subject to GATK indel realignment and quality score recalibration (“Full Pipeline”) are considered for this comparison. Counts for variants that would have been identified using the union of HC-HQ, UG-HQ, and samtools are shown in underlined font.

DOI: 10.7287/peerj.preprints.403v2/supp-9
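The overlap tabulation described for Figure S9 can be sketched with plain set operations. This is a minimal illustration under stated assumptions: variants are represented as hypothetical (chromosome, position, reference, alternate) tuples, the function names are invented for this example, and the toy calls are not data from the study.

```python
from collections import Counter


def overlap_regions(calls_by_caller):
    """Count variants in each exclusive region of a caller overlap diagram.

    Keys of the returned Counter are frozensets naming exactly the callers
    that reported a variant.
    """
    callers_per_variant = {}
    for caller, calls in calls_by_caller.items():
        for variant in calls:
            callers_per_variant.setdefault(variant, set()).add(caller)
    return Counter(frozenset(callers) for callers in callers_per_variant.values())


def union_size(calls_by_caller, callers):
    """Number of distinct variants reported by at least one listed caller."""
    return len(set().union(*(calls_by_caller[c] for c in callers)))


# Hypothetical toy calls keyed by the strategy labels used in the caption.
toy_calls = {
    "UG-HQ": {("20", 100, "A", "G"), ("20", 200, "C", "T"), ("20", 300, "G", "A")},
    "HC-HQ": {("20", 100, "A", "G"), ("20", 200, "C", "T")},
    "samtools": {("20", 100, "A", "G"), ("20", 400, "T", "C")},
}

toy_regions = overlap_regions(toy_calls)
toy_union = union_size(toy_calls, ["HC-HQ", "UG-HQ", "samtools"])
```

A subset check such as `toy_calls["HC-HQ"] <= toy_calls["UG-HQ"]` is one way to test whether every call from one strategy was also made by another.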

Table S1: Alignment Statistics for Targeted Exon Datasets

DOI: 10.7287/peerj.preprints.403v2/supp-10

Table S2: Alignment Statistics for 1KG Exome Samples

DOI: 10.7287/peerj.preprints.403v2/supp-11

Table S3: Alignment Statistics for SRP019719 Illumina Exome Samples

DOI: 10.7287/peerj.preprints.403v2/supp-12

Table S4: Run Times for 1KG Targeted Exon Samples

DOI: 10.7287/peerj.preprints.403v2/supp-13

Table S5: Run Times for 1KG Exome Samples

DOI: 10.7287/peerj.preprints.403v2/supp-14

Table S6: Run Times for SRP019719 Exome Samples

DOI: 10.7287/peerj.preprints.403v2/supp-15

Table S7: 1KG Exome Recovery of Validated SNPs

DOI: 10.7287/peerj.preprints.403v2/supp-16

Table S8: Recovery of Targeted Exon SNPs in Exome Data for NA18510 (1KG - chr20)

DOI: 10.7287/peerj.preprints.403v2/supp-17

Table S9: Recovery of Targeted Exon SNPs in Exome Data for NA18637 (1KG - chr20)

DOI: 10.7287/peerj.preprints.403v2/supp-18

Additional Information

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Charles D Warden conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

Aaron W Adamson conceived and designed the experiments, wrote the paper, reviewed drafts of the paper.

Susan L Neuhausen conceived and designed the experiments, wrote the paper, reviewed drafts of the paper.

Xiwei Wu conceived and designed the experiments, wrote the paper, reviewed drafts of the paper.

Grant Disclosures

The following grant information was disclosed by the authors:

National Institutes of Health [Comprehensive Cancer Center Grant P30 CA33572]

Funding

This study is supported by National Institutes of Health [Comprehensive Cancer Center Grant P30 CA33572] and City of Hope National Medical Center institutional funding. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

