Detailed comparison of two popular variant calling packages for exome and targeted exon studies
- Published
- Accepted
- Subject Areas
- Bioinformatics, Genomics
- Keywords
- Variant Calling, Exome, Targeted Sequencing, GATK, VarScan, SNP, small indel
- Copyright
- © 2014 Warden et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
- Cite this article
- 2014. Detailed comparison of two popular variant calling packages for exome and targeted exon studies. PeerJ PrePrints 2:e403v1 https://doi.org/10.7287/peerj.preprints.403v1
Abstract
The Genome Analysis Toolkit (GATK) is often considered to be the “gold standard” for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but we felt that an adequate comparison against VarScan may have not yet been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also accessed the impact of pre-processing (e.g., indel realignment and quality score base recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in accessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.
Supplemental Information
Figure S1 : Run Times for Variant Calling Pipelines for SRP019719
Run times for Illumina exome samples from SRP019719 (n=15, Table S3). Run-times are displayed for the entire variant calling pipeline when using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” – blue). Run times are provided for the GATK UnifiedGenotyper (“UG”), GATK HaplotypeCaller (“HC”), and VarScan using 3 sets of parameters (see Methods). “VarScan-Cons” is the most conservative set of parameters. Variant calling was performed with varying amounts of concurrent usage, which seemed to most significantly affect the GATK HaplotypeCaller when analyzing data that was subject to quality score recalibration without indel realignment. We think this is due to a set of quality scores that required non-default parameters, and we don’t think it is safe to assume this trend will necessarily apply to most other datasets (for example, this was not an issue for the 1000 Genomes exome samples).
Figure S2 : Number of Variants Called for GATK versus VarScan (SRP019719 – all chromosomes)
Number of SNPs (left) and indels (right) called for Illumina exome samples from SRP019719 (n=15). The number of variants called is displayed for four types of pre-processing: using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” – blue). Variant counts are provided for the GATK UnifiedGenotyper (“UG”), GATK HaplotypeCaller (“HC”), and VarScan using 3 sets of parameters (see Methods). UnifiedGenotyper and HaplotypeCaller variants are then divided into the set of all variants (“UG-all” and “HC-all”) and higher-quality variant calls where variants flagged as low quality have been removed (“UG-HQ” and “HC-HQ”). “VarScan-Cons” is the most conservative set of parameters for VarScan.
Figure S3 : Distribution of ANNOVAR Annotations for SNP Variants (SRP019719)
Distribution of variant types defined in the ANNOVAR exome report for Illumina exome samples from SRP019719 (n=15). Variants are classified based upon population frequency and damaging prediction (see Methods). Low frequency variants (MAF < 0.01) that are displayed in orange if they are predicted to be damaging and are displayed in green if they are not predicted to be damaging. Unknown frequency variants are displayed in red if they are predicted to be damaging and are displayed in blue if they are not predicted to be damaging. Although all samples should contain some unknown frequency variants, a high proportion of unknown frequency variants are expected to correlate with a high false positive rate (for example, ideally, you would expect fewer unknown frequency variants than low frequency variants). Seven variant calling strategies were tested (GATK UnifiedGenotyper and HaplotypeCaller, with and without filtering low quality variants; VarScan with 3 sets of parameters, see Methods). “VarScan-Cons” is the most conservative set of parameters for VarScan. Each variant caller was also tested with 4 preprocessing conditions, corresponding to the colored boxes under the bar plot: variants called using both GATK indel realignment and quality score recalibration (purple), indel realignment only (red), quality score recalibration only (green), or neither (blue).
Figure S4 : Distribution of ANNOVAR Annotations for Indel Variants in Coding Regions
A. Distribution of variant types defined in the ANNOVAR exome report for selected 1000 Genomes (1KG) exome samples (n=12). Variants are classified based upon population frequency and damaging prediction (see Methods). Unknown frequency variants are displayed in blue. There were no low frequency indels in these samples and damaging predictions primarily apply to SNPs. Although all samples should contain some unknown frequency variants, a high proportion of unknown frequency variants would ideally be expected to correlate with a high false positive rate: however, indel characterization has not been as comprehensive as SNP characterization, so this may be less fair metric for indels. Seven variant calling strategies were tested (GATK UnifiedGenotyper and HaplotypeCaller, with and without filtering low quality variants; VarScan with 3 sets of parameters, see Methods). “VarScan-Cons” is the most conservative set of parameters for VarScan. Each variant caller was also tested with 4 preprocessing conditions, corresponding to the colored boxes under the bar plot: variants called using both GATK indel realignment and quality score recalibration (purple), indel realignment only (red), quality score recalibration only (green), or neither (blue). B. Same as A, but Illumina exome samples from SRP019719 (n=15).
Figure S5 : Targeted Exon Technical Replicate SNP Concordance (1KG – chr20)
A. Two subjects (NA18510 and NA18637) had two different targeted exon samples. Concordance between these two samples (per subject) is shown for all SNPs called on chromosome 20. Seven variant calling strategies were tested (GATK UnifiedGenotyper and HaplotypeCaller, with and without filtering low quality variants; VarScan with 3 sets of parameters, see Methods). “VarScan-Cons” is the most conservative set of parameters for VarScan. Each variant caller was also tested with 4 preprocessing conditions: variants called using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” - blue). B. Similar to A, except SNPs must occur within coding regions. C. Similar to B, except SNP must specifically occur within a member of the targeted gene panel.
Figure S6 : Targeted Exon Technical Replicate Indels Concordance (1KG – chr20)
A. Two subjects (NA18510 and NA18637) had two different targeted exon samples. Concordance between these two samples (per subject) is shown for all SNPs called on chromosome 20. Seven variant calling strategies were tested (GATK UnifiedGenotyper and HaplotypeCaller, with and without filtering low quality variants; VarScan with 3 sets of parameters, see Methods). “VarScan-Cons” is the most conservative set of parameters for VarScan. Each variant caller was also tested with 4 pre-processing conditions: variants called using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” - blue). Unlike the SNP calls, coding and targeted variant concordance is not provided for indels because there are essentially no indels called within coding regions on chromosome 20.
Figure S7 : Targeted Exon SNP Overlap (Coding Variants)
All coding variants on chromosome 20 were tabulated for 1000 Genomes (1KG) all targeted exon samples, keeping track of the samples in which the variants were called. Only coding variants on chromosome 20 were considered for the 1000 Genomes (1KG) samples, but all coding variant were considered for the SRP019719 samples. In order to simplify presentation of these results, we focused on the highest quality variant calls for each variant calling strategy: GATK UnifiedGenotyper with low-quality variants removed (UG-HQ, blue), GATK HaplotypeCaller with low-quality variants removed (HC-HQ, green), and VarScan using a custom set of conservative parameters (VarScan-Cons, red). Similarly, only variants subject to GATK indel realignment and quality score recalibration (“Full Pipeline”) are considered for this comparison. Unlike the SNP calls, coding concordance is not provided for indels because there are essentially no indels called within coding regions on chromosome 20. Similar to the 1KG exome results (Figure 6), all SNPs called by VarScan-Cons were also called with the GATK UnifiedGenotyper.
Figure S8 : Distribution of ANNOVAR Annotations for Additional Variant Callers (Coding SNPs)
A. For each of the three datasets characterized in this study (1KG targeted exon, n=14; 1KG exome, n=12; SRP019719 exome, n=15), variants are classified based upon population frequency and damaging prediction (see Methods). Low frequency variants (MAF < 0.01) that are displayed in orange if they are predicted to be damaging and are displayed in green if they are not predicted to be damaging. Unknown frequency variants are displayed in red if they are predicted to be damaging and are displayed in blue if they are not predicted to be damaging. This figure displays the results for the seven variant calling strategies shown in Figure 3 and Figure S3, but the results are only shown for 2 preprocessing conditions, corresponding to the colored boxes under the bar plot: variants called using both GATK indel realignment and quality score recalibration (purple) or neither step (blue). Additionally, an unfiltered set of samtools variants are displayed for those pre-processing conditions for all 3 datasets. An unfiltered set of freebayes variants are displayed for the targeted exon dataset but not for any exome datasets because the proportion of unknown frequency variants is unacceptably high (and therefore we decided it didn’t justify further comparisons in the exome datasets).
Figure S9 : Variant Caller Overlap for Coding SNPs
All coding variants were tabulated for all exome samples (1KG targeted exon n=14, left; 1KG exome n=12, middle; SRP019719 n=15, right), keeping track of the samples in which the variants were called. Only coding variants on chromosome 20 were considered for the 1000 Genomes (1KG) samples, but all coding variants were considered for the SRP019719 samples. In order to simplify presentation of these results, we focused on the highest quality variant calls for each variant calling strategy: GATK UnifiedGenotyper with low-quality variants removed (UG-HQ, blue), GATK HaplotypeCaller with low-quality variants removed (HC-HQ, green), and VarScan using a custom set of conservative parameters (VarScan-Cons, red). An unfiltered set of samtools variants are also shown for comparison. Similarly, only variants subject to GATK indel realignment and quality score recalibration (“Full Pipeline”) are considered for this comparison. Counts for variants that would have been identified using the union of HC-HQ, UG-HQ, and samtools are shown in underlined font.