Detailed comparison of two popular variant calling packages for exome and targeted exon studies

Charles D Warden; Aaron W Adamson; Susan L Neuhausen; Xiwei Wu

doi:10.7287/peerj.preprints.403v1

Charles D Warden¹, Aaron W Adamson², Susan L Neuhausen², Xiwei Wu¹

¹ Integrative Genomics Core, Department of Molecular and Cellular Biology, City of Hope, Duarte, California, United States

² Department of Population Sciences, City of Hope National Medical Center, Duarte, California, United States

Published: 2014-06-04
Accepted: 2014-06-04

Subject Areas: Bioinformatics, Genomics
Keywords: Variant Calling, Exome, Targeted Sequencing, GATK, VarScan, SNP, small indel

Copyright: © 2014 Warden et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Warden CD, Adamson AW, Neuhausen SL, Wu X. 2014. Detailed comparison of two popular variant calling packages for exome and targeted exon studies. PeerJ PrePrints 2:e403v1 https://doi.org/10.7287/peerj.preprints.403v1

Abstract

The Genome Analysis Toolkit (GATK) is often considered to be the “gold standard” for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but we felt that an adequate comparison against VarScan may not yet have been performed. Specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low-quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high-quality gene list, with high concordance (>97%) when compared to high-quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and base quality score recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons were also performed on two additional variant callers.
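To make the two overlap metrics mentioned in the abstract concrete, the following is a minimal sketch of how concordance and recovery between two variant lists might be computed. The definitions used here are assumptions for illustration and may differ from the ones in the full Methods: concordance is taken as the fraction of calls in a test list that also appear in a reference list, and recovery as the fraction of the reference list found in the test list. The function name `compare_calls`, the variant key layout, and the toy coordinates are all hypothetical and are not data from the study.

```python
from typing import NamedTuple, Set, Tuple

# Assumed variant key: (chromosome, 1-based position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class OverlapStats(NamedTuple):
    shared: int         # number of variants present in both lists
    concordance: float  # shared / size of the test list (assumed definition)
    recovery: float     # shared / size of the reference list (assumed definition)


def compare_calls(test_calls: Set[Variant], reference_calls: Set[Variant]) -> OverlapStats:
    """Compare a test call set (e.g. conservative calls) with a reference call set."""
    shared = len(test_calls & reference_calls)
    concordance = shared / len(test_calls) if test_calls else 0.0
    recovery = shared / len(reference_calls) if reference_calls else 0.0
    return OverlapStats(shared, concordance, recovery)


# Toy, made-up call sets: 4 test calls, 5 reference calls, 3 in common.
test_calls = {
    ("chr20", 100, "A", "G"),
    ("chr20", 250, "C", "T"),
    ("chr20", 400, "G", "A"),
    ("chr20", 900, "T", "C"),
}
reference_calls = {
    ("chr20", 100, "A", "G"),
    ("chr20", 250, "C", "T"),
    ("chr20", 400, "G", "A"),
    ("chr20", 550, "A", "T"),
    ("chr20", 700, "G", "C"),
}

stats = compare_calls(test_calls, reference_calls)
print(f"shared={stats.shared} concordance={stats.concordance:.2f} recovery={stats.recovery:.2f}")
```

Keying each variant on position and both alleles means that two callers reporting different alternate alleles at the same site are counted as discordant, which is one of several reasonable choices.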

Supplemental Information

Figure S1: Run Times for Variant Calling Pipelines for SRP019719

Run times for Illumina exome samples from SRP019719 (n=15, Table S3). Run times are displayed for the entire variant calling pipeline when using both GATK indel realignment and quality score recalibration (“Full Pipeline” - purple), indel realignment only (“Realign Only” - red), quality score recalibration only (“Recalibrate Only” - green), or neither (“No Preprocess” - blue). Run times are provided for the GATK UnifiedGenotyper (“UG”), GATK HaplotypeCaller (“HC”), and VarScan using 3 sets of parameters (see Methods). “VarScan-Cons” is the most conservative set of parameters. Variant calling was performed with varying amounts of concurrent usage, which seemed to affect the GATK HaplotypeCaller most strongly when analyzing data that had undergone quality score recalibration without indel realignment. We think this is due to a set of quality scores that required non-default parameters, and we do not think it is safe to assume that this trend will apply to most other datasets (for example, this was not an issue for the 1000 Genomes exome samples).

Figure S2: Number of Variants Called for GATK versus VarScan (SRP019719 – all chromosomes)

Figure S3: Distribution of ANNOVAR Annotations for SNP Variants (SRP019719)

Figure S4: Distribution of ANNOVAR Annotations for Indel Variants in Coding Regions

Figure S5: Targeted Exon Technical Replicate SNP Concordance (1KG – chr20)

Figure S6: Targeted Exon Technical Replicate Indels Concordance (1KG – chr20)

Figure S7: Targeted Exon SNP Overlap (Coding Variants)

Figure S8: Distribution of ANNOVAR Annotations for Additional Variant Callers (Coding SNPs)

Figure S9: Variant Caller Overlap for Coding SNPs

Table S1: Alignment Statistics for Targeted Exon Datasets

Table S2: Alignment Statistics for 1KG Exome Samples

Table S3: Alignment Statistics for SRP019719 Illumina Exome Samples

Table S4: Run Times for 1KG Targeted Exon Samples

Table S5: Run Times for 1KG Exome Samples

Table S6: Run Times for SRP019719 Exome Samples

Table S7: 1KG Exome Recovery of Validated SNPs

Table S8: Recovery of Targeted Exon SNPs in Exome Data for NA18510 (1KG - chr20)

Table S9: Recovery of Targeted Exon SNPs in Exome Data for NA18637 (1KG - chr20)
