BALSA: Integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU

Department of Computer Science, The University of Hong Kong, Hong Kong, China
School of Science and Technology, The Open University of Hong Kong, Hong Kong, China
DOI
10.7287/peerj.preprints.373v1
Subject Areas
Bioinformatics, Computational Biology, Genomics, Computational Science
Keywords
Secondary analysis, Whole-genome sequencing, Whole-exome sequencing, GPU, Variant calling, Genomics, NGS, HPC
Copyright
© 2014 Luo et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Cite this article
Luo R, Wong Y, Law W, Lee L, Liu C, Cheung J, Lam T. 2014. BALSA: Integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU. PeerJ PrePrints 2:e373v1

Abstract

This paper reports BALSA, an integrated solution for the secondary analysis of next-generation sequencing data. It exploits the computational power of the GPU together with intricate memory management to deliver fast and accurate analysis. Going from raw reads to variants (including SNPs and Indels) on a single computing node with a commodity GPU board, BALSA takes 5.5 hours to process 50-fold whole-genome sequencing data (~750 million 100 bp paired-end reads), or just 25 minutes for 210-fold whole-exome sequencing data. BALSA's speed is rooted in parallel algorithms that effectively exploit the GPU to accelerate steps such as alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels, and it achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs and is more sensitive in somatic Indel detection. BALSA outputs variants in VCF format. It also provides a pileup-like SNAPSHOT format that maintains the same fidelity as BAM for variant calling, enables efficient storage and indexing, and facilitates the development of apps for downstream analyses.
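The relationship between the read count and the fold coverage quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and is not part of BALSA. It assumes that "750 million paired-end reads" means 750 million read pairs (two 100 bp reads each) and that the human genome is roughly 3 Gbp; both are assumptions, and the function names are hypothetical.

```python
def fold_coverage(read_pairs, read_length_bp, genome_size_bp):
    """Average sequencing depth: total sequenced bases / genome size."""
    # Each pair contributes two reads of read_length_bp bases.
    total_bases = 2 * read_pairs * read_length_bp
    return total_bases / genome_size_bp


def reads_per_second(read_pairs, hours):
    """Average number of individual reads processed per second."""
    return 2 * read_pairs / (hours * 3600)


if __name__ == "__main__":
    READ_PAIRS = 750_000_000        # from the abstract, read as pairs (assumption)
    READ_LENGTH_BP = 100            # from the abstract
    GENOME_SIZE_BP = 3_000_000_000  # approximate human genome size (assumption)
    RUNTIME_HOURS = 5.5             # from the abstract

    depth = fold_coverage(READ_PAIRS, READ_LENGTH_BP, GENOME_SIZE_BP)
    rate = reads_per_second(READ_PAIRS, RUNTIME_HOURS)
    print(f"fold coverage: {depth:.1f}x")
    print(f"average throughput: {rate:,.0f} reads/s")
```

Under these assumptions, 150 Gbp of sequence over a 3 Gbp genome gives 50-fold coverage, which is consistent with the figure reported in the abstract.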

BALSA is available at: http://sourceforge.net/p/balsa

Author Comment

Paper submitted to PeerJ for peer-review.

Supplemental Information

Figure S1

Venn diagrams illustrating the overlaps, for SNPs, among 1) BALSA, 2) the six individual callers: Atlas2-SNP (Atlas), Freebayes, GATK HaplotypeCaller (HC), Samtools, GATK UnifiedGenotyper (UG) and Mutect, and 3) the known variants.

DOI: 10.7287/peerj.preprints.373v1/supp-2

Figure S2

Venn diagrams illustrating the overlaps, for Indels, among 1) BALSA, 2) the six individual callers: Atlas2-SNP (Atlas), Freebayes, GATK HaplotypeCaller (HC), Samtools, GATK UnifiedGenotyper (UG) and Varscan, and 3) the known variants.

DOI: 10.7287/peerj.preprints.373v1/supp-3

Tables S1, S2, S3, S4, S5 and S6

DOI: 10.7287/peerj.preprints.373v1/supp-4