Information theoretic alignment free variant calling

Justin Bedo; Benjamin Goudey; Jeremy Wazny; Zeyu Zhou

doi:10.7287/peerj.preprints.2015v1

Justin Bedo^1,2, Benjamin Goudey^1,3, Jeremy Wazny^1, Zeyu Zhou^1,4

1 IBM Research -- Australia, Carlton, VIC, Australia

2 Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia

3 Centre For Epidemiology and Biostatistics, The University of Melbourne, Parkville, VIC, Australia

4 School of Mathematics and Statistics, The University of Melbourne, Parkville, VIC, Australia

Published: 2016-05-03
Accepted: 2016-05-03

Subject Areas: Bioinformatics, Computational Biology
Keywords: alignment free, variant, assembly free, genome, bacteria, feature extraction

Copyright: © 2016 Bedo et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Bedo J, Goudey B, Wazny J, Zhou Z. 2016. Information theoretic alignment free variant calling. PeerJ Preprints 4:e2015v1 https://doi.org/10.7287/peerj.preprints.2015v1

Abstract

While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment and assembly free variant calling method based on information theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide conditioned on the surrounding prefix and suffix of length k as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence. The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks, with performance similar to that of standard reference based calls and of another reference free method (DiscoSNP++). Comparisons against reference based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm’s focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.
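To make the idea of context-conditioned nucleotide distributions concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the exact statistic, so this sketch assumes a simplified two-sample comparison: it counts the middle nucleotide for each (prefix, suffix) context of length k, estimates a multinomial with an assumed pseudocount, and scores each shared context by a symmetrised Kullback–Leibler divergence. All function names, the pseudocount, and the threshold are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from math import log

ALPHABET = "ACGT"


def context_counts(reads, k):
    """Count the middle nucleotide seen in each (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        # Only positions with k bases available on both sides are usable.
        for i in range(k, len(read) - k):
            mid = read[i]
            if mid not in ALPHABET:
                continue  # skip ambiguous bases such as N
            context = (read[i - k:i], read[i + 1:i + 1 + k])
            counts[context][mid] += 1
    return counts


def multinomial_estimate(counter, pseudocount=1.0):
    """Smoothed multinomial estimate over A, C, G, T (pseudocount is an assumption)."""
    total = sum(counter[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return [(counter[b] + pseudocount) / total for b in ALPHABET]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))


def divergent_contexts(reads_a, reads_b, k, threshold):
    """Return contexts shared by both samples whose symmetrised KL exceeds threshold."""
    a = context_counts(reads_a, k)
    b = context_counts(reads_b, k)
    result = {}
    for context in a.keys() & b.keys():
        p = multinomial_estimate(a[context])
        q = multinomial_estimate(b[context])
        score = kl_divergence(p, q) + kl_divergence(q, p)
        if score > threshold:
            result[context] = score
    return result
```

For example, calling `divergent_contexts(["ACGTA"] * 20, ["ACTTA"] * 20, k=2, threshold=1.0)` flags the single context `("AC", "TA")`, because the two toy samples disagree on the nucleotide between that prefix and suffix. A statistic suited to real data would also need to account for intra-sample stability across many samples, which this two-sample sketch does not attempt.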

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

Accessions for the bacterial isolates used in this study

DOI: 10.7287/peerj.preprints.2015v1/supp-1
