Next generation cluster editing

doi:10.7287/peerj.preprints.1301v1

Thomas Bellitto¹, Tobias Marschall², Alexander Schönhuth³, Gunnar W Klau³

1 Combinatorics and Algorithms Team, Laboratoire Bordelais de Recherche en Informatique, Bordeaux, France

2 Max Planck Institute for Informatics, Saarbrücken, Germany

3 Life Sciences, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands

Published: 2015-08-13
Accepted: 2015-08-13

Subject Areas: Bioinformatics, Computational Biology, Algorithms and Analysis of Algorithms, Data Science
Keywords: next-generation sequencing, graph theory, bioinformatics, cluster editing

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Bellitto T, Marschall T, Schönhuth A, Klau GW. 2015. Next generation cluster editing. PeerJ PrePrints 3:e1301v1 https://doi.org/10.7287/peerj.preprints.1301v1

Abstract

Genomic structural variations play key roles in genetic diversity and disease. Despite recent advances in structural variation discovery, many variants remain undiscovered. Midsize insertions and deletions pose particularly difficult algorithmic challenges. The recent CLEVER algorithm addressed these challenges with a statistical model on cliques in a graph whose nodes are read alignments and whose edges arise from a statistical test on the length and overlap of read alignments. However, the resulting read alignment clusters tend to be too small and overlap heavily, which reduces recall. Here we present a model based on weighted cluster editing that alleviates these issues: clusters are provably non-overlapping and tend to be larger. To render the underlying optimization problem tractable on all read alignments of a genome, we present a novel, principled heuristic that runs in time linear in the length of the genome. The heuristic is based on an exact polynomial-time algorithm for weighted cluster editing in one-dimensional point graphs. We demonstrate that the new model improves on the recall achieved by CLEVER.
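For readers unfamiliar with the objective, the following minimal sketch illustrates weighted cluster editing on a toy instance. It is not the authors' algorithm. The function names (`editing_cost`, `best_contiguous_clustering`), the pair-keyed weight dictionary, and the restriction to clusters that form contiguous runs of nodes in sorted order are all assumptions made for illustration. A positive weight is read as the cost of deleting an existing edge, a negative weight as the cost of inserting a missing one.

```python
def editing_cost(weights, clusters):
    """Cost of editing the weighted graph into the given disjoint clusters.

    weights[(u, v)] > 0: edge present, deleting it costs weights[(u, v)].
    weights[(u, v)] < 0: edge absent, inserting it costs -weights[(u, v)].
    Pairs not listed are treated as weight 0 (hypothetical convention).
    """
    label = {node: i for i, cluster in enumerate(clusters) for node in cluster}
    cost = 0.0
    for (u, v), w in weights.items():
        same_cluster = label[u] == label[v]
        if same_cluster and w < 0:
            cost += -w          # insert a missing edge inside a cluster
        elif not same_cluster and w > 0:
            cost += w           # delete an edge between two clusters
    return cost


def best_contiguous_clustering(n, weights):
    """Dynamic program over nodes 0..n-1, assuming every cluster is a
    contiguous run in this order (an illustrative assumption).

    Minimizing the editing cost is equivalent to maximizing the total
    weight of pairs placed inside clusters, which is what the DP does.
    """
    def w(u, v):
        return weights.get((u, v), 0.0)

    # inner[i][j]: total weight of all pairs inside the run i..j-1
    inner = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 2, n + 1):
            inner[i][j] = inner[i][j - 1] + sum(w(u, j - 1) for u in range(i, j - 1))

    # best[j]: maximum within-cluster weight for the prefix 0..j-1
    best = [0.0] + [float("-inf")] * n
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + inner[i][j]
            if candidate > best[j]:
                best[j], cut[j] = candidate, i

    # Walk the stored cut points backwards to recover the clusters.
    clusters, j = [], n
    while j > 0:
        clusters.append(list(range(cut[j], j)))
        j = cut[j]
    return clusters[::-1]


# Toy instance with four nodes; keys are pairs (u, v) with u < v.
toy_weights = {(0, 1): 2, (0, 2): -1, (1, 2): 1, (2, 3): 3, (0, 3): -4, (1, 3): -2}
toy_clusters = best_contiguous_clustering(4, toy_weights)
toy_cost = editing_cost(toy_weights, toy_clusters)
```

On this toy instance the sketch splits the nodes into `[[0, 1], [2, 3]]` at a cost of 1.0, which corresponds to deleting the single edge between nodes 1 and 2. The simple table construction takes cubic time and is meant only to make the objective concrete.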

Author Comment

This work has been presented at the German Conference on Bioinformatics 2015.