Simultaneous gene finding in multiple genomes

Abstract
As whole-genome sequencing continues to scale up, the new challenge is the accurate and consistent annotation of entire clades of genomes. We address this problem with a new approach to comparative gene finding that takes a multiple genome alignment of closely related species and simultaneously predicts the location and structure of protein-coding genes in all input genomes, thereby exploiting negative selection and sequence conservation. The model prefers potential gene structures in the different genomes that agree with each other or, where they differ, whose exon gains and losses are plausible given the species tree. We formulate the multi-species gene finding problem as a binary labeling problem on a graph. The resulting optimization problem is NP-hard, but can be efficiently approximated using a subgradient-based dual decomposition approach. The proposed method was tested on a whole-genome alignment of 12 Drosophila species, and its accuracy was evaluated on D. melanogaster. The method is being implemented as an extension to the gene finder AUGUSTUS.
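To make the optimization idea concrete, the following is a minimal toy sketch of subgradient-based dual decomposition for a binary labeling problem. It is not the AUGUSTUS implementation: the scores, the split into `f1` (per-candidate scores) and `f2` (agreement between hypothetical orthologous exon candidates), and all names are assumptions chosen for illustration, and the subproblems are solved by brute force, which is only feasible at toy sizes.

```python
from itertools import product

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def argmax_labeling(score, n):
    # Brute-force maximiser over all binary labelings of length n (toy sizes only).
    return max(product((0, 1), repeat=n), key=score)

def dual_decomposition(f1, f2, n, iterations=100, step0=1.0):
    # Approximately maximise f1(x) + f2(x) over binary x by relaxing the
    # constraint that both subproblems use the same labeling.
    lam = [0.0] * n  # Lagrange multipliers, one per binary variable
    best_x, best_primal, best_dual = None, float("-inf"), float("inf")
    for t in range(1, iterations + 1):
        # Solve the two subproblems independently under the current multipliers.
        x = argmax_labeling(lambda z: f1(z) + dot(lam, z), n)
        y = argmax_labeling(lambda z: f2(z) - dot(lam, z), n)
        # The dual value is an upper bound on the true optimum for any lam.
        dual = f1(x) + dot(lam, x) + f2(y) - dot(lam, y)
        best_dual = min(best_dual, dual)
        # Either subproblem solution is a feasible labeling (a lower bound).
        for cand in (x, y):
            value = f1(cand) + f2(cand)
            if value > best_primal:
                best_x, best_primal = cand, value
        if x == y:  # subproblems agree: the bound is tight
            break
        # Subgradient step on the dual with a diminishing step size.
        step = step0 / t
        lam = [l - step * (xi - yi) for l, xi, yi in zip(lam, x, y)]
    return best_x, best_primal, best_dual

# Hypothetical toy instance: four exon candidates, two per species.
UNARY = [1.0, -0.5, -0.8, 0.6]   # assumed per-candidate scores
ORTHOLOGS = [(0, 2), (1, 3)]     # assumed orthologous candidate pairs
BONUS = 0.7                      # assumed reward for agreeing labels

def f1(x):
    return dot(UNARY, x)

def f2(x):
    return BONUS * sum(x[i] == x[j] for i, j in ORTHOLOGS)

best_x, primal, dual = dual_decomposition(f1, f2, 4)
```

Here `primal` is the score of the best feasible labeling found and `dual` is an upper bound on the optimum, so the gap between them bounds how far the returned labeling can be from optimal.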
Cite this as
2015. Simultaneous gene finding in multiple genomes. PeerJ PrePrints 3:e1296v1 https://doi.org/10.7287/peerj.preprints.1296v1
Author comment
This work has been presented at the German Conference on Bioinformatics 2015.
Additional Information
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
Stefanie König conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper.
Lars Romoth performed the experiments, performed the computation work, reviewed drafts of the paper.
Lizzy Gerischer performed the computation work, reviewed drafts of the paper.
Mario Stanke conceived and designed the experiments, analyzed the data, wrote the paper, performed the computation work, reviewed drafts of the paper.
Data Deposition
The following information was supplied regarding the deposition of related data:
Source code and data sets are available for download at http://bioinf.uni-greifswald.de/augustus/
Funding
S.K. was supported by a scholarship from the Studienstiftung des deutschen Volkes and L.R. by Research Training Group 1870 from the German Research Foundation (DFG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.