Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology

Lucas D Wittwer; Ivana Piližota; Adrian M Altenhoff; Christophe Dessimoz

doi:10.7287/peerj.preprints.421v1

Lucas D Wittwer^1,2, Ivana Piližota^1, Adrian M Altenhoff^1,2,3, Christophe Dessimoz^1

1 Genetics, Evolution & Environment; Computer Science, University College London, London, United Kingdom

2 Department of Computer Science, ETH Zurich, Zurich, Switzerland

3 Swiss Institute of Bioinformatics, Zurich, Switzerland

Published: 2014-06-25
Accepted: 2014-06-25

Subject Areas: Bioinformatics, Computational Biology, Evolutionary Studies, Genomics, Computational Science
Keywords: all-against-all, sequence alignment, homology, orthology, smith-waterman

Copyright: © 2014 Wittwer et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Wittwer LD, Piližota I, Altenhoff AM, Dessimoz C. 2014. Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology. PeerJ PrePrints 2:e421v1 https://doi.org/10.7287/peerj.preprints.421v1

Abstract

Orthology inference and other sequence analyses across multiple genomes typically start by performing exhaustive pairwise sequence comparisons, a process referred to as “all-against-all”. Because this step scales quadratically with the number of sequences analysed, it can become a bottleneck, limiting the number of genomes that can be analysed simultaneously. Here, we explored ways of speeding up the all-against-all step while maintaining its sensitivity. By exploiting the transitivity of homology and, crucially, ensuring that homology is defined in terms of consistent protein subsequences, our proof-of-concept decreased the time complexity by ~75% while recovering >99.6% of all homologs identified by the full all-against-all procedure on empirical sequences from bacteria and fungi. In comparison, state-of-the-art k-mer approaches are orders of magnitude faster but recover only 3–14% of all homologous pairs. We also outline ideas to further improve the speed and recall of the new approach. An open source implementation is provided as part of the OMA standalone software at http://omabrowser.org/standalone.
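
The two ideas in the abstract can be illustrated with a minimal sketch. This is not the OMA standalone implementation; it only shows (i) how the number of pairs grows in a full all-against-all and (ii) one plausible way of requiring subsequence-level consistency before inferring homology by transitivity. The function names, the interval representation of aligned regions, and the `min_overlap` threshold of 20 residues are all assumptions made for illustration.

```python
from itertools import combinations


def all_pairs_count(n):
    """Number of unordered pairs compared in a full all-against-all of n sequences."""
    return n * (n - 1) // 2


def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def candidate_pairs(members, min_overlap=20):
    """Pairs of sequences worth aligning, given a shared representative.

    `members` maps a sequence id to the (start, end) region of the
    representative that the sequence aligns to (hypothetical input format).
    Two members are retained as a candidate homologous pair only if they
    match an overlapping region of the representative; members matching
    disjoint regions (e.g. different domains) are not paired.
    """
    return [
        (x, y)
        for x, y in combinations(sorted(members), 2)
        if overlap(members[x], members[y]) >= min_overlap
    ]


# Illustrative data: A and B align to overlapping regions, C to a distinct one.
members = {
    "A": (0, 120),
    "B": (30, 150),
    "C": (200, 320),
}

print(all_pairs_count(1000))     # 499500
print(candidate_pairs(members))  # [('A', 'B')]
```

In this toy case, only the pair (A, B) would be passed on to a full alignment, whereas naive transitivity through the representative would also pair C with A and B despite their matching a different region.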

Author Comment

This manuscript is currently undergoing peer review at PeerJ. It is meant to be a contribution to the GNOME 2014 symposium.

Supplemental Information

List of species included in the analyses

A table containing the list of species used in the comparisons, including scientific name, taxonomic ID, primary source, and version number.

DOI: 10.7287/peerj.preprints.421v1/supp-1

Detailed comparisons with kClust and UCLUST

Detailed comparisons with kClust and UCLUST in terms of recall and runtime, for several sets of input parameters.

DOI: 10.7287/peerj.preprints.421v1/supp-2
