SHsearch: a method for fast remote homology detection

Data Science, Careem GmbH, Berlin, Germany
DOI
10.7287/peerj.preprints.27111v1
Subject Areas
Bioinformatics, Artificial Intelligence, Data Mining and Machine Learning, Data Science
Keywords
Data Mining, Statistical Analysis, Markov Models
Copyright
© 2018 Baddar
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Baddar M. 2018. SHsearch: a method for fast remote homology detection. PeerJ Preprints 6:e27111v1

Abstract

Remote homology detection is the problem of detecting homology in cases of low sequence similarity. It is a hard computational problem, and no single approach works well in all cases. Methods based on profile hidden Markov models (HMMs) are often more sensitive at detecting remote homologies than commonly used approaches. However, calculating similarity scores in profile HMM methods is computationally intensive because they use dynamic programming algorithms. In this paper we introduce SHsearch, a new method for remote protein homology detection. Our method is implemented as a modification of HHsearch, a remote protein homology detection method based on comparing two profile HMMs. The motivation for the modification was to reduce the run time of HHsearch significantly with minimal loss in sensitivity. SHsearch compares only the important submodels of the query and database HMMs instead of comparing the complete models. Hence, SHsearch achieves a significant speedup over HHsearch with minimal loss in sensitivity. On SCOP 1.63, SHsearch achieved an 88x speedup with an 8.2% loss in sensitivity relative to HHsearch at an error rate of 10%, which we deem an acceptable tradeoff.
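The idea of comparing only the important parts of two profiles can be illustrated with a minimal sketch. It is not the SHsearch algorithm: the abstract does not say how important submodels are chosen, so the sketch assumes that importance means high per-column information content (relative entropy against a background distribution). The function names, the toy four-letter alphabet, and the parameter `k` are all hypothetical.

```python
import math

def column_information(col, background):
    """Relative entropy (in bits) of one profile column against a background."""
    return sum(p * math.log2(p / background[a]) for a, p in col.items() if p > 0)

def select_submodel(profile, background, k):
    """Return the indices of the k most informative columns, in sequence order.

    Assumption: "important" is taken to mean high information content;
    the abstract does not define the actual selection criterion.
    """
    ranked = sorted(
        range(len(profile)),
        key=lambda i: column_information(profile[i], background),
        reverse=True,
    )
    return sorted(ranked[:k])

# Toy profile over a four-letter alphabet (illustrative values only).
background = {a: 0.25 for a in "ACGT"}
profile = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative column
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},  # strongly conserved
    {"A": 0.40, "C": 0.30, "G": 0.20, "T": 0.10},  # weakly conserved
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # strongly conserved
]

kept = select_submodel(profile, background, k=2)

# A pairwise dynamic-programming comparison fills one cell per column pair,
# so shrinking both models from L columns to k columns cuts the table
# from L * L cells to k * k cells.
full_cells = len(profile) * len(profile)
reduced_cells = len(kept) * len(kept)
```

Under this assumption, the cost of a pairwise comparison falls roughly with the square of the fraction of columns kept, which is the kind of tradeoff between run time and sensitivity the abstract describes.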

Author Comment

This paper proposes a new method, based on hidden Markov models, for homology detection in distant protein and DNA sequences. The proposed method aims to reduce running time significantly by reducing model complexity, without a significant loss of retrieval accuracy.

Supplemental Information

SCOP 20 file of protein sequences

Each data point represents a protein sequence

DOI: 10.7287/peerj.preprints.27111v1/supp-1