MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure

Weina Li; Jiadong Ren

doi:10.7287/peerj.preprints.26471v1

NOT PEER-REVIEWED

"PeerJ Preprints" is a venue for early communication or feedback before peer review. Data may be preliminary.

College of Information Science and Engineering, Yanshan University, Qinhuangdao, China

Published: 2018-01-30
Accepted: 2018-01-30

Subject Areas: Computational Biology, Data Mining and Machine Learning
Keywords: sequence position table, sequence database index, biological sequence, continuous sequence pattern, sequence pattern mining

Copyright: © 2018 Li et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Li W, Ren J. 2018. MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure. PeerJ Preprints 6:e26471v1 https://doi.org/10.7287/peerj.preprints.26471v1

Abstract

Extracting meaningful patterns from biological sequence data is an important approach to discovering the regulatory rules of genes and proteins and their inheritance relationships. Existing sequence pattern discovery algorithms, such as MSPM and FBSB, suffer from low efficiency and accuracy. To address this issue, this paper presents MpBsmi, a new algorithm for biological sequence pattern mining based on a data index structure. MpBsmi employs a sequence position table (ST) and a sequence database index structure (DB-Index) for data storage, mining and pattern expansion. The ST and DB-Index of single items are first obtained by scanning the sequence database once. A new fast support counting algorithm is then applied to the ST to identify the frequent single items. Based on a recursive connection strategy, the frequent patterns are expanded, and the ST of each expanded pattern is updated by scanning the DB-Index. The fast support counting algorithm is applied again to obtain the frequent expanded patterns. Finally, a new pruning technique is applied to the expanded patterns to avoid generating an unnecessarily large number of candidate patterns. Experimental results on multiple classical protein sequences from the Pfam database validate the performance of the proposed algorithm, including its accuracy, stability and scalability. The results show that the proposed algorithm achieves better space efficiency, stability and scalability than MSPM and FBSB, the two main algorithms for biological sequence mining.
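
The general idea of mining continuous patterns from position lists can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the layout of the ST or the DB-Index, so the sketch assumes the position table is a mapping from each pattern to its (sequence id, start offset) occurrences, and it uses the raw sequence list as a stand-in for the index. The function names `build_position_table`, `support` and `mine_continuous_patterns` are hypothetical.

```python
from collections import defaultdict


def build_position_table(sequences):
    """One scan of the database: map each single item to its occurrences.

    Assumed layout (not taken from the paper): item -> [(sequence id, offset)].
    """
    table = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for pos, item in enumerate(seq):
            table[item].append((sid, pos))
    return table


def support(occurrences):
    """Support is the number of distinct sequences that contain the pattern."""
    return len({sid for sid, _ in occurrences})


def mine_continuous_patterns(sequences, min_support):
    """Grow frequent contiguous patterns one item at a time.

    Only frequent patterns are extended, so infrequent candidates are
    pruned before they can generate longer candidates.
    """
    table = build_position_table(sequences)
    frequent = {}
    # Start from the frequent single items.
    frontier = {p: occ for p, occ in table.items() if support(occ) >= min_support}
    while frontier:
        frequent.update({p: support(occ) for p, occ in frontier.items()})
        next_frontier = {}
        for pattern, occ in frontier.items():
            # Extend each occurrence by the item that immediately follows it.
            extended = defaultdict(list)
            for sid, pos in occ:
                end = pos + len(pattern)
                if end < len(sequences[sid]):
                    extended[pattern + sequences[sid][end]].append((sid, pos))
            # Keep only the extensions that are still frequent.
            for candidate, cand_occ in extended.items():
                if support(cand_occ) >= min_support:
                    next_frontier[candidate] = cand_occ
        frontier = next_frontier
    return frequent
```

Calling `mine_continuous_patterns(["MKVLA", "AKVLG", "MKVGA"], 2)` returns a dictionary from each frequent contiguous pattern to its support. Because an extension can never be more frequent than the pattern it extends, discarding infrequent patterns at each level keeps the candidate set small.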

Author Comment

This is a submission to PeerJ for review.

Supplemental Information

raw

DOI: 10.7287/peerj.preprints.26471v1/supp-1

code

DOI: 10.7287/peerj.preprints.26471v1/supp-2
