Detect ‘protein word’ based on unsupervised word segmentation

Liang Wang; Kaiyong Zhao

doi:10.7287/peerj.preprints.1433v1

Liang Wang¹, Kaiyong Zhao²

1 NLP research lab, Sogou Tech Inc, Beijing, China

2 Department of Computer Science, Hong Kong Baptist University, Hong Kong, China

Published: 2015-10-14
Accepted: 2015-10-14

Subject Areas: Bioinformatics
Keywords: secondary structure, word segmentation, protein word

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.

Cite this article: Wang L, Zhao K. 2015. Detect ‘protein word’ based on unsupervised word segmentation. PeerJ PrePrints 3:e1433v1 https://doi.org/10.7287/peerj.preprints.1433v1

Abstract

We applied unsupervised word segmentation methods to protein sequences. Unsegmented sequences, such as ‘MTMDKSELVQKA…’, were used as input, and segmented ‘protein word’ sequences, such as ‘MTM DKSE LVQKA’, were obtained as output. We then compared the ‘protein words’ produced by unsupervised segmentation with the segments defined by protein secondary structure. An interesting finding is that unsupervised word segmentation expresses information more efficiently than secondary structure segmentation. Our experiments also suggest that there may be some ‘protein ruins’ in regions currently regarded as noncoding.
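
To make the input and output of segmentation concrete, the following is a minimal sketch of one common decoding step, assuming a unigram word model. The abstract does not say which segmentation methods were used, so this is an illustration rather than the authors' procedure. The function name `segment`, the parameters `max_len` and `unknown_prob`, and the probability table `toy_probs` are all hypothetical; in a real unsupervised setting the word probabilities would be learned from unsegmented sequences rather than written by hand.

```python
import math

def segment(sequence, word_probs, max_len=6, unknown_prob=1e-6):
    """Split `sequence` into the most probable words under a unigram model.

    `word_probs` maps candidate words to probabilities (hypothetical values here).
    Single residues missing from the table fall back to `unknown_prob`, so every
    sequence has at least one valid segmentation.
    """
    n = len(sequence)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of sequence[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            if word in word_probs:
                prob = word_probs[word]
            elif len(word) == 1:
                prob = unknown_prob
            else:
                continue  # multi-residue strings outside the table are not words
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sequence to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sequence[start:end])
        end = start
    return words[::-1]

# Made-up probabilities, chosen only so the toy example reproduces the
# segmentation quoted in the abstract.
toy_probs = {"MTM": 0.05, "DKSE": 0.04, "LVQKA": 0.03, "M": 0.01, "T": 0.01, "K": 0.01}
words = segment("MTMDKSELVQKA", toy_probs)
print(" ".join(words))  # prints: MTM DKSE LVQKA
```

The dynamic program considers every split point, so the result is the highest-scoring segmentation under the given table; the quality of the ‘protein words’ therefore depends entirely on how the probabilities are estimated.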

Author Comment

This is a preprint submission to PeerJ.