This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ PrePrints) and either DOI or URL of the article must be cited.
Currently, monomeric fluorescent proteins (FPs) are ideal markers for protein tagging. The prediction of oligomeric states is helpful for enhancing live biomedical imaging. Computational prediction of FP oligomeric states can accelerate protein engineering efforts to create monomeric FPs, saving time and money. To the best of our knowledge, this study presents the first computational model for predicting and analyzing FP oligomerization directly from amino acid sequences. An exhaustive dataset consisting of 397 unique FP oligomeric states was compiled from the literature. FPs were described by three classes of protein descriptors: amino acid composition, dipeptide composition and physicochemical properties. The oligomeric states of FPs were predicted using the decision tree (DT) algorithm, and the results demonstrated that DT provided robust performance, with accuracies in the ranges of 79.97-81.72% and 80.76-82.63% for the internal (e.g. 10-fold cross-validation) and external sets, respectively. This approach was also benchmarked against other common machine learning algorithms, namely artificial neural network, support vector machine and random forest. A thorough analysis of amino acid sequence features was conducted to provide insights into FP oligomerization, which may aid in engineering novel monomeric FPs. The following characteristics differentiating monomeric from oligomeric FPs were derived from the DT: (i) substitution of any amino acid with Glu led to a reduction in aggregated proteins and (ii) oligomerization of FPs appears to be stabilized by several hydrophobic contacts. Datasets and R source code are available at http://dx.doi.org/10.6084/m9.figshare.1348575.
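As an illustration of two of the sequence descriptors named above, the following is a minimal Python sketch of how amino acid composition (20 features) and dipeptide composition (400 features) are conventionally computed from a sequence. It is not the authors' implementation, which is provided in R at the link above. The function names, the handling of non-standard residues, and the toy sequence are assumptions made for this example only. Physicochemical descriptors are not covered.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_STANDARD = set(AMINO_ACIDS)


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in seq (20 values).

    Assumption: residues outside the 20 standard letters are ignored.
    """
    residues = [r for r in seq.upper() if r in _STANDARD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair in seq (400 values).

    Assumption: a pair is skipped if either residue is non-standard.
    """
    s = seq.upper()
    pairs = [
        s[i:i + 2]
        for i in range(len(s) - 1)
        if s[i] in _STANDARD and s[i + 1] in _STANDARD
    ]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    counts = Counter(pairs)
    total = len(pairs)
    return {
        a + b: counts[a + b] / total
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


if __name__ == "__main__":
    toy = "MSKGEE"  # hypothetical toy sequence, not taken from the dataset
    aac = amino_acid_composition(toy)
    dpc = dipeptide_composition(toy)
    print(len(aac), len(dpc))   # number of features in each descriptor class
    print(aac["E"], dpc["EE"])  # fraction of Glu and of the Glu-Glu dipeptide
```

Concatenating the two dictionaries' values gives a fixed-length feature vector per sequence, which is the kind of input a decision tree or any of the benchmarked classifiers can consume.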