Prediction of amyloidogenicity based on the n-gram analysis
- Subject Areas
- Bioinformatics, Computational Biology
- Keywords
- n-gram, amyloid, random forest, prediction, feature selection
- Copyright
- © 2016 Burdukiewicz et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- 2016. Prediction of amyloidogenicity based on the n-gram analysis. PeerJ Preprints 4:e2390v1 https://doi.org/10.7287/peerj.preprints.2390v1
Abstract
Amyloids are proteins associated with a number of clinical disorders (e.g., Alzheimer's, Creutzfeldt-Jakob and Huntington's diseases). Despite their diversity, all amyloid proteins can undergo aggregation initiated by 6- to 15-residue segments called hot spots. To find the patterns defining the hot spots, we trained predictors of amyloidogenicity, using n-grams and random forest classifiers, on data collected in the AmyLoad database. Only the most informative n-grams, selected by our Quick Permutation Test, were considered. Since amyloidogenicity may depend not on the exact sequence of amino acids but on their more general properties, we tested 524,284 reduced amino acid alphabets of different lengths (three to six letters) to find the alphabet providing the best performance in cross-validation. The predictor based on this alphabet, called AmyloGram, was benchmarked on an external data set against the most popular tools for the detection of amyloid peptides and obtained the highest values of performance measures (AUC: 0.90, MCC: 0.63). Our results revealed sequential patterns in amyloids that are strongly correlated with hydrophobicity, a tendency to form β-sheets and rigidity of amino acid residues. Among the most informative n-grams of AmyloGram, we identified 15 that had already been confirmed experimentally. AmyloGram is available as a web server: www.smorfland.uni.wroc.pl/amylogram/. The code and results are publicly available at: www.github.com/michbur/prediction_amyloidogenicity_ngram/.
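The idea of counting n-grams over a reduced amino acid alphabet can be illustrated with a minimal Python sketch. The four-group alphabet below (`GROUPS`) and the helper names `reduce_sequence` and `count_ngrams` are assumptions made purely for illustration; they are not the alphabet selected for AmyloGram and not the authors' implementation, and only contiguous n-grams are counted here.

```python
from collections import Counter

# Hypothetical 4-letter reduced alphabet, for illustration only.
# This is NOT the alphabet selected for AmyloGram.
GROUPS = {
    "1": "AVLIMFWC",  # mostly hydrophobic residues
    "2": "STNQYGP",   # polar or small residues
    "3": "KRH",       # positively charged residues
    "4": "DE",        # negatively charged residues
}

# Invert the grouping: amino acid letter -> group label.
AA_TO_GROUP = {aa: label for label, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Translate a peptide into the reduced alphabet, residue by residue."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper())


def count_ngrams(seq, n):
    """Count contiguous n-grams (substrings of length n) in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


if __name__ == "__main__":
    peptide = "KLVFFA"  # example hexapeptide
    reduced = reduce_sequence(peptide)
    print(reduced)                   # reduced-alphabet encoding
    print(count_ngrams(reduced, 2))  # bigram counts over the encoding
```

Counts of this kind (or their presence/absence) could then serve as input features for a classifier such as a random forest; feature selection and alphabet search are outside the scope of this sketch.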
Author Comment
This article has been accepted for the GCB 2016 Conference.