Prediction of amyloidogenicity based on the n-gram analysis

Department of Genomics, University of Wrocław, Wrocław, Poland
Faculty of Pure and Applied Mathematics, Wrocław University of Science and Technology, Wrocław, Poland
Institute of Biotechnology, Brandenburg University of Technology Cottbus-Senftenberg, Senftenberg, Germany
Department of Microbiology, Wrocław Medical University, Wrocław, Poland
Department of Biomedical Engineering, Wrocław University of Science and Technology, Wrocław, Poland
DOI
10.7287/peerj.preprints.2390v1
Subject Areas
Bioinformatics, Computational Biology
Keywords
n-gram, amyloid, random forest, prediction, feature selection
Copyright
© 2016 Burdukiewicz et al.
Licence
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
Cite this article
Burdukiewicz MJ, Sobczyk P, Rödiger S, Duda-Madej A, Mackiewicz P, Kotulska M. 2016. Prediction of amyloidogenicity based on the n-gram analysis. PeerJ Preprints 4:e2390v1

Abstract

Amyloids are proteins associated with a number of clinical disorders (e.g., Alzheimer's, Creutzfeldt-Jakob and Huntington's diseases). Despite their diversity, all amyloid proteins can undergo aggregation initiated by 6- to 15-residue segments called hot spots. To find the patterns that define hot spots, we trained predictors of amyloidogenicity using n-grams and random forest classifiers on data collected in the AmyLoad database. Only the most informative n-grams, selected by our Quick Permutation Test, were considered. Since amyloidogenicity may depend not on the exact amino acid sequence but on more general properties of amino acids, we tested 524,284 reduced amino acid alphabets of different lengths (three to six letters) to find the alphabet providing the best performance in cross-validation. The predictor based on this alphabet, called AmyloGram, was benchmarked against the most popular tools for the detection of amyloid peptides on an external data set and obtained the highest values of the performance measures (AUC: 0.90, MCC: 0.63). Our results revealed sequential patterns in amyloids that are strongly correlated with hydrophobicity, a tendency to form β-sheets, and rigidity of amino acid residues. Among the most informative n-grams of AmyloGram, we identified 15 that had already been confirmed experimentally. AmyloGram is available as a web server: www.smorfland.uni.wroc.pl/amylogram/. The code and results are publicly available at: www.github.com/michbur/prediction_amyloidogenicity_ngram/.
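
To illustrate the idea of n-grams over a reduced amino acid alphabet, the following minimal Python sketch maps a peptide onto a small set of residue groups and then counts contiguous n-grams in the encoded sequence. The six-group alphabet, the function names and the example peptide are assumptions made for illustration only; this is not the alphabet selected for AmyloGram, and the sketch does not reproduce the published pipeline, which is available in the repository above.

```python
from collections import Counter

# Hypothetical six-letter reduced alphabet (illustrative assumption,
# NOT the alphabet selected for AmyloGram). Each group label stands
# for a set of amino acids treated as interchangeable.
GROUPS = {
    "1": "G",
    "2": "KPR",
    "3": "ILV",
    "4": "FWY",
    "5": "ACHM",
    "6": "DENQST",
}

# Invert the grouping: amino acid (one-letter code) -> group label.
AA_TO_GROUP = {aa: label for label, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Encode a peptide in the reduced alphabet, residue by residue."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper())


def count_ngrams(encoded, n):
    """Count contiguous n-grams (substrings of length n) in an encoded peptide."""
    return Counter(encoded[i:i + n] for i in range(len(encoded) - n + 1))


def binary_features(seq, n):
    """Presence/absence features: 1 for every n-gram found in the peptide."""
    return {gram: 1 for gram in count_ngrams(reduce_sequence(seq), n)}


if __name__ == "__main__":
    peptide = "VQIVYK"  # arbitrary example hexapeptide
    encoded = reduce_sequence(peptide)
    print(encoded)
    print(sorted(binary_features(peptide, 2)))
```

Feature dictionaries of this kind, computed for every peptide in a training set, could then be filtered for informativeness and passed to a classifier such as a random forest.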

Author Comment

This article has been accepted for the "GCB 2016 Conference".