Prediction of amyloidogenicity based on the n-gram analysis

doi:10.7287/peerj.preprints.2390v1

Michał Jan Burdukiewicz¹, Piotr Sobczyk², Stefan Rödiger³, Anna Duda-Madej⁴, Paweł Mackiewicz¹, Małgorzata Kotulska⁵

1 Department of Genomics, University of Wrocław, Wrocław, Poland

2 Faculty of Pure and Applied Mathematics, Wrocław University of Science and Technology, Wrocław, Poland

3 Institute of Biotechnology, Brandenburg University of Technology Cottbus-Senftenberg, Senftenberg, Germany

4 Department of Microbiology, Wrocław Medical University, Wrocław, Poland

5 Department of Biomedical Engineering, Wrocław University of Science and Technology, Wrocław, Poland

Published: 2016-08-24
Accepted: 2016-08-24

Subject Areas: Bioinformatics, Computational Biology
Keywords: n-gram, amyloid, random forest, prediction, feature selection

Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.

Cite this article: Burdukiewicz MJ, Sobczyk P, Rödiger S, Duda-Madej A, Mackiewicz P, Kotulska M. 2016. Prediction of amyloidogenicity based on the n-gram analysis. PeerJ Preprints 4:e2390v1 https://doi.org/10.7287/peerj.preprints.2390v1

Abstract

Amyloids are proteins associated with a number of clinical disorders (e.g., Alzheimer's, Creutzfeldt-Jakob and Huntington's diseases). Despite their diversity, all amyloid proteins can undergo aggregation initiated by 6- to 15-residue segments called hot spots. To find the patterns defining hot spots, we trained predictors of amyloidogenicity using n-grams and random forest classifiers on data collected in the AmyLoad database. Only the most informative n-grams, selected by our Quick Permutation Test, were considered. Since amyloidogenicity may depend not on the exact sequence of amino acids but on their more general properties, we tested 524,284 reduced amino acid alphabets of different lengths (three to six letters) to find the alphabet providing the best performance in cross-validation. The predictor based on this alphabet, called AmyloGram, was benchmarked on an external data set against the most popular tools for the detection of amyloid peptides and obtained the highest values of performance measures (AUC: 0.90, MCC: 0.63). Our results revealed sequential patterns in amyloids that are strongly correlated with hydrophobicity, a tendency to form β-sheets and rigidity of amino acid residues. Among the most informative n-grams of AmyloGram, we identified 15 that had already been confirmed experimentally. AmyloGram is available as a web server: www.smorfland.uni.wroc.pl/amylogram/. The code and results are publicly available at: www.github.com/michbur/prediction_amyloidogenicity_ngram/.
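
To make the idea of n-grams over a reduced amino acid alphabet more concrete, the following is a minimal Python sketch, not the AmyloGram implementation. The four-group alphabet `EXAMPLE_GROUPS`, the helper names and the example peptide string are all assumptions chosen for illustration; the grouping is not the alphabet selected in the study, and only contiguous n-grams are counted.

```python
from collections import Counter

# Hypothetical 4-letter reduced alphabet covering the 20 standard residues.
# Illustrative grouping only; NOT the alphabet reported for AmyloGram.
EXAMPLE_GROUPS = {
    "a": "ILVFWYMC",  # mostly hydrophobic residues
    "b": "AGPST",     # small residues
    "c": "DENQ",      # acidic residues and their amides
    "d": "KRH",       # basic residues
}


def build_lookup(groups):
    """Invert {letter: residues} into a {residue: letter} lookup table."""
    return {aa: letter for letter, members in groups.items() for aa in members}


def reduce_sequence(seq, lookup):
    """Re-encode a peptide in the reduced alphabet, residue by residue."""
    return "".join(lookup[aa] for aa in seq.upper())


def count_ngrams(seq, n):
    """Count contiguous n-grams (substrings of length n) in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


lookup = build_lookup(EXAMPLE_GROUPS)
reduced = reduce_sequence("KLVFFA", lookup)  # arbitrary example hexapeptide
bigrams = count_ngrams(reduced, 2)
print(reduced, dict(bigrams))
```

In a pipeline of the kind described above, such n-gram counts (or their presence/absence) would form the feature vector passed to feature selection and then to a classifier. Grouping residues shrinks the feature space: with four letters there are only 16 possible 2-grams instead of 400.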

Author Comment

This article has been accepted for the "GCB 2016 Conference".