Feature selection on a dataset of protein families: from exploratory data analysis to statistical variable importance
- Subject Areas
- Bioinformatics, Computational Biology
- Keywords
- protein classification, predictive statistical models, sparse PCA, protein structure features, R environment
- Copyright
- © 2016 Del Prete et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Preprints) and either DOI or URL of the article must be cited.
- Cite this article
- Del Prete et al. 2016. Feature selection on a dataset of protein families: from exploratory data analysis to statistical variable importance. PeerJ Preprints 4:e2157v1 https://doi.org/10.7287/peerj.preprints.2157v1
Abstract
Proteins are characterized by several types of features (structural, geometrical, energetic). Most of these features are expected to be similar within a protein family. We are interested in detecting which features identify the proteins belonging to a family, as well as in defining the boundaries among families. Some features are redundant: they can introduce noise when identifying which variables are essential as a fingerprint and, consequently, whether or not those variables are related to a function of the protein family. We defined an original approach to analyzing protein features in order to establish their relationships and peculiarities within protein families.

The multistep approach was performed mainly in the R environment: getting and cleaning data, exploratory data analysis, and predictive modeling for classification. Ten protein families were chosen according to their CATH classification (different architectures), with rules on the number of structures, the sequence length, and the choice of the chain. The properties investigated are secondary structures, hydrogen bonds, accessible surface areas, torsion angles, packing defects, number of charged residues, free energy of folding, volume, and salt bridges. Kernel density estimation helps in discovering unusual multimodal profiles. Pearson correlation highlights statistical links between pairs of variables, and Pearson distance provides a dendrogram with a clustering of the features. PCA clusters the protein families and detects outliers, while sparse PCA performs feature selection. Several classification algorithms were used: decision trees (classical, boosting and bagging), SVMs (flexible discriminant analysis), and centroid-based methods (nearest shrunken centroid). The focus is on variable importance estimation. A 10-fold cross-validation, repeated 10 times, was applied to the training set. Accuracy, kappa coefficient, sensitivity, and specificity were calculated for each method.

From the density plots, the percentage of mostly buried residues is significantly different for each family. The dissimilarity dendrogram shows separate clusters for secondary structures, torsion angles, defects, and geometrical features. In the feature network, torsion angles and surface variables turn out to be peripheral (i.e., redundant) with respect to the core of the graph. The PCA biplot gives a good clustering of the protein families, and sparse PCA confirms the dendrogram results. Unifying all the results, the following features are typical of our dataset: helix, strand, coil, turn, hydrogen bonds, polar and charged accessible surface area, volume, and mostly buried residues. The random forest algorithm has the best performance values.

Graphical multivariate procedures are good tools for characterizing possible fingerprints of the protein families. Predictive models for classification, together with variable importance estimation, help in performing feature selection. The work can be improved by using multivariate regression models and by increasing the number of protein families.
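The exploratory steps described above (kernel density estimation, Pearson correlation and distance, feature dendrogram, PCA and sparse PCA) could be reproduced in R along the following lines. This is only a minimal sketch: the data frame `features` (one row per protein chain, numeric feature columns plus a `family` factor), the column name `buried_residues_pct`, the average linkage, and the use of the `elasticnet` package for sparse PCA are all assumptions, since the abstract does not name the data layout or the packages used.

```r
## Hypothetical input: data frame `features` with one row per protein chain,
## numeric feature columns and a `family` factor (not the authors' actual data).
library(elasticnet)   # spca(); assumed package for sparse PCA

num <- features[, sapply(features, is.numeric)]

## Kernel density estimation: look for unusual multimodal profiles per feature
## (column name is hypothetical)
plot(density(num$buried_residues_pct), main = "Mostly buried residues (%)")

## Pearson correlation between pairs of variables and a Pearson-distance
## dendrogram clustering the features (linkage method not stated in the abstract)
r <- cor(num, method = "pearson")
d <- as.dist(1 - r)
plot(hclust(d, method = "average"), main = "Feature dissimilarity dendrogram")

## PCA biplot: clustering of the protein families and outlier detection
pca <- prcomp(num, center = TRUE, scale. = TRUE)
biplot(pca)

## Sparse PCA as feature selection: non-zero loadings flag relevant features
## (K and the number of non-zero loadings per component are arbitrary here)
spc <- spca(scale(num), K = 2, type = "predictor",
            sparse = "varnum", para = c(8, 8))
spc$loadings
```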
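The predictive modeling stage (repeated 10-fold cross-validation, classification, variable importance, and the accuracy/kappa/sensitivity/specificity metrics) could likewise be sketched with the `caret` and `randomForest` packages; this is an assumption, as the abstract only states that the analysis was performed mainly in R, and the 70/30 split below is illustrative.

```r
## Minimal sketch, assuming the same hypothetical `features` data frame as above,
## with `family` as a factor label.
library(caret)          # train(), trainControl(), varImp(), confusionMatrix()
library(randomForest)   # random forest backend for caret

set.seed(1)
in_train  <- createDataPartition(features$family, p = 0.7, list = FALSE)
train_set <- features[in_train, ]
test_set  <- features[-in_train, ]

## 10-fold cross-validation repeated 10 times on the training set
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

## Random forest (the best performer reported in the abstract); other caret
## methods such as "rpart", "treebag", "gbm", "fda" or "pam" could be plugged in
rf_fit <- train(family ~ ., data = train_set,
                method = "rf", trControl = ctrl, importance = TRUE)

## Variable importance estimation used for feature selection
varImp(rf_fit)

## Accuracy, kappa, and per-class sensitivity/specificity on held-out data
pred <- predict(rf_fit, newdata = test_set)
confusionMatrix(pred, test_set$family)
```

Swapping the `method` argument is one way the other algorithms mentioned in the abstract (classical, bagged and boosted trees, flexible discriminant analysis, nearest shrunken centroids) could be compared under the same resampling scheme; whether the authors used caret for this is not stated.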
Author Comment
This is an abstract of the presentation at the BBCC2015 conference