Accurate identification of protein–protein interactions (PPI) is a key step in understanding proteins’ biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features of protein sequences; however, only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach to the prediction of protein–protein interactions. We start with the carefully filtered data on protein complexes available for
Systems biology and bioinformatics study interactions between the various biocomponents of living cells, interactions that span multiple spatial and temporal scales. The goal is to understand how complex phenomena arise from the properties of the building blocks. Proteins, specifically, are characterised at multiple scales: first, at the microscale, by their local post-translational modifications; second, by their interactions with metabolites and small chemical molecules (inhibitors); third, at the mesoscale, by the three-dimensional structure of active sites or interaction interfaces; fourth, at the macroscale, by the global 3D structure that comprises macromolecular complexes; and finally, on the time scale, by their dynamical properties related to changes in their structure or physico-chemical properties upon participating in a given biophysical process. Such a variety of scales, each linked with a different biological function, is rooted in proteins’ complex spatio-temporal network of interactions with smaller biomolecules (metabolites, ligands), proteins of comparable size, RNA molecules, and much larger DNA macrochains. Starting from reliable information on single, binary interactions, it is possible to reconstruct the whole interaction network, thereby providing further insight into proteins’ biological functions at the whole-cell level.
In this paper, we focus on protein–protein interactions. We develop an ensemble learning method for the identification of binary protein interactions by predicting residue–residue interactions. Moreover, we demonstrate how to integrate sequence information from the lower scales into a higher-scale machine learning predictor. In our approach, a binary interaction between two proteins is predicted by considering all possible interactions between their sequence segments using a level-I predictor. The output of this phase is a matrix of scores with dimensions corresponding to the proteins’ lengths. Given a threshold on the likelihoods, this matrix represents the whole network of residue contacts that could be made during complex formation. We then transform the score matrix into a fixed-length input vector suitable for further statistical analysis (aggregated values over columns, diagonals, etc.), and we identify network properties (e.g., sizes of connected components) using the interaction graph. This data is used by the level-II predictor, which integrates information similarly to a human expert.
Recently, several machine learning algorithms have been applied to predicting protein interactions. Our study takes a similar route to
Our method differs from the previous approaches in several ways. First, we treat residue-level predictions only as the input for the identification of protein-level interactions. The information flows only bottom-up: from the residue level to the protein level. In
We compare our results with PPI predictors targeted at yeast which exploit global sequence properties (i.e., without considering local residue interactions). One such method was developed by
Another sequence-based method was proposed by
Other popular protein sequence representations used for predicting protein–protein interactions include Pseudo Amino Acid Composition (
Many other algorithms solving the same problem, but operating on different principles, were developed. One of the possible examples is PIPE predictor
Here we would like to stress one of the most important outcomes of our study: the performance results reported in various publications are generally not directly comparable. They tend to differ in data collection methodology, in the definition of positives and negatives, and in evaluation procedures. In our work, we focused first on preparing a collection of high-quality interaction data and removing any bias from the subsequent evaluation procedure. To obtain a meaningful comparison between different methods, we reimplemented several schemas for aggregating features of protein sequences (including the method of
The performance scores obtained in our study are much lower than those usually reported in the literature. We claim that this is an effect of carefully balancing positives and negatives in our datasets and of the rigorous evaluation strategy in which a predictor is always tested on proteins unseen during the training phase. What we are measuring is the ability to predict real protein compatibility, not just proteins’ relative reactivity. This task is much harder, and we demonstrate that popular methods using sequence-derived features do not perform well in this context. Our result confirms previous methodological studies in this area (
We extracted three-dimensional (3D) structures of all yeast protein hetero-complexes from the Protein Data Bank (PDB) (
On the residue level, all interacting pairs of residues could be extracted from the 3D structures. We considered any two residues from two distinct proteins as interacting if they were located within a Euclidean distance of 4 Å from each other. On the protein level, we identified two proteins as an interacting pair if there was at least one residue-level microscopic interaction between them. We were interested only in heterodimers, i.e., interactions occurring between two different proteins. While predicting homodimers is also valuable, we decided to leave homodimers out and focus on heterodimers for two reasons: (a) hetero-interactions contribute substantially to our understanding and reconstruction of the true protein interaction network (PIN); (b) in our data homodimers occurred much more frequently, and there was a risk that they would dominate the heterodimers completely.
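The contact definition above can be sketched as follows. This is an illustrative sketch, not the paper’s code: it assumes one representative coordinate per residue, whereas a full implementation would compare all heavy atoms extracted from the PDB structure (e.g., via a PDB parser).

```python
import numpy as np

def contact_pairs(coords_a, coords_b, threshold=4.0):
    """Index pairs (i, j) of residues from two chains closer than `threshold` Å."""
    a = np.asarray(coords_a, dtype=float)  # (n, 3): one point per residue of chain A
    b = np.asarray(coords_b, dtype=float)  # (m, 3): one point per residue of chain B
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (n, m) distances
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(d < threshold))]

def chains_interact(coords_a, coords_b, threshold=4.0):
    """Two proteins form an interacting pair if at least one contact exists."""
    return len(contact_pairs(coords_a, coords_b, threshold)) > 0
```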
The first step of our procedure was to build a training dataset for the level-I predictor. We employed the sliding window technique to extract fragments of protein sequence; in this work we refer to this window as the extraction window.
Positive examples in the training set were formed from pairs of fragments whose central residues interact and in which, additionally, a certain number of other residues within a specified distance from the central one interact. We refer to the required number of interacting residues as the interaction threshold and to the maximal distance from the central residue as the maximal neighbour interaction distance. By introducing this restriction we deliberately focused on strong interactions, filtering out the weaker ones, which could be mere noise from the crystallisation process.
In the data preparation phase, we fixed the maximal neighbour interaction distance at 10 residues. We chose this value following the studies of
Note that the values of the maximal neighbour interaction distance and the extraction window size are not necessarily the same. One can imagine identifying residue-level positives using a larger interaction distance and then encoding features of only a few central residues using a small extraction window, and vice versa. Indeed, in our work we tested different extraction window sizes ranging from 3 to 31, independently of the fixed interaction distance.
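The extraction window itself is straightforward; a minimal sketch (the function name is ours), producing one fragment per central residue and skipping residues too close to the termini:

```python
def extract_windows(seq, window=21):
    """One fragment of length `window` per central residue; residues closer than
    window // 2 to either terminus get no fragment in this simple variant."""
    half = window // 2
    return [seq[i - half:i + half + 1] for i in range(half, len(seq) - half)]
```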
In our data extracted from the PDB we had no natural source of negative residue interactions. Pairing sequence fragments fully at random could introduce a lot of noise and false negatives. Therefore, we decided to extract non-interacting pairs of fragments only from interacting protein pairs. Fragment pairs without any interacting residues, or pairs in which one fragment has some interactions but the other has none, were considered negatives. This guaranteed that at least one of the fragments did not come from the interface region. The number of potential negatives was much larger than the number of positives, so we decided to keep the imbalance ratio at 3:1 and sampled the required number of negatives at random. This is a common practice in machine learning, since most algorithms perform poorly on datasets with large class imbalance (see
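Downsampling negatives to the 3:1 ratio can be sketched as below (function and parameter names are ours):

```python
import random

def downsample_negatives(negatives, n_positives, ratio=3, seed=0):
    """Keep at most `ratio` negatives per positive (the 3:1 ratio used above)."""
    rng = random.Random(seed)
    k = min(len(negatives), ratio * n_positives)
    return rng.sample(negatives, k)
```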
The second step of our methodology was preparing data for training the level-II predictor. We used the same dataset of interacting protein pairs from the PDB as positives and generated negative examples. Constructing high-quality negative examples is very difficult. Common methods for generating negatives include drawing random pairs of biomolecules from all known proteins found in a specific organism (
Let …
While there exist …:
    Find vertex …
    Find vertex …, such that:
        there is no edge (…),
        distance …
    If …:
        Add edge (…)
    else:
        …
Such a schema of constructing the negative protein pair set is unbiased, i.e., the protein composition of the positives and the negatives remains identical. Every single protein has the same number of positive and negative interactions. This forces the trained classifier to predict meaningful biophysical interactions rather than the general reactivity (the relative number of interactions) of a single protein. Otherwise, the best results would be achieved by a predictor which predicts that two proteins interact whenever each of them has many interactions in general, regardless of their actual compatibility. It is also important that our algorithm favours protein pairs which are remote from each other in the interaction network, which reduces the risk of introducing false negatives.
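One way to realise such a degree-matched, remoteness-preferring construction is sketched below. This is our own reading of the scheme, not the paper’s exact algorithm: a greedy pass over candidate non-edges, ordered from the most remote pairs down, that stops adding negative edges at a node once its negative degree equals its positive degree. A greedy pass may not always satisfy every degree exactly on pathological graphs.

```python
import itertools
import random
from collections import deque

def graph_distance(adj, src, dst):
    """Shortest-path length in the positive graph (None if disconnected)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        v, d = queue.popleft()
        if v == dst:
            return d
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    return None

def negative_pairs(adj, seed=0):
    """Greedy degree-matched negatives, preferring pairs remote in the graph."""
    rng = random.Random(seed)
    owed = {v: len(adj[v]) for v in adj}  # negative edges still owed per protein
    candidates = [p for p in itertools.combinations(sorted(adj), 2)
                  if p[1] not in adj[p[0]]]
    rng.shuffle(candidates)
    # Most remote pairs first; disconnected pairs count as infinitely far apart.
    candidates.sort(key=lambda p: -(graph_distance(adj, *p) or float("inf")))
    negatives = []
    for a, b in candidates:
        if owed[a] > 0 and owed[b] > 0:
            negatives.append((a, b))
            owed[a] -= 1
            owed[b] -= 1
    return negatives
```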
The last step of the data generation procedure is splitting samples between training and testing sets. In order to evaluate our method in a realistic setup, we split the benchmarking dataset at the protein level, not at the residue level. This made our goal harder compared to previous works, which often used residue-level splitting of the benchmarking dataset. A schematic depiction of the train-test split is given by
Numbers of examples used in each step are given in parentheses.
The level-I predictor was trained to recognise interacting pairs of fragments; this should be the equivalent of detecting compatible patches on the surface of a protein. During prediction, a prediction is made for each possible pair of fragments from two different proteins, and the likelihood estimates for all-against-all pairs of fragments are stored in the interaction matrix. The level-II predictor uses the output of the level-I predictor, predicting binary interactions between two proteins from the aggregated features, i.e., the complementarity between their surface patches.
We trained the level-I predictor on interacting sequence fragments of proteins from the training set and tested it on the testing set. The input for the level-I predictor consisted of pairs of sequence fragments of extraction-window length.
The following sets of features were considered:
Raw sequence—raw sequence of amino acids encoded numerically.
HQI8—sequence of amino acids encoded with High Quality Indices (
DSSP structure—secondary structure of the protein extracted from the PDB complex with the DSSP software. It was limited to the three basic symbols: E—extended strand, H—helix, C—coil.
PSIPRED structure—secondary structure of the protein predicted from sequence with the PSIPRED software. It was limited to the three basic symbols: E—extended strand, H—helix, C—coil.
As the core classifier, we evaluated two popular machine learning methods: Random Forest and Support Vector Machine. Both algorithms are commonly used in bioinformatics and are considered the best off-the-shelf classifiers (
To infer a binary interaction between two proteins, we consider all possible interactions between their sequence segments as predicted by the level-I predictor. The output of this phase is a matrix of likelihoods whose dimensions equal the two proteins’ lengths. Each prediction score is a real number between 0 and 1. Sample score matrices for a positive and a negative case are presented in
White colour corresponds to score 0.0, black colour corresponds to score 1.0.
To transform the 2D matrix into an input vector suitable for level-II predictor, we extracted the following features (numbers in parentheses denote the number of values in the final feature vector):
the mean and variance of values over the matrix (2),
the sums of values in 10 best rows and 10 best columns (20),
the sums of values in 5 best diagonals of the original and the transposed matrix (10),
the sum of values on intersections of 10 best rows and 10 best columns (1),
the histogram of scores distributed over 10 bins (10),
graph features: fraction of nodes in the 3 largest connected components (3).
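Most of these aggregations can be sketched as below; the graph features are discussed separately. Names and tie handling are our assumptions, and we read “diagonals of the transposed matrix” as anti-diagonals, since the diagonals of the literal transpose would merely duplicate those of the original matrix.

```python
import numpy as np

def matrix_features(scores, k=10, k_diag=5, bins=10):
    """Flatten a residue-contact score matrix into a fixed-length vector
    (43 values: the aggregations above, without the 3 graph features)."""
    s = np.asarray(scores, dtype=float)
    n, m = s.shape
    feats = [s.mean(), s.var()]                                  # mean, variance (2)
    row_sums, col_sums = s.sum(axis=1), s.sum(axis=0)
    feats += sorted(row_sums)[-k:] + sorted(col_sums)[-k:]       # 10 best rows/cols (20)
    diags = [s.diagonal(o).sum() for o in range(-n + 1, m)]
    anti = [s[::-1].diagonal(o).sum() for o in range(-n + 1, m)]
    feats += sorted(diags)[-k_diag:] + sorted(anti)[-k_diag:]    # 5 best each (10)
    best_r = np.argsort(row_sums)[-k:]
    best_c = np.argsort(col_sums)[-k:]
    feats.append(float(s[np.ix_(best_r, best_c)].sum()))         # intersections (1)
    hist, _ = np.histogram(s, bins=bins, range=(0.0, 1.0))
    feats += (hist / s.size).tolist()                            # score histogram (10)
    return np.array(feats)
```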
The graph features require further explanation. Predicted contacts between residues were represented as a bipartite graph in which nodes represent residues and edges represent predicted contacts. To make the graph more biologically realistic, for each node we kept only the 3 strongest edges. We set this threshold (3) following the observation that, in our PDB structures, the mean number of interactions of a single interacting residue is between 2 and 3. In the trimmed graph we calculated the fractions of nodes contained in the 3 largest connected components and appended these values to the feature vector.
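A sketch of the graph features under the stated trimming rule; the exact tie-breaking and whether trimming is applied per row or symmetrically are our assumptions:

```python
from collections import defaultdict

def graph_features(scores, top_edges=3, n_components=3):
    """Fractions of nodes in the largest connected components of the trimmed
    bipartite contact graph (rows = chain A residues, columns = chain B)."""
    n, m = len(scores), len(scores[0])
    adj = defaultdict(set)
    for i in range(n):
        # Keep only the `top_edges` strongest predicted contacts of residue i.
        strongest = sorted(range(m), key=lambda j: scores[i][j])[-top_edges:]
        for j in strongest:
            if scores[i][j] > 0:
                adj[("a", i)].add(("b", j))
                adj[("b", j)].add(("a", i))
    seen, sizes = set(), []
    for v in list(adj):                    # DFS over each connected component
        if v in seen:
            continue
        stack, size = [v], 0
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            size += 1
            stack.extend(adj[u])
        sizes.append(size)
    sizes = sorted(sizes, reverse=True)[:n_components] + [0] * n_components
    return [s / (n + m) for s in sizes[:n_components]]
```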
The performance of the level-II predictor was evaluated through a variant of stratified 30-fold cross-validation performed on the protein level. Each fold contained
Let …
For each fold …:
    Build a set …
    Build a set …
    Build a set …
    Train the classifier on …
    Collect all the predictions for …
The procedure described above differs from standard cross-validation in that the number of observations in the constructed test sets varies slightly; this variance is small, however, and does not influence the estimated performance. Such an evaluation schema does not allow any information leak: the datasets are always balanced, and the classifier is tested on previously unseen proteins.
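One plausible realisation of such a protein-level split is sketched below. The paper’s exact fold construction differs in details we cannot recover; here a pair is tested only when both of its proteins are held out, which also illustrates why test-set sizes vary between folds.

```python
import random

def protein_level_folds(pairs, proteins, n_folds=30, seed=0):
    """Yield (train, test) lists of protein pairs; a pair is tested only when
    BOTH of its proteins are held out, so no protein crosses the split."""
    prots = sorted(proteins)
    random.Random(seed).shuffle(prots)
    for f in range(n_folds):
        held_out = set(prots[f::n_folds])
        train = [(a, b) for a, b in pairs
                 if a not in held_out and b not in held_out]
        test = [(a, b) for a, b in pairs if a in held_out and b in held_out]
        # Pairs mixing a held-out and a training protein are dropped from this
        # fold, which is why the test-set sizes vary slightly across folds.
        yield train, test
```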
As classification methods for level-II predictor we used Random Forest and Support Vector Machine with parameters tuned through a grid search.
We compared our ensemble method with various sequence feature aggregation schemas that are commonly applied in machine learning predictors of proteins interactions. To make the benchmarking results comparable between different algorithms, we used the same classification method (Random Forest) and evaluation procedure (modified 30-fold cross-validation on the testing set) as for level-II predictor. We benchmarked the following feature aggregation schemas:
AAC—Amino Acid Composition (
PseAAC—Pseudo Amino Acid Composition (
2-grams (
QRC—Quasiresidue Couples (
Variation of Liu’s protein pair features (
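For illustration, minimal sketches of two of these baselines, AAC and 2-grams, computed on a single sequence; a protein pair is then typically represented by concatenating the vectors of the two sequences. Function names are ours.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Amino Acid Composition: frequency of each of the 20 residue types."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def two_grams(seq):
    """Normalised counts of all 400 residue 2-grams, in a fixed order."""
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)
    return [c / total for c in counts.values()]
```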
We carefully evaluated all subsequent steps of our method to choose optimal features and parameter values. Then we compared the performance of the level-II predictor with popular sequence encoding schemas. In our experiments we adhered to the rule that all information available in PDB complexes may be used during classifier training, but only sequence-derived information is allowed in the evaluation phase. This demonstrates that our method can be employed successfully when only protein sequences are known.
The first task was to decide on the set of optimal features for the level-I predictor. We chose ROC AUC as the performance metric. To obtain a complete picture, we measured both the performance of the level-I predictor and the performance of the level-II predictor based on it. We tested two sources of secondary structure information—DSSP and PSIPRED—separately, but the evaluation was done using the PSIPRED secondary structure (results on DSSP are given in parentheses for completeness). Initial experimentation was done with Random Forest; we then tested whether anything could be gained by replacing it with a carefully tuned SVM.
The interaction threshold was set to 15 and the extraction window size to 21. A+B denotes a feature vector constructed by concatenating two sets of features. RF—Random Forest, 300 trees, maximum tree depth 15; SVM—Support Vector Machine, RBF kernel,
Classifier | Features | Lvl-I AUC | Lvl-II AUC
---|---|---|---
RF | Raw sequence | 0.64 | 0.59
RF | HQI8 | 0.70 | 0.59
RF | PSIPRED structure | 0.67 | 0.63
RF | PSIPRED structure + Sequence | 0.69 | 0.60
RF | PSIPRED structure + HQI8 | 0.72 | 0.56
RF | DSSP structure | 0.72 (0.87) | 0.70
RF | DSSP structure + Sequence | 0.73 (0.87) | 0.65
RF | DSSP structure + HQI8 | 0.74 (0.85) | 0.64
SVM | DSSP structure | 0.59 (0.84) | 0.57
Evaluation results of the predictor on different features for 5 interaction thresholds are presented in
DSSP-extracted secondary structure was used for constructing the feature vector. The extraction window size was set to 21. For both level-I and level-II, a Random Forest with 300 trees and maximum tree depth 7 was used. Main scores were calculated for the PSIPRED-predicted secondary structure; values in parentheses concern the DSSP secondary structure.
Threshold | Lvl-I AUC | Lvl-II AUC
---|---|---
0 | 0.67 (0.84) | 0.67
5 | 0.67 (0.85) | 0.67
10 | 0.69 (0.86) | 0.68
15 | 0.72 (0.87) | 0.70
20 | 0.75 (0.88) | 0.64
Random Forest was used as the classifier.
The final decision was selecting the optimal extraction window size.
After fixing the parameters, we wanted to choose a classification algorithm for the level-II predictor and compare the performance of our multi-level representation with representations based on the aggregated protein sequence. Results are presented in
Clf | Features | Accuracy | Precision | Recall | AUC
---|---|---|---|---|---
SVM | Lvl-II pred ( | 0.55 | 0.58 | 0.55 | 0.57
SVM | AAC | 0.54 | 0.56 | 0.66 | 0.54
SVM | PseAAC | 0.54 | 0.55 | 0.61 | 0.55
SVM | 2grams | 0.55 | 0.56 | 0.64 | 0.55
SVM | QRC | 0.51 | 0.53 | 0.59 | 0.53
SVM | Liu’s dev (HQI8) | 0.55 | 0.57 | 0.60 | 0.56
SVM | Liu’s dev (original) | 0.55 | 0.57 | 0.60 | 0.56
RF | Lvl-II pred ( | | | |
RF | AAC | 0.54 | 0.57 | 0.54 | 0.56
RF | PseAAC | 0.53 | 0.55 | 0.52 | 0.55
RF | 2grams | 0.53 | 0.56 | 0.49 | 0.55
RF | QRC | 0.50 | 0.52 | 0.43 | 0.51
RF | Liu’s dev (HQI8) | 0.55 | 0.58 | 0.55 | 0.60
RF | Liu’s dev (original) | 0.56 | 0.59 | 0.57 | 0.59
Our results stress the link between secondary structure and residue contacts: compatibility of structural motifs is important for contact formation. To gain more insight, we extracted feature importances from the trained Random Forest model of the level-I predictor. The relative importance of an attribute in a decision tree is the expected fraction of examples split by nodes based on this attribute; in a Random Forest this value is averaged over all trees. Relative feature importances for the input vector constituted by the secondary structure annotation are given by
To understand what happens at these positions, we calculated the frequency of particular secondary structure patterns occurring between the residues of interacting and non-interacting fragments. The calculated frequencies are presented in
Pattern | Interacting | Non-interacting
---|---|---
 | 0.13 | 0.09
 | 0.04 | 0.05
 | 0.16 | 0.17
 | 0.06 | 0.03
 | 0.08 | 0.09
 | 0.23 | 0.30
 | 0.17 | 0.08
 | 0.03 | 0.04
 | 0.13 | 0.17
 | 0.14 | 0.03
 | 0.09 | 0.09
 | 0.18 | 0.31
 | 0.15 | 0.08
 | 0.06 | 0.05
 | 0.18 | 0.18
 | 0.05 | 0.03
 | 0.08 | 0.10
 | 0.21 | 0.31
From our experiments it is clear that using the real secondary structure for classifier training is preferable to using the predicted secondary structure, even if only the predicted structure will be available later. This limits the noise introduced during the training phase, allowing the predictor to focus on the important patterns.
What is surprising at first glance is that encoding both secondary structure and protein sequence in a single feature vector improves the prediction on level-I but leads to worse prediction quality on level-II. The protein sequence does contain more information than the secondary structure alone but, on the other hand, including more attributes increases the risk of overfitting. This is a case of the so-called bias–variance trade-off, where increasing the model complexity potentially decreases classifier bias but at the same time increases classifier variance (
We may speculate that this phenomenon stems directly from the hierarchical nature of biological systems. Biological processes have to be robust and predictable, harnessing the dynamics of physical molecules in a constructive way. It is easier to build the upper layers of a complex system from constrained and standardised building blocks. Secondary structure motifs are such stable building blocks, more constrained than the protein sequence. In that sense, secondary structure may be seen as a useful data compression, optimised evolutionarily to perform certain functions. Our classifier performs better with a more compact, specialised representation than with the full sequence information, which requires too many free parameters to estimate.
In our work we focused on predicting binary protein interactions, and the constructed residue-level predictor served only to generate data for the level-II predictor. We optimised level-II prediction quality and demonstrated that optimising level-I alone could lead to overfitting on level-II. In fact, for threshold values greater than 0, our level-I predictor was no longer trained to identify single residue contacts; it focused on discovering only the strongly interacting fragments important for inter-protein interactions. To assess the importance of this difference in task formulation, we calculated the ROC AUC score for the level-I predictor not on our filtered segment-pair data but on the real contact matrices of the test proteins. The predictor obtained 0.65 AUC—a score lower than on the filtered data and lower than the level-II predictor’s score. This means that the level-I predictor produces useful information for predicting protein–protein interactions but is not necessarily a good predictor of specific residue contacts.
After visually examining the predicted contact maps, we realised that they consist of characteristic horizontal and vertical lines. This means that certain sequence fragments are predicted to be very active and likely to interact, while others are classified as inactive. We considered the possibility that our predictor focuses on predicting the solvent exposure of the fragments. To verify this hypothesis we calculated the relative solvent accessibility of each residue (
It is possible that our level-I predictor reflects more specific protein properties, such as the location of active sites. Improving our method would require an in-depth analysis of these matters. Since solvent accessibility is not correlated with the current predictions, perhaps this information could be included to further improve accuracy.
We draw the reader’s attention to the fact that the performance of popular protein representation strategies evaluated on our data was generally much lower than the results reported in the literature. One reason may be the relatively small size of our dataset—it might not contain enough examples for a classifier to learn complex patterns. Another explanation is the way we constructed positives and negatives: in our case every protein occurred in the same number of positive and negative pairs. Moreover, we performed cross-validation with splits at the protein level, which means that no single protein occurred simultaneously in the training and testing sets. In such conditions, any method which predicts general protein reactivity well but does not consider actual compatibility performs poorly. This observation is consistent with results obtained by
Even though our evaluation procedure was carefully designed to reduce certain biases, we still do not expect it to reflect the reality of protein–protein interaction prediction perfectly. The most important issue is the proportion of positives to negatives in our datasets. On the protein level, we used balanced sets with the same number of positives and negatives; in reality, the number of negatives, i.e., pairs of non-interacting proteins, is much larger than the number of positives.
Classifier performance is calculated from a confusion matrix consisting of the following terms: TP—true positives (positives classified as positives), TN—true negatives (negatives classified as negatives), FP—false positives (negatives classified as positives), FN—false negatives (positives classified as negatives). The correction involves multiplying terms based on negative examples in the data set—TN and FP—by constant
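The correction can be sketched as follows; here `c` stands for the (elided) constant, i.e., the assumed real-world negatives-to-positives ratio, and the function name is ours:

```python
def corrected_metrics(tp, tn, fp, fn, c):
    """Rescale the negative-derived confusion-matrix terms (TN, FP) by c to
    emulate a realistic class imbalance, then recompute the metrics."""
    tn, fp = c * tn, c * fp
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Note that recall is unaffected by the correction (it depends only on positives), while precision drops as the assumed share of negatives grows.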
In the light of these results, we believe that current methods for predicting protein–protein interactions from sequences, including ours, are not mature enough for large-scale application. The goal of reconstructing the full protein interaction network is currently beyond reach. However, the developed predictors could still be applied in a variety of contexts. One possible application is the initial screening of candidates for more costly
In this work we presented a method for constructing a multi-level classifier of protein–protein interactions. We demonstrated that information present at the lower level can be successfully propagated to the upper level to make reasonable predictions. No features other than the protein secondary structure predicted from sequence were required.
Our goal, predicting the actual compatibility between two proteins regardless of their relative reactivity, forced us to collect high-quality data and develop a rigorous evaluation procedure. We took the properties of the protein interaction network into account to construct balanced negatives. During the evaluation we carefully separated training and testing proteins to avoid information leaks. We demonstrated that under such conditions our method works better than popular sequence feature aggregation schemas.
There is still much room for further improvement in classification accuracy. We plan to include additional features, both at the residue level and at the protein level, to see if our model can benefit from them. Another direction we want to explore is expanding the model to include proteins from organisms other than yeast and evaluating it on bigger datasets.
We hope that our work will inspire further discussion of evaluation strategies for protein interaction predictors. We believe that a deeper understanding of these matters would allow comparing different methods in a more systematic manner, which would benefit research in this area.
The authors declare there are no competing interests.
The following information was supplied regarding the deposition of related data:
Github: