Multi-level machine learning prediction of protein–protein interactions in Saccharomyces cerevisiae

PeerJ


Introduction

Materials

Residue level positives and negatives

Protein level positives and negatives

  • Let G1 be a graph representing positive examples. Denote by V = {v1, …, vn} the set of its vertices. Each vertex in V represents a protein and each edge (vi, vj) represents an interaction. Let [Deg(v1), …, Deg(vn)] be a vector containing the degrees of the vertices in V. Let G2 be a graph of negative interactions. Initially, G2 has the same vertices as G1 and no edges.

  • While there exists v such that Deg(v) > 0:

    1. Find vertex v with the largest Deg(v).

    2. Find a vertex u, if one exists, such that:

      1. There is no edge (v, u) in G1.

      2. Deg(u) is as large as possible.

      3. Distance d(u, v) in G1 is as large as possible.

    3. If u exists:

      1. Add edge (u, v) to G2.

      2. Deg(v)←Deg(v) − 1

      3. Deg(u)←Deg(u) − 1

    4. else: Deg(v)←0
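The procedure above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. Several details are assumptions, because the listing leaves them open: candidates u are required to have Deg(u) > 0, criteria 2 and 3 are applied in order (largest remaining degree first, then largest distance), vertices unreachable from v are treated as infinitely distant, and a pair is never added to G2 twice. The function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances from source in the unweighted graph G1."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def sample_negatives(vertices, positive_edges):
    """Build the edge set of G2 so that vertex degrees roughly mirror G1."""
    adj = {v: set() for v in vertices}
    for a, b in positive_edges:
        adj[a].add(b)
        adj[b].add(a)
    deg = {v: len(adj[v]) for v in vertices}  # remaining degree budget
    negatives = set()

    while any(d > 0 for d in deg.values()):
        v = max(vertices, key=lambda x: deg[x])  # step 1
        dist = bfs_distances(adj, v)
        # Step 2: non-neighbours of v in G1 (assumption: Deg(u) > 0, no repeats).
        candidates = [
            u for u in vertices
            if u != v
            and u not in adj[v]
            and deg[u] > 0
            and frozenset((u, v)) not in negatives
        ]
        if candidates:  # step 3
            u = max(candidates,
                    key=lambda x: (deg[x], dist.get(x, float("inf"))))
            negatives.add(frozenset((u, v)))
            deg[v] -= 1
            deg[u] -= 1
        else:  # step 4
            deg[v] = 0
    return negatives
```

On a four-protein path a–b–c–d, this sketch pairs b with d and a with c, then zeroes the remaining budgets of b and c because no admissible partner is left.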

Train-test split

Methods

Level-I features

  • Raw sequence: the raw sequence of amino acids, encoded numerically.

  • HQI8: the sequence of amino acids encoded with High Quality Indices (Saha, Maulik & Plewczynski, 2012). These features were constructed by clustering the AAindex database (Kawashima et al., 2007). Each amino acid is described by 8 values representing its physicochemical and biochemical properties.

  • DSSP structure: the secondary structure of the protein, extracted from the PDB complex with the DSSP software. It was limited to the three basic symbols: E (β-sheet), H (α-helix), C (coil).

  • PSIPRED structure: the secondary structure of the protein, predicted from sequence with the PSIPRED software. It was limited to the three basic symbols: E (β-sheet), H (α-helix), C (coil).
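As a small illustration of the first and third feature types, the sketch below encodes a sequence as integers and reduces an 8-state DSSP string to the three symbols E, H and C. Both mappings are assumptions: the text does not say which numeric codes were used, and the 8-to-3 reduction shown here (H, G, I to H; E, B to E; everything else to C) is one commonly used convention rather than a rule confirmed by the source.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical numeric code: 1..20 for standard residues, 0 for anything else.
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

# Assumed 8-state to 3-state reduction; unlisted symbols fall back to coil.
DSSP_TO_3STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def encode_sequence(seq):
    """Map each residue letter to its integer code."""
    return [AA_TO_INT.get(aa, 0) for aa in seq.upper()]


def reduce_dssp(states):
    """Collapse a DSSP secondary-structure string to the symbols E, H, C."""
    return "".join(DSSP_TO_3STATE.get(s, "C") for s in states)
```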

Level-II features

  • the mean and variance of values over the matrix (2),

  • the sums of values in 10 best rows and 10 best columns (20),

  • the sums of values in 5 best diagonals of the original and the transposed matrix (10),

  • the sum of values on intersections of 10 best rows and 10 best columns (1),

  • the histogram of scores distributed over 10 bins (10),

  • graph features: fraction of nodes in the 3 largest connected components (3).
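Part of this feature list can be sketched with NumPy. The example assumes the matrix holds residue-level contact scores in the range [0, 1], that "best" rows and columns are those with the largest sums, and that the histogram is reported as fractions of all cells. None of these details is confirmed by the list itself. The diagonal and graph features are left out because the list does not specify them precisely enough to reproduce, so the sketch returns 33 of the 46 values. The function names are hypothetical.

```python
import numpy as np


def _pad(values, length):
    """Zero-pad a 1-D array so short matrices still give a fixed-size vector."""
    out = np.zeros(length)
    out[:len(values)] = values
    return out


def matrix_summary_features(scores, k=10, bins=10):
    """Summarise a residue-by-residue score matrix as a fixed-length vector."""
    scores = np.asarray(scores, dtype=float)
    row_sums = scores.sum(axis=1)
    col_sums = scores.sum(axis=0)
    # Indices of the k rows / columns with the largest sums (assumed "best").
    top_rows = np.argsort(row_sums)[::-1][:k]
    top_cols = np.argsort(col_sums)[::-1][:k]

    features = [scores.mean(), scores.var()]              # 2 values
    features.extend(_pad(row_sums[top_rows], k))          # k values
    features.extend(_pad(col_sums[top_cols], k))          # k values
    features.append(scores[np.ix_(top_rows, top_cols)].sum())  # 1 value
    # Histogram over an assumed score range of [0, 1], as cell fractions.
    hist, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    features.extend(hist / scores.size)                   # bins values
    return np.array(features)
```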

Evaluation of the level-II predictor

  1. Let $O = \{(p_{1,1}, p_{2,1}), (p_{1,2}, p_{2,2}), \ldots, (p_{1,n}, p_{2,n})\}$ be the set of all protein pairs.

  2. For each fold $F \subset O$:

    1. Build a set $P$ composed of all proteins occurring in $F$: $P = \{x : \exists y\; (x, y) \in F \lor (y, x) \in F\}$.

    2. Build a set $A \subset O$ composed only of pairs consisting of proteins occurring in $P$: $A = \{(x, y) : x \in P \land y \in P\}$.

    3. Build a set $B \subset O$ composed only of pairs consisting of proteins not occurring in $F$ but occurring in the testing set: $B = \{(x, y) : x \notin P \land y \notin P \land (x, y) \in O\}$.

    4. Train the classifier on B set, and test it on A.

  3. Collect all the predictions for A-sets, and calculate performance metrics.
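The split performed for a single fold can be sketched as below. This is an illustration of the set definitions above rather than the authors' code. Pairs are assumed to be tuples of protein identifiers, and the function name is hypothetical. Note that pairs with exactly one protein in P end up in neither set, so the training and test sets share no proteins.

```python
def split_for_fold(all_pairs, fold):
    """Return (B, A): training and test pairs for one fold F of the set O."""
    # P: every protein that appears in some pair of the fold.
    proteins_in_fold = {protein for pair in fold for protein in pair}
    # A: pairs of O whose proteins both occur in P.
    test_pairs = [
        (x, y) for x, y in all_pairs
        if x in proteins_in_fold and y in proteins_in_fold
    ]
    # B: pairs of O in which neither protein occurs in P.
    train_pairs = [
        (x, y) for x, y in all_pairs
        if x not in proteins_in_fold and y not in proteins_in_fold
    ]
    return train_pairs, test_pairs
```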

Protein sequence feature aggregation

  1. AAC: Amino Acid Composition (Nanni, Lumini & Brahnam, 2014). The feature set consists of the frequencies of all amino acids in the sequence.

  2. PseAAC: Pseudo Amino Acid Composition (Chou, 2001). The feature set consists of the standard AAC features with k-th tier correlation factors added. We calculate those correlations on the HQI8 indices.

  3. 2-grams (Nanni, Lumini & Brahnam, 2014). The feature set comprises the frequencies of all 400 ordered pairs of amino acids in the sequence.

  4. QRC: Quasiresidue Couples (Guo & Lin, 2005). A set of AAindex indices is chosen. For each index d, the combined values of property d for a given amino acid pair are summed over all occurrences of that pair in the full protein sequence. Occurrences are counted for pairs of residues separated from each other by 0, 1, 2, …, m residues. In effect, one obtains QRCd vectors of length 400 × m. In this model we also use the HQI8 indices.

  5. Variation of Liu’s protein pair features (Liu, 2009). The method starts by encoding each amino acid in a protein sequence with 7 chosen physicochemical properties, thus obtaining 7 feature vectors for each sequence. For each feature vector its “deviation” is calculated: $\gamma_{dj} = \frac{1}{n-d} \sum_{i=1}^{n-d} x_{ij} \times x_{(d+i),j}$, for $j = 1, \ldots, 7$ and $d = 1, \ldots, L$, where $x_{ij}$ is the value of descriptor $j$ for the amino acid at position $i$ in sequence $P$, $n$ is the length of protein sequence $P$, and $d$ is the distance between residues in the sequence. For the purpose of comparison, we tested this method with the original 7 amino acid indices used by Liu, as well as with the HQI8 features. We tested values of $L$ from 5 to 30 in a quick cross-validation experiment on our data and chose $L = 9$ as yielding the best results.
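Three of the aggregation schemes above are simple enough to sketch directly: AAC, 2-grams and the deviation used in the variation of Liu's features. The code is illustrative only. The function names are hypothetical, the normalisation of 2-gram counts by the number of adjacent positions is an assumption, and the deviation function expects a sequence that has already been encoded as an n × (number of descriptors) array with n greater than the largest distance.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq):
    """Frequency of each of the 20 standard amino acids in the sequence."""
    return np.array([seq.count(aa) / len(seq) for aa in AMINO_ACIDS])


def two_grams(seq):
    """Frequency of each ordered pair of adjacent residues."""
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - 1):
        gram = seq[i:i + 2]
        if gram in counts:  # skip pairs containing non-standard residues
            counts[gram] += 1
    total = max(len(seq) - 1, 1)  # assumed normalisation
    return np.array([counts[p] / total for p in PAIRS])


def deviation_features(encoded, max_distance):
    """Deviation for d = 1..max_distance and every descriptor column j."""
    x = np.asarray(encoded, dtype=float)  # shape (n, number of descriptors)
    n = x.shape[0]
    rows = []
    for d in range(1, max_distance + 1):
        # Mean over i of x[i, j] * x[i + d, j], with n - d terms.
        rows.append((x[:n - d] * x[d:]).sum(axis=0) / (n - d))
    return np.concatenate(rows)
```

With 7 descriptors and L = 9, `deviation_features` returns 63 values per protein sequence.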

Results and Discussion

Experimental results

Role of secondary structure

Quality of residue contacts prediction

Quality of protein interactions prediction

Conclusions

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Julian Zubek and Marcin Tatjewski performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

Adam Boniecki and Maciej Mnich analyzed the data, contributed reagents/materials/analysis tools.

Subhadip Basu conceived and designed the experiments, wrote the paper.

Dariusz Plewczynski conceived and designed the experiments, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper.

Data Deposition

The following information was supplied regarding the deposition of related data:

GitHub: http://zubekj.github.io/mlppi/.

Funding

This article is funded by the European Union from financial resources of the European Social Fund, Project PO KL “Information technologies: Research and their interdisciplinary applications”; 2014/15/B/ST6/05082 and 2013/09/B/NZ2/00121 grants from the Polish National Science Centre; COST BM1405 and BM1408 EU actions. Subhadip Basu’s research was partially supported by the FASTTRACK grant (SR/FTP/ETA-04/2012) by DST, Government of India. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
