Automatic single- and multi-label enzymatic function prediction by machine learning
 Academic Editor
 Alfonso Valencia
 Subject Areas
 Bioinformatics, Computational Biology, Genomics, Computational Science
 Keywords
 Enzyme classification, Single-label, Multi-label, Structural information, Amino acid sequence, Smith-Waterman algorithm
 Copyright
 © 2017 Amidi et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
 Cite this article
 2017. Automatic single- and multi-label enzymatic function prediction by machine learning. PeerJ 5:e3095 https://doi.org/10.7717/peerj.3095
Abstract
The number of protein structures in the PDB database has increased more than 15-fold since 1999. The creation of computational models that predict enzymatic function is of major importance, since such models provide the means to better understand the behavior of newly discovered enzymes when catalyzing chemical reactions. Until now, single-label classification has been widely used for predicting enzymatic function, which limits the application to enzymes performing unique reactions and introduces errors when multi-functional enzymes are examined. Indeed, some enzymes may perform different reactions and can hence be directly associated with multiple enzymatic functions. In the present work, we propose a multi-label enzymatic function classification scheme that combines structural and amino acid sequence information. We investigate two fusion approaches (at the feature level and at the decision level) and assess the methodology for general enzymatic function prediction, indicated by the first digit of the enzyme commission (EC) code (six main classes), on 40,034 enzymes from the PDB database. The proposed single-label and multi-label models correctly predict the actual functional activities in 97.8% and 95.5% (based on Hamming-loss) of the cases, respectively. In addition, the multi-label model predicts all possible enzymatic reactions in 85.4% of the multi-labeled enzymes when the number of reactions is unknown. Code and datasets are available at https://figshare.com/s/a63e0bafa9b71fc7cbd7.
Introduction
The ever-growing PDB database contains more than 110,000 proteins that are characterized by different properties, including their structure, biological function, chemical composition, and solubility in solvents. Protein classification is important since it allows the properties of novel proteins to be estimated according to the group to which they are predicted to belong. Enzymes are proteins that are classified, according to the chemical reactions they catalyze, into six primary classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. The classes are denoted by the enzyme commission (EC) number (NC-IUBMB, 1992) and have been determined based on experimental evidence. Systematic annotation, reliability, and reproducibility of protein functions are discussed in Valencia (2005). Classification of enzymes is a central issue because it helps in understanding enzymatic behavior during chemical reactions. While the vast majority of enzymes have been found to perform particular reactions, a non-negligible number of enzymes can perform different reactions and can hence be directly associated with multiple enzymatic functions (Guyon et al., 2006).
During the last decade, various machine learning techniques have been proposed for both single-label and multi-label enzyme classification on different datasets. Among single-label classification studies, Dobson & Doig (2005) used only structural information and achieved an accuracy of 35% for the top-ranked prediction using a support vector machine (SVM) with a one-against-one voting scheme on 498 enzymes from the PDB database. Mohammed & Guda (2015) applied SVM to sequence features and achieved an accuracy of 98.39% after training on 150,000+ enzymes with 10-fold cross-validation. Osman & Choong-Yeun Liong (2010) extracted only gene or amino acid sequence information and applied neural networks, obtaining an accuracy of 72.94% after training the networks on 1,200 enzymes from the PDB database and testing on 2,000 others. Volpato, Adelfio & Pollastri (2013) achieved 96% accuracy with a 10-fold cross-validation scheme on 6,081 entries of the ENZYME database. Sequence structure and amino acid information were also used by des Jardins et al. (1997), Kumar & Choudhary (2012) and Lee et al. (2007), who obtained testing accuracies ranging from 74% to 88.2% using the Swiss-Prot database. A combination of sequence, structure, and chemical properties of enzymes was explored by Borgwardt et al. (2005), who used kernel methods and SVM on the BRENDA database and achieved an accuracy of 93% with six-fold cross-validation on information extracted through protein graph models. Multi-label classification using different methods such as RAkEL-RF and MLKNN (Wang et al., 2014) or MULAN (Zou et al., 2013) was performed on single- and multi-labeled enzymes. In particular, the latter was assessed on enzymes from the Swiss-Prot database based on their amino acid composition and their physicochemical properties, and involved the use of position-specific scoring matrices. In the best scenario, a macro-averaged precision of 99.31% was obtained on a set of 2,840 multi-functional enzymes after 10-fold cross-validation. A summary of other alignment-free methods used to predict enzyme classes is presented in Table 1.
No. proteins  Information  Parameters  Classification method  Level  Work  

1,371  3D structure  3DHINT potential  LDA  QSAR  ANN  0–1  Concu et al. (2009c) 
4,755  Moments, entropy, electrostatic, HINT potential  MLP  Concu et al. (2009b)  
2,276  3DQSAR  Concu et al. (2009a)  
26,632  Global binding descriptors  SVM  1–3  Volkamer et al. (2013)  
211,658  Structural  GRAVY  1  Dave & Panchal (2013)  
3,095  Sequence  PseAAC, SAAC, GM  MLkNN  Zou & Xiao (2016)  
9,832  FunD, PSSM  OETkNN  1–2  Shen & Chou (2007)  
300,747  Interpro signatures  BRkNN  1–4  Ferrari et al. (2012) 
Other work on enzyme classification includes the use of information stemming from topological indices (Munteanu, Gonzalez-Diaz & Magalhaes, 2008) and peptide graphs (Concu et al., 2009b), as well as the machine-learning-based ECemble method (Mohammed & Guda, 2015).
In this paper, a new feature extraction and classification scheme is presented that combines both structural and amino acid sequence information, aiming to improve on standard classifiers that use only one type of information. Building upon previous work (Amidi et al., 2016), we investigate a more sophisticated combination approach and assess the performance of the scheme in single-label and multi-label classification tasks. State-of-the-art accuracy is observed compared to the methods reviewed in the survey by Yadav & Tiwari (2015).
Methods
Feature extraction
Proteins are chains of amino acids joined together by peptide bonds. As the three-dimensional (3D) configuration of the amino acid chain is a very good predictor of protein function, there have been many efforts to extract an appropriate representation of the 3D structure (Lie & Koehl, 2014). Since many conformations of this chain are possible because the peptide bond planes can rotate relative to each other, rotation-invariant features are preferred over features based on the Cartesian coordinates of the atoms. In this study, the two torsion angles of the polypeptide chain were used as structural features. These angles describe the rotation of the polypeptide backbone around the bonds N − C_{α} (angle φ) and C_{α} − C (angle ψ). The probability density of the torsion angles φ, ψ ∈ [−180°, 180°] was estimated by calculating the 2D sample histogram of the angles of all residues in the protein. When the protein consisted of more than one chain, the torsion angles of all chains were pooled into the same feature vector. Smoothness of the density function was achieved by moving average filtering, i.e., by convolving the 2D histogram with a uniform kernel. The range of angles was discretized using 19 × 19 bins centered at 0°, and the resulting matrix of structural features was linearized to a 361-dimensional feature vector for each enzyme representing structural information (X_{SI}).
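The construction of X_{SI} can be illustrated with a minimal Python sketch. It is not the released implementation: the function name `torsion_histogram`, the 3 × 3 kernel size, the periodic (wrap-around) padding, and the final normalization to unit sum are all assumptions made here for illustration, since the text specifies only the 19 × 19 binning and the uniform smoothing kernel.

```python
import numpy as np

def torsion_histogram(phi, psi, n_bins=19, kernel=3):
    """Smoothed 2D histogram of (phi, psi) in degrees, flattened to a vector.

    `kernel` (side of the uniform window) and the wrap-around padding are
    assumptions of this sketch; the paper does not state them.
    """
    # 19 equal-width bins over [-180, 180]; the middle bin is centered at 0.
    edges = np.linspace(-180.0, 180.0, n_bins + 1)
    hist, _, _ = np.histogram2d(phi, psi, bins=[edges, edges])

    # Moving-average filter: sum shifted copies of the (periodically padded)
    # histogram, then divide by the number of kernel cells.
    pad = kernel // 2
    padded = np.pad(hist, pad, mode="wrap")
    smooth = np.zeros_like(hist)
    for dx in range(kernel):
        for dy in range(kernel):
            smooth += padded[dx:dx + n_bins, dy:dy + n_bins]
    smooth /= kernel * kernel

    # Normalize so the vector can be read as an estimated probability density.
    total = smooth.sum()
    if total > 0:
        smooth /= total
    return smooth.ravel()  # 19 * 19 = 361 values


# Toy usage with four residues (angles in degrees).
x_si = torsion_histogram(np.array([0.0, 0.0, 90.0, -90.0]),
                         np.array([0.0, 10.0, -60.0, 150.0]))
```

For a multi-chain protein, the angle arrays of all chains would simply be concatenated before calling the function, in line with the pooling described above.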
Although structure is related to amino acid sequence, additional information can be extracted directly from the protein sequences. Assessment of similarities between amino acid sequences of enzymes is usually performed by sequence alignment. The Smith–Waterman sequence alignment algorithm (Smith & Waterman, 1981) was preferred over the Needleman–Wunsch algorithm (Needleman & Wunsch, 1970) because it assesses sequence similarity based on local alignment (in contrast to the global alignment performed by the latter), which accommodates deletions, insertions, substitutions, matches, and mismatches of arbitrary lengths. Optimizing the local alignment makes it possible to take into consideration mutations that may have occurred in the amino acid sequences. The similarity of each pair of sequences i and j can be quantified using the scoring matrix produced by the sequence alignment algorithm. For two sequences i and j, the highest score in this matrix, which reflects the success of the alignment, is used as the similarity criterion S(i, j). Amino acid sequence information is represented in two distinct ways. First, for each of the six classes, the similarities S of a sequence to all training samples of that class are calculated and summarized as a histogram vector with 10 bins. The six histogram vectors are then concatenated into a 60-dimensional feature vector, which is denoted as X_{AA}.
Second, the class probabilities (f_{AA})_{x}^{j} of a given enzyme j are expressed as the maximum similarities S within each class, normalized over all classes: $$\text{For each class } x,\quad {({f}_{\text{AA}})}_{x}^{j}=\frac{\max\limits_{\substack{k\in \text{training}\,\cap\,\text{EC}\,x\\ k\ne j}}S(k,j)}{\sum\limits_{l=1}^{6}\ \max\limits_{\substack{k\in \text{training}\,\cap\,\text{EC}\,l\\ k\ne j}}S(k,j)}$$
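The two sequence-based representations can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the simple match/mismatch scores and linear gap penalty stand in for whatever substitution matrix and gap model were actually used, and the histogram range `max_score` is a hypothetical parameter, since the text does not specify how the 10 bins are placed.

```python
import numpy as np

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Highest entry of the local-alignment scoring matrix, used as S(i, j).

    The scoring parameters are placeholders assumed for this sketch.
    """
    H = np.zeros((len(a) + 1, len(b) + 1))
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + sub,  # match or substitution
                          H[i - 1, j] + gap,      # gap in sequence b
                          H[i, j - 1] + gap)      # gap in sequence a
    return float(H.max())


def histogram_features(scores, labels, max_score, n_classes=6, n_bins=10):
    """X_AA: one 10-bin histogram of similarities per class, concatenated."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    parts = [np.histogram(scores[labels == c], bins=n_bins,
                          range=(0.0, max_score))[0]
             for c in range(1, n_classes + 1)]
    return np.concatenate(parts)  # 6 * 10 = 60 values


def class_probabilities(scores, labels, n_classes=6):
    """f_AA: per-class maximum similarity, normalized over the classes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best = np.array([scores[labels == c].max() if np.any(labels == c) else 0.0
                     for c in range(1, n_classes + 1)])
    total = best.sum()
    return best / total if total > 0 else np.full(n_classes, 1.0 / n_classes)


# Toy usage: similarities of one query enzyme to seven training enzymes
# (the query itself is assumed to be excluded, i.e., k != j).
scores = [10, 4, 6, 2, 0, 2, 8]
labels = [1, 1, 2, 3, 4, 5, 6]
x_aa = histogram_features(scores, labels, max_score=10)
f_aa = class_probabilities(scores, labels)
```

In practice, `scores` would be filled by calling `smith_waterman_score` between the query sequence and every training sequence.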
Classification and fusion
Two classification techniques have been investigated: the nearest neighbor (NN) and SVM. NN is preferred for its simplicity and its small computation time, whereas SVM is useful for finding non-linear separation boundaries. The classifiers are trained using a number of annotated examples and then tested on novel enzymes. Two types of classification models have been produced: single-label models for the enzymes performing unique reactions and multi-label models for the multi-functional enzymes. Both structural (SI) and amino acid sequence (AA) information are related to the enzymatic activity. In order to take these two properties into consideration, fusion of information is performed in two different ways: at the feature level and at the decision level.
The concept of the feature-level fusion is to concatenate the two sets of 361 structural and 60 amino acid sequence features before performing classification. The feature-level fusion approach is illustrated in Fig. 1.
The decision-level fusion approach associates class probabilities for SI obtained by SVM (Platt, 1999), f_{SI}^{SVM}, or by NN (Atiya, 2005), f_{SI}^{NN}, with class probabilities for AA (f_{AA}) through a heuristic fusion rule. The applied fusion rule performs weighted averaging of class probabilities using unequal weights. Thus, the corresponding fused class probability is given by (1 − α)(f_{SI}) + α(f_{AA}). An optimized α is empirically obtained for each classification method by maximizing the accuracy over the training data (Amidi et al., 2016). A single class is assigned in the single-label classification (Fig. 2) based on the maximum probability.
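The fusion rule and the hard single-label decision can be written compactly, as in the sketch below. The grid search in `select_alpha` is only one plausible way of obtaining α empirically; the text states that α maximizes training accuracy but does not describe the search procedure, so the grid and the function names are assumptions.

```python
import numpy as np

def fuse(f_si, f_aa, alpha):
    """Weighted average (1 - alpha) * f_SI + alpha * f_AA of class probabilities."""
    return (1.0 - alpha) * np.asarray(f_si, dtype=float) \
        + alpha * np.asarray(f_aa, dtype=float)


def predict_single_label(f_si, f_aa, alpha):
    """Assign the class (1..6) with the maximum fused probability."""
    return np.argmax(fuse(f_si, f_aa, alpha), axis=-1) + 1


def select_alpha(F_si, F_aa, y, grid=None):
    """Pick alpha on a grid by maximizing accuracy (assumed search strategy)."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    y = np.asarray(y)
    best_alpha, best_acc = 0.0, -1.0
    for alpha in grid:
        acc = float(np.mean(predict_single_label(F_si, F_aa, alpha) == y))
        if acc > best_acc:  # keep the smallest alpha reaching the best accuracy
            best_alpha, best_acc = float(alpha), acc
    return best_alpha, best_acc


# Toy usage: SI favors class 1, AA favors class 2, and the true class is 2.
F_si = np.array([[0.6, 0.4, 0.0, 0.0, 0.0, 0.0]])
F_aa = np.array([[0.1, 0.9, 0.0, 0.0, 0.0, 0.0]])
alpha, acc = select_alpha(F_si, F_aa, y=[2])
```

With α close to 1 the decision follows the sequence-based probabilities almost exclusively, and with α = 0 it reduces to the structure-only classifier.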
This hard decision rule cannot be applied to the multi-label scenario. In order to obtain a soft decision, a multi-label classifier is applied to the fused class probabilities to produce the final decision outputs. In particular, the six-dimensional class probabilities (fused from AA and SI) are fed into a multi-label SVM or multi-label NN, which computes a six-dimensional binary vector whose c-th component is equal to 1 if the enzyme is predicted to belong to class c and 0 otherwise.
Performance assessment
The data have been randomly split into 80% for training and validation and 20% for independent testing. The training/validation set has been divided into five random folds that have been used to determine the optimal parameters. More particularly, the parameters were optimized by a standard five-fold cross-validation based on classification accuracy in the single-label classification problem and on the subset accuracy in the multi-label classification problem. Upon optimization, the parameters were fixed and remained the same throughout all experiments. Then, for both classification problems, performance has been assessed by applying the methods to the independent testing set using the fixed parameters.
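The evaluation protocol can be sketched as below, assuming a plain random (non-stratified) split; whether the original split was stratified by class is not stated. The helper names and the `score_fn` callback, which would train a model on the training indices and return its validation score, are hypothetical.

```python
import numpy as np

def split_indices(n, test_fraction=0.2, n_folds=5, seed=0):
    """Random 80/20 split, with the 80% part divided into five folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    test_idx = perm[:n_test]
    folds = np.array_split(perm[n_test:], n_folds)
    return test_idx, folds


def cross_validate(folds, score_fn):
    """Average score_fn(train_idx, val_idx) over the folds.

    score_fn is a user-supplied (hypothetical) callback that fits a model
    with fixed candidate parameters and returns accuracy or subset accuracy.
    """
    scores = []
    for v in range(len(folds)):
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != v])
        scores.append(score_fn(train_idx, folds[v]))
    return float(np.mean(scores))


# Toy usage with 100 samples and a dummy scoring callback.
test_idx, folds = split_indices(100)
mean_score = cross_validate(folds, lambda tr, va: len(va) / (len(tr) + len(va)))
```

Candidate parameter values would be compared through `cross_validate`, and only the selected values would then be applied once to `test_idx`.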
Single-label classification
The performance of single-label classification has been assessed based on the confusion matrix, whose elements C(x, y) with x, y ∈ ⟦1, 6⟧ indicate the number of enzymes that belong to class x and are predicted as belonging to class y. Two metrics are based on this definition: the overall accuracy, which evaluates the proportion of correctly classified enzymes among the total number of enzymes, and the balanced accuracy, which avoids inflated performance estimates on imbalanced datasets. They are defined by: $$\text{Overall Accuracy}=\frac{\sum_{x=1}^{6}C(x,x)}{\sum_{x,y=1}^{6}C(x,y)}\quad\text{and}\quad\text{Balanced Accuracy}=\frac{1}{6}\cdot \sum_{x=1}^{6}\frac{C(x,x)}{\sum_{y=1}^{6}C(x,y)}$$
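As a small worked illustration of these two definitions, the sketch below computes both metrics from a confusion matrix whose rows are true classes and whose columns are predicted classes. The two-class toy matrix is invented for the example; the functions work for any number of classes, including the six used here.

```python
import numpy as np

def overall_accuracy(C):
    """Sum of the diagonal divided by the sum of all entries."""
    C = np.asarray(C, dtype=float)
    return float(np.trace(C) / C.sum())


def balanced_accuracy(C):
    """Mean over classes of C(x, x) divided by the row sum of class x."""
    C = np.asarray(C, dtype=float)
    return float(np.mean(np.diag(C) / C.sum(axis=1)))


# Toy imbalanced example: 100 enzymes of class 1, 10 of class 2.
C = [[90, 10],
     [5, 5]]
oa = overall_accuracy(C)   # 95 / 110
ba = balanced_accuracy(C)  # (0.9 + 0.5) / 2
```

The gap between the two values in this toy case shows why the balanced accuracy is reported alongside the overall accuracy for imbalanced class sizes.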
Multi-label classification
In the case of multi-label classification, the labels of an enzyme i are represented by a six-dimensional binary vector L_{i}, where the value 1 at a position j ∈ ⟦1, 6⟧ indicates the positivity of class j and 0 otherwise. Also, we denote by N the total number of enzymes, and by L_{i}^{true} and L_{i}^{pred} the sets of true and predicted labels of enzyme i, respectively. The performance of the multi-label classifiers cannot be assessed using the exact same definitions as for the single-label classifiers. Various multi-label metrics defined in previous works (Zhang & Zhou, 2006; Tsoumakas & Katakis, 2007; Madjarov et al., 2012) have been considered in our study. Here, we introduce the Kronecker delta δ, the symmetric difference Δ, the binary union ∪ and intersection ∩ operations, as well as the l_{1}-norm |·|. The following metrics have been chosen to assess the performance of our new method:

Hamming-loss assesses the frequency of misclassification of a classifier on a given set of enzymes. This index is averaged over all classes and all enzymes. We also denote by 1 − Hamming-loss the complement of this indicator, so that the worst-case value is 0 and the best is 1. In contrast to the Hamming-loss index, the latter assesses the average over all enzymes of the proportion of binary class memberships that are correctly predicted. $$\text{Hamming-loss}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{6}\left|{L}_{i}^{\text{pred}}\,\Delta\,{L}_{i}^{\text{true}}\right|$$
Accuracy averages over all enzymes the Jaccard similarity coefficient of the predicted and true sets of labels. This index reflects the averaged proportion of similar class membership between those two sets. $$\text{Accuracy}=\frac{1}{N}\sum_{i=1}^{N}\frac{\left|{L}_{i}^{\text{pred}}\cap {L}_{i}^{\text{true}}\right|}{\left|{L}_{i}^{\text{pred}}\cup {L}_{i}^{\text{true}}\right|}$$
Precision, recall, and F1 score, which have been adapted for multi-label classification. The first two metrics reflect, respectively, the proportion of detected positives that are truly positive and the proportion of positive samples that are correctly detected. Finally, the F1 score balances the information provided by these two indexes through the computation of a harmonic mean. $$\begin{array}{l}\text{Precision}=\frac{1}{N}\sum_{i=1}^{N}\frac{\left|{L}_{i}^{\text{pred}}\cap {L}_{i}^{\text{true}}\right|}{\left|{L}_{i}^{\text{pred}}\right|}\qquad\text{Recall}=\frac{1}{N}\sum_{i=1}^{N}\frac{\left|{L}_{i}^{\text{pred}}\cap {L}_{i}^{\text{true}}\right|}{\left|{L}_{i}^{\text{true}}\right|}\\ \text{F}1=\frac{2}{N}\sum_{i=1}^{N}\frac{\left|{L}_{i}^{\text{pred}}\cap {L}_{i}^{\text{true}}\right|}{\left|{L}_{i}^{\text{pred}}\right|+\left|{L}_{i}^{\text{true}}\right|}\end{array}$$

Subset accuracy considers that a given enzyme is correctly classified if and only if all class memberships are correctly predicted. This metric is the strictest of this study, since it requires the sets of true and predicted labels to be identical in order for an enzyme to be considered as correctly classified.
$$\text{Subset accuracy}=\frac{1}{N}\sum_{i=1}^{N}\delta\left({L}_{i}^{\text{pred}},{L}_{i}^{\text{true}}\right)$$

Macro-precision, recall, and F1 compute precision, recall, and F1 score, respectively, separately for each class and then average the values over the six classes. These indexes are crucial here, as they highlight the performance of our method on sparsely populated labels. In the following definitions, TP_{j}, FP_{j}, and FN_{j} represent the numbers of true positives, false positives, and false negatives, respectively, and Precision_{j} and Recall_{j} are the precision and recall associated with label j ∈ ⟦1, 6⟧, considered as a binary class.
$$\begin{array}{l}\text{M-precision}=\frac{1}{6}\sum_{j=1}^{6}\frac{{\text{TP}}_{j}}{{\text{TP}}_{j}+{\text{FP}}_{j}}\qquad\text{M-recall}=\frac{1}{6}\sum_{j=1}^{6}\frac{{\text{TP}}_{j}}{{\text{TP}}_{j}+{\text{FN}}_{j}}\\ \text{M-F1}=\frac{2}{6}\sum_{j=1}^{6}\frac{{\text{Precision}}_{j}\times {\text{Recall}}_{j}}{{\text{Precision}}_{j}+{\text{Recall}}_{j}}\end{array}$$

Micro-precision, recall, and F1 are similar to the single-label definitions of those three quantities, except that here they rely on the sums over all classes of true positives, false positives, and false negatives. The micro indexes indicate whether the majority of the enzymes are correctly classified, regardless of whether they belong to sparsely or densely populated classes.
$$\begin{array}{l}\text{m-precision}=\frac{\sum_{j=1}^{6}{\text{TP}}_{j}}{\sum_{j=1}^{6}{\text{TP}}_{j}+\sum_{j=1}^{6}{\text{FP}}_{j}}\qquad\text{m-recall}=\frac{\sum_{j=1}^{6}{\text{TP}}_{j}}{\sum_{j=1}^{6}{\text{TP}}_{j}+\sum_{j=1}^{6}{\text{FN}}_{j}}\\ \text{m-F1}=2\cdot \frac{\text{m-precision}\times \text{m-recall}}{\text{m-precision}+\text{m-recall}}\end{array}$$
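The multi-label metrics listed above can be computed from binary label matrices as in the following sketch, where rows are enzymes and columns are the six EC classes. The function and key names are illustrative, and treating any 0/0 ratio as 0 is a convention assumed here, since the text does not say how empty label sets or unpredicted classes are handled.

```python
import numpy as np

def _safe_div(num, den):
    """Element-wise num / den, returning 0 where den == 0 (assumed convention)."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)


def multilabel_metrics(Y_true, Y_pred):
    T = np.asarray(Y_true, dtype=bool)
    P = np.asarray(Y_pred, dtype=bool)
    n, q = T.shape                      # N enzymes, q = 6 classes

    # Example-based quantities (one value per enzyme).
    inter = (T & P).sum(axis=1)
    union = (T | P).sum(axis=1)
    n_true, n_pred = T.sum(axis=1), P.sum(axis=1)

    # Label-based counts (one value per class).
    tp = (T & P).sum(axis=0)
    fp = (~T & P).sum(axis=0)
    fn = (T & ~P).sum(axis=0)
    prec_j = _safe_div(tp, tp + fp)
    rec_j = _safe_div(tp, tp + fn)

    tp_s, fp_s, fn_s = tp.sum(), fp.sum(), fn.sum()
    m_prec = tp_s / (tp_s + fp_s) if tp_s + fp_s else 0.0
    m_rec = tp_s / (tp_s + fn_s) if tp_s + fn_s else 0.0

    return {
        "hamming_loss": float((T ^ P).sum() / (n * q)),
        "accuracy": float(_safe_div(inter, union).mean()),
        "precision": float(_safe_div(inter, n_pred).mean()),
        "recall": float(_safe_div(inter, n_true).mean()),
        "f1": float(2.0 * _safe_div(inter, n_pred + n_true).mean()),
        "subset_accuracy": float(np.all(T == P, axis=1).mean()),
        "macro_precision": float(prec_j.mean()),
        "macro_recall": float(rec_j.mean()),
        "macro_f1": float(2.0 * _safe_div(prec_j * rec_j, prec_j + rec_j).mean()),
        "micro_precision": float(m_prec),
        "micro_recall": float(m_rec),
        "micro_f1": float(2.0 * m_prec * m_rec / (m_prec + m_rec))
                    if m_prec + m_rec else 0.0,
    }


# Toy usage: three enzymes, one exact match, one partial match, one miss.
Y_true = [[1, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
Y_pred = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0]]
m = multilabel_metrics(Y_true, Y_pred)
```

In this toy case only the first enzyme counts toward subset accuracy, whereas the second still contributes partially to accuracy, precision, recall, and F1, which illustrates why subset accuracy is the strictest of the metrics.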
Data
The method has been applied to data from the PDB database that include one set of single-labeled enzymes (Table 2) and one set of multi-labeled enzymes (Table 3).
Class  EC 1  EC 2  EC 3  EC 4  EC 5  EC 6 

Name  Oxidoreductase  Transferase  Hydrolase  Lyase  Isomerase  Ligase 
Number  7,256  10,665  15,451  2,694  1,642  1,543 
Number of classes  EC numbers  Number of enzymes

2  1, 2  62
2  1, 3  44
2  1, 4  14
2  1, 5  2
2  2, 3  217
2  2, 4  160
2  2, 5  45
2  2, 6  15
2  3, 4  82
2  3, 5  23
2  3, 6  73
2  4, 5  28
3  1, 2, 3  1
3  1, 2, 4  7
3  1, 4, 5  6
4  1, 2, 4, 5  4
Note:
The total numbers of enzymes with 2, 3, and 4 labels are 765, 14, and 4, respectively.
Results
Single-label classification
Classification via decision-level fusion has been performed using α = 0.95 for the SVM method and α = 0.99 for the NN method. Overall and balanced accuracies obtained with each method on the testing set are detailed in Table 4.
Metric  SI (SVM)  SI (NN)  AA (NN)  Decision fusion (SVM)  Decision fusion (NN)  Feature fusion (SVM)  Feature fusion (NN)

Overall accuracy  0.830  0.828  0.976  0.977  0.978  0.942  0.878
Balanced accuracy  0.755  0.788  0.968  0.966  0.968  0.910  0.856
The decision-level fusion classification increased the overall accuracy by 0.2% compared to the best results obtained with either AA only or SI only. The balanced accuracy achieved is 96.8%, which is the same as the one achieved by NN classification using AA only. Also, SVM classification via feature-level fusion achieves 11.2% higher overall accuracy than classification via SI only, but 3.4% lower overall accuracy than NN classification on AA only. In general, classification using SVM tends to achieve better overall accuracy than NN (by 0.2% and 6.4% for SI only and feature-level fusion, respectively), whereas, except for the feature-level fusion, NN tends to achieve better balanced accuracy than SVM (by 0.2% and 3.3% for decision-level fusion and SI only, respectively).
Multi-label classification
As described in the Methods section, the optimal fusion parameter α was empirically determined for each dataset (single- or multi-functional) and fusion scheme. The optimal values are shown in Fig. 3, from which it can be seen that the values of α for the decision-level fusion in multi-label classification (α = 0.69, 0.73, 0.76, 0.80 for the SVM–NN, SVM–SVM, NN–NN, and NN–SVM methods, respectively) are approximately 20% smaller than the values obtained in single-label classification (α = 0.95 and 0.99 for the SVM and NN methods, respectively). This shows that structural information plays a more significant role in differentiating enzymatic activity in the case of multi-labeled enzymes than in single-label classification (which is mostly based on amino acid sequence information).
Figure 3 shows the subset accuracy for the testing set obtained for each approach in multi-label classification. Both SVM and NN classifiers achieved approximately 10% lower subset accuracy when using only SI than when using only AA. Combining SI and AA according to the feature-level fusion scheme leads to intermediate values of subset accuracy (between the ones achieved by SI only and AA only). However, the combination of information based on the decision-level fusion scheme increased the subset accuracy by up to 1.3% compared to the best approach using AA only. The best results were obtained with the SVM–NN classification scheme. The overlap and discrepancy in correct predictions using SI (SVM), AA (NN), and the decision-level fusion scheme with SVM–NN are illustrated in Fig. 4.
We observed that 65.6% of the enzymes in the testing set were correctly predicted by all compared approaches (SI only, AA only, and decision-level fusion). Also, out of 29 enzymes correctly predicted by AA but not by SI, 28 are also correctly predicted by the SVM–NN decision-level fusion scheme. This shows that the decision-level fusion incorporates the relevant information provided by AA, which was missed by SI. Conversely, out of five enzymes correctly predicted by SI and not by AA, two are correctly predicted by the decision-level fusion scheme. This could be related to the chosen values of α, which assign a larger weight to the class probabilities calculated from AA than to the ones extracted from SI.
The values of 1 − Hamming-loss for each approach are shown in Fig. 5. All decision-level fusion schemes achieved higher values than the approaches using only AA or only SI. The decision-level scheme that performed best in terms of Hamming-loss is SVM–SVM, with an improvement of 1.8% compared to AA (NN). The comparison of 1 − Hamming-loss per class for each best method (SI only, AA only, and decision-level fusion) is shown in Table 5.
Classifier (1 − Hamming-loss per class)  EC 1  EC 2  EC 3  EC 4  EC 5  EC 6

SI SVM only  0.962  0.834  0.860  0.822  0.943  0.962
AA NN only  0.962  0.930  0.898  0.885  0.962  0.987
Decision fusion SVM–SVM  0.968  0.917  0.949  0.943  0.968  0.987
Note:
The best classification performance is indicated in bold for each class.
For each class except the transferases, the SVM–SVM method achieves up to 5.8% higher 1 − Hamming-loss than the maximum value achieved by the best classifier using a single type of information (SI or AA). There is an increase in performance regardless of the size of the class. In particular, classification of a large class such as the hydrolases showed a 5.1% increase in 1 − Hamming-loss, whereas small classes like the lyases and isomerases were classified with 5.8% and 0.6% better performance, respectively, after fusion than with SI or AA only.
Table 6 shows the results of the 10 methods according to all of the metrics that have been assessed for multi-label classification. With respect to all the indexes, we observe that the decision-level fusion schemes outperform those carrying only one type of information. More particularly, each of the SVM–SVM, SVM–NN, and NN–SVM techniques provides a distinct advantage in the process of multi-label classification. First of all, the SVM–SVM scheme is best in terms of 1 − Hamming-loss with a testing value of 95.5%, which surpasses the other methods by a margin of about 1%. This scheme also proves to be the best in terms of the three definitions of recall, meaning that if an enzyme belongs to a certain class, SVM–SVM is the most likely to detect it. In terms of predicting exact matches of the true labels, the SVM–NN method is the best one to consider, with a testing value of 85.4%, which is at least 1.3% ahead of the performance achieved considering only one type of information. One of the most notable rises in performance stems from the NN–SVM method, which outperforms the SI and AA methods by +4.1%, +2.6%, and +4.6% in terms of precision, M-precision, and m-precision, respectively. This shows not only that the relevance of class predictions is improved overall, but also, and more importantly, that sparsely populated classes benefit from this improvement as well.
Metric  SI (SVM)  SI (NN)  AA (SVM)  AA (NN)  Decision fusion SVM–SVM (α = 0.73)  Decision fusion SVM–NN (α = 0.69)  Decision fusion NN–SVM (α = 0.80)  Decision fusion NN–NN (α = 0.76)  Feature fusion (SVM)  Feature fusion (NN)

Hamming-loss  0.103  0.119  0.064  0.063  0.045  0.054  0.054  0.063  0.083  0.098
Accuracy  0.790  0.800  0.883  0.885  0.906  0.898  0.879  0.889  0.823  0.831
Precision  0.857  0.829  0.901  0.906  0.942  0.918  0.947  0.907  0.889  0.856
Recall  0.825  0.831  0.908  0.908  0.924  0.920  0.885  0.911  0.847  0.856
F1 score  0.835  0.829  0.904  0.906  0.928  0.919  0.893  0.908  0.859  0.855
Subset accuracy  0.688  0.739  0.834  0.841  0.847  0.854  0.841  0.847  0.726  0.783
Macro precision  0.921  0.744  0.940  0.941  0.962  0.945  0.967  0.903  0.927  0.806
Macro recall  0.741  0.777  0.881  0.871  0.887  0.879  0.854  0.881  0.791  0.787
Macro F1  0.801  0.758  0.902  0.897  0.921  0.905  0.905  0.889  0.844  0.794
Micro precision  0.864  0.822  0.904  0.907  0.943  0.919  0.953  0.904  0.901  0.857
Micro recall  0.829  0.832  0.910  0.910  0.925  0.922  0.885  0.913  0.850  0.857
Micro F1  0.846  0.827  0.907  0.908  0.934  0.921  0.918  0.909  0.875  0.857
Note:
The best classification performance (based on different criteria) is indicated in bold for each technique.
The code was written in MATLAB and Python and is freely and publicly available at https://figshare.com/s/a63e0bafa9b71fc7cbd7. Running on a single Intel Xeon X5650 processor, the average prediction time of the enzymatic function(s) of a new enzyme was less than 3 s. Computations were performed using high performance computing (HPC) resources from the “mesocentre” computing center of Ecole Centrale de Paris (http://www.mesocentre.ecp.fr) supported by CNRS.
Discussion and Conclusion
The results of both single-label and multi-label classifications showed that the combination of information leads to more accurate enzyme class prediction than the individual structural or amino acid descriptors. Among fusion approaches, the decision-level fusion performed better than the feature-level fusion. In the multi-label case, the SVM–NN fusion scheme achieved the best subset accuracy by correctly predicting the labels of 85.4% of the enzymes. The NN–NN fusion scheme also performed well (84.7%) and required the least computational time during the training phase. Structural information seems to be more important in multi-label classification than in single-label classification, since the optimal relative weight of amino acid sequence features during fusion was found to be smaller for multi-labeled enzymes (α ∈ [0.69, 0.80]) than for single-labeled enzymes (α ∈ [0.95, 0.99]).
In all examined cases, AA was more informative than SI with respect to the prediction of enzymatic activity. The same trend was observed in the study of Zou et al. (2013), which showed an increase of 0.81% with sequence-related features compared to structural features. However, it should be noted that we examined only general functional characteristics indicated by the first digit of the EC code. A study assessing the relationship between function and structure (Todd, Orengo & Thornton, 2001) revealed 95% conservation of the fourth EC digit for proteins with up to 30% sequence identity. Similarly, Devos & Valencia (2000) concluded that enzymatic function is mostly conserved for the first digit of the EC code, whereas more detailed functional characteristics are poorly conserved.
The single- and multi-label classification models have been trained and tested on enzymes assumed to perform single or multiple reactions, respectively. However, the single-label enzymes might be associated with other reactions not detected yet and in fact be multi-label. In order to assess the method in a more general scenario, we mixed both single- and multi-label information during the training phase and observed a slight improvement in prediction accuracy. Specifically, we chose to examine the NN–NN fusion scheme because of its small computation time, and merged SI and AA probabilities obtained from both datasets I and II. This model achieved 89.2% subset accuracy and 95.8% accuracy for the multi-label dataset (by cross-validation), indicating an increase of 4.5% and 2.1% with respect to the results obtained with the NN–NN scheme trained only on multi-labeled data (shown in Figs. 3 and 5). This also corresponds to an increase of 3.8% and 0.3%, respectively, over the best fusion schemes.
Moreover, since it is unknown whether new (testing) enzymes perform unique reactions, they have to be treated as multi-label. In order to estimate the performance of the single-label model in the case of unknown enzymes, we tested the best single-label classifier (i.e., the NN on the decision level) on the multi-label dataset. For 93.0% of the enzymes, the model correctly predicted one of their actual labels, whereas the prediction of all actual labels (by selecting the classes with the highest probability scores) was correct for 44.8% of the enzymes.
Furthermore, we investigated techniques dealing with imbalanced classes but did not observe any conclusive outcome. In particular, ADASYN improved overall accuracy on the singlelabel dataset by 0.1% but reduced balanced accuracy by 1.1%.
In conclusion, computational models calculated from experimentally acquired annotations of large datasets provide the means for fast, automated, and reproducible prediction of the functional activity of newly discovered enzymes and thus can guide scientists in deciphering metabolic pathways and in developing potent molecular agents. Future work includes the representation of the whole 3D geometry using additional structural attributes and the incorporation of deep learning architectures, which have proven to be powerful tools in supervised learning. The main advantage of deep learning techniques is the automatic exploitation of features and tuning of performance in a seamless fashion, which optimizes conventional analysis frameworks.