To increase transparency, PeerJ operates a system of 'optional signed reviews and history'. This takes two forms: (1) peer reviewers are encouraged, but not required, to provide their names (if they do so, then their profile page records the articles they have reviewed), and (2) authors are given the option of reproducing their entire peer review history alongside their published article (in which case the complete peer review process is provided, including revisions, rebuttal letters and editor decision letters).
Please take a look at the attached PDF, in which I have fixed minor typos and marked possible alternative word choices with '?'. These issues can be resolved during production.
I'm confident this work, along with your other publication this year, will be valuable to others embarking on the application of CNNs in structural biology.
I previously recommended using a truly non-redundant data set to assess the algorithm. This has still not been done. However, I accept that large data sets are required for the deep learning methods used here. The text has been rewritten to clarify this issue.
My other suggestions have been adequately addressed.
I am therefore pleased to support publication now.
In summary, although you have satisfactorily addressed many of the requested revisions in the previous round, there are two outstanding problems with the work.
1. Response to request for additional work to evaluate performance on a non-redundant training and test set.
In Reviewer 1's review, they point out that whilst you have included new statements describing the use of the PISCES pdbnraa dataset, the statistics you quote in the revised text suggest that this dataset still contains similar sequences.
I also highlight the following text (line 202):
"Analysis of protein datasets is often performed after removal of redundancy, such that the remaining entries do not overreach a pre-arranged threshold of sequence identity. In this particular work the author chose not to employ data filtering strategies, since the pattern analysis method is based on structure similarity and not sequence similarity."
This statement is concerning because it suggests you are not aware of the relationship between sequence similarity and structural similarity. Indeed, it is widely accepted that sequences with a high degree of similarity are likely to have a high degree of structural similarity. It follows that, for an ML model trained on structure to be properly evaluated, it should be tested on proteins that have no significant sequence similarity to the training set, in order to verify that the method has captured structural features of the enzyme classes and has not simply learnt evolutionary relationships.
You further state in lines 218-211 that the PISCES pdbnraa dataset was used. Indeed, the pdbnraa.gz file contains the number of chains you quote but, as Reviewer 1 suggests, the pdbnraa.gz dataset in fact only excludes PDB IDs with chains that are 100% identical to another in the set.
In order to respond properly to the request for additional experiments, please use a PISCES dataset with a more stringent cut-off for sequence similarity (e.g. 20% - as requested by Reviewer 1).
2. Response to editorial revision #1:
" Please properly introduce the problem, and provide an analysis of your predictor's performance in context of the confounding factors in enzyme function prediction: that 1) a structural class may yield several enzyme classes, and 2) (highlighted by reviewer 1) that a structural class is not a predictor of enzymatic function. Indeed, it would be extremely valuable if a method were available that could distinguish functional and non-functional structures, but your experimental design does not include an H0 dataset to assess this."
In your response to the request above, you essentially stated that consideration of protein structural class was not relevant to the paper. I do not agree.
In order for your paper to be acceptable for publication, it must:
• properly state all relevant aspects of the machine learning task, or provide relevant references, and
• provide sufficient information to communicate the issues unique to the problem domain.
My intention with the original revision request was that you provide a more complete introduction by fully explaining the relevant domain issues of the problem you are addressing. The key aspects, as pointed out by the other reviewers, are that the structural features you employ in this work are informative for both EC number *and* structural class. Since it is well known (as you yourself state) that many enzymes with different EC numbers are members of the same structural class, while other enzymes fall into distinct structural classes, you should highlight these issues in the introduction *and* specifically examine such cases in your results to assess whether your ML protocol has successfully coped with these conflicting signals.
3. Other work.
I suggested you include some discussion of how your method compares with other work. In your rebuttal you stated that due to differences in method and feature vectors used, other methods could not be properly compared with the method you described. I do not agree.
The point of benchmarking is to compare results of different methodologies, and providing the same training and test set is used, and the same performance statistics are computed, it is straightforward to compare radically different ML strategies. It is of course important, however, that the training and testing sets are sufficiently non-redundant that they do not introduce bias that favours any method. If you perform the additional experiments as requested, it should be possible to demonstrate the efficacy of your structure based approach as compared against methods that rely on sequence similarity to determine functional class.
I still think that changing the scales in Figure 3 would help. Solid blue plots are of no value.
Similarly, I still think that rescaling Figure 5 so that the plots are not nearly all blue would be helpful.
My main previous recommendation was to see whether the method works without redundancy. Unfortunately, I still think that the additional work to address this issue is unsatisfactory. The work is redone using a subset that is claimed to be non-redundant. Non-redundant is not defined, however. A reasonable definition would be no two pairs of proteins have more than 20% sequence identity when aligned. Using the PISCES site with the “highest resolution structure available and then the best R-values” would be a resolution of 1.6Å and an R-value of 0.25. Even with a sequence identity cut-off of 90%, there are still 7085 PDB files with these criteria on PISCES, as of 23/2/17. I don’t understand where the subset of 23242 proteins comes from. In any event, there must still be much redundancy if that many proteins are used. I therefore don’t think the issue of how well the method works in the absence of redundancy has been addressed.
I still think that section 2.2 would benefit from more explanation.
I think the method should be tested on a true non-redundant data set, with no protein pairs having a sequence identity of >20% when aligned.
All my other suggestions have been satisfactorily addressed.
Whilst your manuscript describes a novel method, the major concern of both reviewers is that the performance of the two convolutional models was not rigorously evaluated. Reviewer 1 provides a detailed explanation and questions regarding the data set, whilst Reviewer 2 also asks whether a hold-out set was used (i.e. a set that is used only for evaluation, after cross-validated training has been performed). On behalf of both reviewers, I therefore request that you revise and repeat the experiment as requested (non-redundant PDB dataset; independent, non-homologous test set).
In addition to this experiment, please address their comments, as well as my own below:
0. In your introduction you begin by describing metagenomics as "the field which combines the study of nucleotide sequences with their structure,". This definition is incorrect; in fact, what you describe is typically the result of 'structural genomics' efforts, in which proteins are systematically expressed and their structures determined without further biochemical characterisation.
It is worth noting that "Structural Metagenomics" has been proposed as a term: see Adam Godzik's talk at Metagenomics 2007 http://www.calit2.net/newsroom/article.php?id=1135
1. Statement of question. This paper is of interest to both biologists and computer scientists. It is essential that you provide a clear description of the domain issues in this classification problem that is both accurate and accessible to a computer scientist with only passing familiarity with macromolecular structure and evolution.
Revision: Please properly introduce the problem, and provide an analysis of your predictor's performance in context of the confounding factors in enzyme function prediction: that 1) a structural class may yield several enzyme classes, and 2) (highlighted by reviewer 1) that a structural class is not a predictor of enzymatic function. Indeed, it would be extremely valuable if a method were available that could distinguish functional and non-functional structures, but your experimental design does not include an H0 dataset to assess this.
2. In your results, you only take the top-most hit as the prediction. R1 suggests you quote confusion matrices. I strongly suggest you take this one step further and provide ROC curves (rank sensitivity/specificity plots) and area-under-the-curve (AUC) values, since these statistics are normally presented to depict performance. It is also customary to highlight specific cases where your methodology performed (or failed to perform) well.
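The requested ROC statistics can be derived directly from per-class confidence scores by treating each EC class one-vs-rest: rank the test proteins by score, sweep the rank threshold, and accumulate true/false positive rates. A minimal sketch of this procedure (the labels and scores below are illustrative, not from the manuscript):

```python
def roc_points(y_true, scores):
    """ROC curve for a binary (one-vs-rest) labelling: y_true holds 1
    for the positive class and 0 otherwise; scores are the classifier's
    confidences for the positive class."""
    pairs = sorted(zip(scores, y_true), reverse=True)
    P = sum(y_true)            # number of positives
    N = len(y_true) - P        # number of negatives
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _, y in pairs:         # lower the threshold one example at a time
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))   # (FPR, TPR)
    return pts

def auc(pts):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Illustrative: two positives ranked above two negatives -> perfect AUC.
pts = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(auc(pts))  # → 1.0
```

Repeating this per EC class and averaging the AUCs gives a single rank-based summary that is directly comparable across methods.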
Other work: Whilst we appreciate there are very few structure based EC classifiers available, it is important to discuss how this method is distinct from other structure based approaches (beyond the use of deep-learning), and to give an indication of relative performance. It would also be relevant to examine how the features your approach employs compare to those applied in other protein structure classification tasks.
Further work: you state that you are currently exploring other encoding schemes based on volumetric and topologically distinct representations of structure, but that these experiments have not yielded models that are able to generalise. If in the revised manuscript you still wish to discuss further work, then you should also provide some analysis to support these observations.
Finally, please review PeerJ's rules regarding figures and data. In particular, ensure that all figures are properly labelled and that common terminology is used (e.g. in Figure 4, 'mean centred histogram' is not a typical name for a 2D heat map), and the horizontal and vertical bins should be annotated. R1 in particular notes that Figure 4's colour scheme also makes it difficult to interpret.
There are a few small errors in the English.
Zacharaki has developed a new method to predict the EC class of a protein structure that is known to be an enzyme. The results are superficially excellent (~90% accuracy), but I think this is largely due to the use of a flawed dataset. The author uses all the enzyme PDB files. As many of these files are from the same enzyme (e.g. there are 1734 lysozymes), the method will usually work simply by recognising itself. For example, if the unknown protein is a lysozyme and there is a lysozyme in the training set, then all an algorithm needs to do is spot a copy of itself in the training set (k-mean algorithms will do this). If a dataset like this is used, then there is no need to develop any kind of new algorithm, since PSI-BLAST will already work perfectly well. The method may well, however, work poorly on any structure that does not have a homologue in the training set – hence it is overfit.
I therefore think that the work should be redone using a data set that has no pairs of similar proteins. A sequence cut-off of no more than 20% sequence identity could be used. The author says herself: “Assessment of the relationship between function and structure (Todd et al., 2001) revealed 95% conservation of the fourth EC digit for proteins with up to 30% sequence identity.” I think that is why the method works – if proteins are present with high sequence identity, then they have the same EC number.
It is assumed that the protein is an enzyme in advance (with a single EC number). The work would be much more powerful if it was coupled with predicting whether it is an enzyme, since then it could be applied to any protein structure.
If these changes are made, and the model has a good performance, it could be publishable, as the method is original.
The numbers of proteins in each EC class look very odd. The pdb (http://www.rcsb.org/pdb/browse/jbrowse.do?t=3&useMenu=no) gives these frequencies for each class:
• 1: Oxidoreductases - [ 12950 Entities ]
• 2: Transferases - [ 20733 Entities ]
• 3: Hydrolases - [ 28864 Entities ]
• 4: Lyases - [ 5705 Entities ]
• 5: Isomerases - [ 2852 Entities ]
• 6: Ligases - [ 2747 Entities ]
Why are the numbers in Table 1 so different? Ligases should be the rarest, but here they are the most frequent.
It is informative to report a confusion matrix for each model, i.e. numbers of false positives, false negatives etc., rather than just accuracy.
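For a multi-class predictor such as this, the confusion matrix is tallied directly from paired true/predicted labels, and per-class false positives and false negatives fall out of its row and column sums. A minimal sketch (the EC class labels and example predictions are illustrative):

```python
def confusion_matrix(y_true, y_pred, labels):
    """m[i][j] = number of examples of true class labels[i]
    predicted as labels[j]."""
    idx = {c: k for k, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

def per_class_errors(m, labels):
    """FN = off-diagonal row sum; FP = off-diagonal column sum."""
    stats = {}
    for k, c in enumerate(labels):
        fn = sum(m[k]) - m[k][k]
        fp = sum(row[k] for row in m) - m[k][k]
        stats[c] = {"TP": m[k][k], "FN": fn, "FP": fp}
    return stats

labels = ["EC1", "EC2", "EC3"]                   # illustrative subset
y_true = ["EC1", "EC1", "EC2", "EC3", "EC3"]
y_pred = ["EC1", "EC2", "EC2", "EC3", "EC1"]
m = confusion_matrix(y_true, y_pred, labels)
print(per_class_errors(m, labels))
```

Reporting the full matrix also reveals which EC classes are confused with each other, which a single accuracy figure hides.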
“The assessment of structure/function relationship however is hampered by the lack of a unified functional classification scheme of the protein universe.” What about SCOP, CATH or DALI?
Figure 1 is hard to understand. More explanation in the legend would help. What is Bnorm, for example?
The axes in Figure 2 should be labelled.
Figure 2 is simply low-resolution Ramachandran plots for each amino acid. I don’t think it is helpful to put every amino acid on the same scale of 0–4. Rare amino acids will then have few visible features, so very rare ones (e.g. Asx) appear solid blue.
Clarify in Table 1 that the correlation columns are the k-mean results.
Do the colours in Figure 4 largely reflect the abundance of each amino acid in each class? For example, does EC2 have a lot of Ala, thus making the plot nearly all yellow? It would probably be clearer to rescale for each plot, as all yellow or all blue is unhelpful.
Say what the 23 rows in Figure 4 are.
There are some issues with the experimental design:
1) The experimental data contains redundancy. That is, some proteins are highly similar at the sequence or structure level. The authors should remove highly redundant proteins from both the training and validation sets. For example, they could use a 30% or 40% sequence identity cutoff to exclude redundancy.
2) Is there any redundancy between the training and validation sets?
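The filtering suggested in point 1) is commonly implemented as a greedy selection over precomputed pairwise identities: keep a chain only if it is below the cutoff against everything already kept. A minimal sketch, assuming pairwise identities (e.g. from an all-against-all alignment) are supplied as a symmetric lookup table; the chain identifiers and identity values are illustrative:

```python
def remove_redundancy(proteins, identity, cutoff=0.30):
    """Greedy filter: keep a chain only if its sequence identity to
    every previously kept chain is below `cutoff`.
    `identity[(a, b)]` holds the pairwise identity (symmetric)."""
    kept = []
    for p in proteins:
        if all(identity.get((p, q), identity.get((q, p), 0.0)) < cutoff
               for q in kept):
            kept.append(p)
    return kept

# Illustrative: B is 95% identical to A; C is unrelated to both.
ident = {("A", "B"): 0.95, ("A", "C"): 0.12, ("B", "C"): 0.10}
print(remove_redundancy(["A", "B", "C"], ident, cutoff=0.30))  # → ['A', 'C']
```

Applying such a filter before the train/validation split also answers point 2), since no retained pair can exceed the cutoff across the two sets.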
There is no detailed comparison with other methods using the same set of test proteins, so it is hard to judge whether the proposed method indeed improves the state of the art.
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.