All reviews of published articles are made public. This includes manuscript files, peer review comments, author rebuttals and revised materials. Note: This was optional for articles submitted before 13 February 2023.
Peer reviewers are encouraged (but not required) to provide their names to the authors when submitting their peer review. If they agree to provide their name, then their personal profile page will reflect a public acknowledgment that they performed a review (even if the article is rejected). If the article is accepted, then reviewers who provided their name will be associated with the article itself.
Thank you for doing one more round of revisions.
# PeerJ Staff Note - this decision was reviewed and approved by Keith Crandall, a PeerJ Section Editor covering this Section #
Both reviewers appreciate the extensive revisions you have performed. However, one reviewer feels the manuscript remains fairly inaccessible to a broader biological audience. I agree with this assessment, and I would like to ask you to go over the manuscript one more time and see if there are ways to make the presentation more accessible.
I will leave the extent to which you want to carry out such revisions up to you. I agree with the 2nd reviewer that it is in your interest to write an article that is broadly accessible and has an impact on its field. I do not expect that a revision will have to be re-reviewed.
In my initial review I raised a number of concerns about the writing style and presentation of the results, finding it to initially be a very dense paper for a biological audience. In this revision, the authors have completely overhauled the manuscript. In the current version, the writing is clear, the problem is introduced well, the conclusions are clear, software is made available (with a working link), and the figures (particularly the addition of a toy example) are much improved. I see no other major concerns to note at this point.
Responses to my initial comments/concerns were both very thorough and courteous. I believe the revision to be a much improved manuscript, and thank the authors for taking the time and care to address those initial concerns so clearly and comprehensively.
In their revised version, the authors have made a major effort to improve the presentation of their paper. However, as already pointed out by both reviewers in the first round, the paper remains quite hard to access:
- It contains valid scientific work, even if the application to small protein fragments appears limiting as compared to the formal development beyond the PCFG-CM approach.
- The presentation is very formal. Much of the notation goes far beyond what is needed to follow the manuscript, and it makes the article inaccessible to a large fraction of the potentially interested readership (e.g. researchers interested in protein sequence modeling and annotation).
- The discussion of the sample applications in terms of, e.g., discriminative power and average precision remains quite superficial, concentrating on the comparison of a few numbers characterizing global performance.
- Some of my questions have been answered in the rebuttal letter, but these answers have not necessarily found a clear way into the manuscript (e.g. the selection of data sets, scanning e.g. SwissProt or PDB with the Prosite motifs gives many more positive hits; the selection of some model details like the cardinalities of VN and VT, etc.). The motivation and generality behind these choices thus remain unclear in the paper.
So the impression is that almost the entire interest of the authors went into the formal development of the PCFG-CM framework, and little into the biological problem. From my point of view (but I might be the wrong reviewer for this kind of manuscript, not having a background in computer science), this is a pity. Early papers suggesting very similar mathematical structures (PCFG with structural constraints), written by authors like Eddy, Durbin, Haussler and others in the 1990s, are much more accessible to a broader audience. These papers have, without doubt, changed the way we look at RNA bioinformatically.
I sincerely think that a less formal presentation of the material would facilitate the access to this work. However, the style of presentation should be a choice of authors.
Both reviewers have concerns about the accessibility of your manuscript to a biological audience. Reviewer 2 also raises important concerns regarding the applicability of the methods to realistic protein sequences. Finally, I agree with Reviewer 1 that the code should be made publicly available.
1. It’s not immediately clear to me who the audience of this paper is. Though I’m interested and fairly well read in protein evolution, sequence-structure constraints, etc., there are a number of terms and concepts presented in this manuscript that are new to me and the writing assumes that readers have a pretty far-ranging prior background. While this of course may (and surely does to some extent) reflect limitations of this humble reviewer’s knowledge, the authors are nevertheless missing an opportunity to more clearly explain the value of this research more broadly to people who may be interested in these concepts (i.e. me). “Grammar” in general will of course be familiar to most but care should be taken to explain the usefulness of this concept to protein sequences/evolution. “Context-free grammar” on top of that is a term that many/most will likely not have heard of or will be only vaguely aware of. Other concepts such as “syntactic trees” and “parse trees” are just kind of thrown out there and I’m doubtful that most readers will immediately know what the authors are talking about. A toy example presented in a figure or two would really work wonders to teach the importance of these concepts and make the authors’ advances clearer and more widely applicable.
2. Of all the writing, the abstract in particular is dense and difficult to parse. This is going to be the “sales pitch” for most potential readers and I strongly encourage the authors to clean this up and simplify it with clear problem statements and proposed solutions.
3. There were a few typos/grammar issues throughout but it wasn’t anything too bad. That said, the abstract (in addition to being dense) had much worse English than the remainder of the manuscript and really starts out on the wrong note and should be thoroughly edited in addition to being simplified. A few examples throughout that I came across:
a. An overall poor start in terms of English/grammar is that the first sentence of the abstract is a bit unwieldy and I believe grammatically incorrect. “Learning language of protein…” == “Learning the language of protein…”.
b. Still in that first sentence of the abstract… having 2 “which” clauses in a sentence is clunky. This could be changed by giving a full sentence to defining context-free grammars as I believe few readers will know/understand this term. It’s all to say that a sentence like this could slide by more freely in the middle of a manuscript but this reviewer is of the opinion that the abstract is the portion of the manuscript that should have the most careful editing / precision of language to hook potential readers.
c. Also in the abstract “Within the framework…” == “Within this framework…”. “The” is a bit ambiguous as if there were one and only one whereas “this” specifies that we are discussing the authors’ contribution.
d. I’d avoid any mention of heavy concepts like parse trees in the abstract entirely if possible because these will need to be defined.
e. Page 2 lines 57/58 “which do not outperform significantly HMMs” == “which do not significantly outperform HMMs”
f. Page 12 line 346: “amino acid sequence of” == “amino acid sequences of”
g. Page 17 line 467: “Complex character…” == “The complex character…”
h. Page 17 line 497: “…may even more benefit…” == “…may benefit even more…”
4. Code should be made publicly available
1. Figure 2: actual structures here would be nice in addition to the simplified diagrams
2. The mathematical abbreviations in Tables 2 and 3 headings could be simplified to increase readability. I had to keep flipping back to where these terms were defined to understand the differences in the columns. A more straightforward textual description might be easier to grasp for those who won’t be diving fully into the methods
3. Page 1 Line 28: “a near infinite number of sequences”. Perhaps it’s a misunderstanding on my part, but I don’t see how an infinite number of sequences can exist unless they are of infinite length. For a finite alphabet and bounded length, the number is of course astronomical, but it is not infinite.
4. The introduction is very well written and seems to have been edited far more closely than the abstract. Though an orienting “summary” paragraph at the end of the introduction might be helpful to add to again focus the readers on the structure of the paper, the problem, the solution, etc.
5. The methods are extraordinarily dense (Pages 4-11). Which may suit some readers fine and I don’t think should necessarily be cut down. But I note it here only that this reviewer was unable to properly evaluate them and more generally would encourage the authors to make sure that their results are written predicated on the fact that few readers will have read / fully grasped their methods.
6. Conclusions/Discussion could probably be combined into a single section. Or more typically if both occur the conclusion is typically a shorter summary paragraph rather than the longer of the two. I thought that the conclusion section was well written, fairly addressed the limitations, and hypothesized paths forward for this research and this should probably be put in the discussion.
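To put a number on item 3 above (Page 1 Line 28): for an alphabet of size k and sequences of length L, there are k^L possible sequences, which is finite for any bounded L. The following is a minimal Python sketch of that arithmetic. The alphabet size of 20 (the standard amino acids) and the example length of 100 are illustrative assumptions, not values taken from the manuscript, and the function names are hypothetical.

```python
ALPHABET_SIZE = 20  # assumption: the 20 standard amino acids


def num_sequences(length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Number of distinct sequences of exactly `length` residues."""
    return alphabet_size ** length


def num_sequences_up_to(max_length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Number of distinct non-empty sequences of at most `max_length` residues."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


if __name__ == "__main__":
    count = num_sequences(100)  # example length chosen for illustration
    # Python integers are arbitrary precision, so the exact value is available.
    print(f"20^100 has {len(str(count))} decimal digits")
```

The count grows exponentially with length, so it quickly becomes astronomically large while remaining finite.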
1. I really struggled after two read-throughs to pinpoint the problem that the authors were addressing. Learning the language of protein sequences seems to be the goal, but this is of course a very abstract goal and hard for the reader to grasp in practical terms. While working towards this goal, the authors do present results that are concrete and I was kind of surprised when they came given that this information is not clearly laid out in the abstract. A clear practical goal/application seems to be to use contact information to supplement sequence information in order to better discriminate protein family members but nowhere is this made entirely clear. Being able to better define protein families using these methods could for instance have a large practical benefit of increasing the accuracy of protein homology search (a very valuable application). Page 2 lines 37-53 are the clearest articulation of this vision in the context of broader research but this definition of the problem appears nowhere in the abstract and really seems limited in the overall arc of the paper. If I’m correct that this seems to be the main practical benefit/goal of such research, this really needs to be made more explicit and scattered throughout the abstract/intro/results/discussion. If this is one of the main practical goals, well then it would seem apt to compare the methods developed here to the more limited profile-HMMs in the results (or are profile-HMMs fully equivalent to the grammars estimated without contact constraints? Entirely possible that this is the case, but if so, I think that’s another point of confusing terminology as I believe the term profile-HMM will be far more widespread to the likely readers of this paper than context-free/-sensitive grammars. The solution would simply be to reiterate the fact that those methods are equivalent, but I’m still not sure that they are).
2. I still don’t fully get the “Descriptive power” section. Which again just relates to some lack of clarity in the writing or assumption of prior knowledge that I simply do not have. What precisely is even being classified/described here with the precision scores? True positive contacts from the structure? It’s not clear how their method converts probabilities into binary predicted contacts. Which is to say, it might be described there but again in a very abstract way that I have difficulty extracting using their terminology. It seems that this information comes directly from parse trees but if the link is kind of clear and obvious well then perhaps a diagram or figure again could show this link to someone who is unfamiliar with some of these concepts (i.e. has never heard of a parse tree, as I suspect some/many readers may be in a similar situation to myself). Additionally the value of that “descriptive power” problem seems a bit weak. Which is to say training a model with contact information helps better predict contacts? I’m not sure if I’m getting all of that properly, but it seems a bit tautological and not entirely surprising. The authors should clarify/expand why I’m being unfair in that regard.
1. Table 1: It’s unclear to me why the number of nneg varies for the different protein fragment targets. Is this a consequence of the different lengths? 829 negative sequences were used, and it is said that these are cut into matching lengths for the positive set (are these cuts overlapping? i.e. is a 100 amino acid sequence with target length 20 cut into 5 sequences or 80?) so this would make sense but could be made a bit more methodologically explicit.
Dyrka et al. approach an important and topical issue, the relationship between structural constraints and protein sequences, using the concepts of context-free grammars. Notably, they show how to incorporate protein contact constraints into context-free grammars to develop methods that can better discriminate protein family members and recapitulate the structural properties of the protein families in question. Overall, I find that the research is solid and can think of few/no actual analytical objections to any of the research presented that would require any re-analysis or further experiments. However, the biggest qualms I have with the paper are that it presents a very high bar of assumed prior knowledge and a lack of clear goals that severely limits comprehension of their results and advances.
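The two readings of "cut into matching lengths" in the Table 1 question above give very different fragment counts, which is easy to check. The sketch below is only an illustration of that arithmetic: the function name and the `stride` parameter are hypothetical, and the manuscript's actual cutting procedure is not known from this review.

```python
def count_windows(seq_len: int, window: int, stride: int) -> int:
    """Number of length-`window` fragments cut from a sequence of length
    `seq_len` when the window start advances by `stride` residues."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if window > seq_len:
        return 0
    return (seq_len - window) // stride + 1


if __name__ == "__main__":
    # Non-overlapping cuts: the window moves by its own length.
    print(count_windows(100, 20, stride=20))
    # Fully overlapping (sliding) cuts: the window moves one residue at a time.
    print(count_windows(100, 20, stride=1))
```

For a 100-residue sequence and a target length of 20, non-overlapping cuts yield 5 fragments, while a sliding window with a stride of 1 yields 81, close to the figure of 80 quoted above. Either convention would make nneg depend on the target length.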
The authors present the description of sequences of protein fragments using probabilistic context-free grammars (PCFG) using contact map constraints. This approach is successful in statistical models of RNA since the contacts in RNA secondary structure naturally fit the framework of PCFG (nested but no crossing contacts). These methods have not been applied much to proteins, which are more frequently described via profile HMM. The latter cannot include non-local interactions, and thus structural constraints for contacts bringing together residues distant in the primary structure. The authors propose PCFG to overcome these limitations.
While the generalization beyond HMM is an important question, I have a number of basic remarks concerning the manuscript. In general, it is written quite clearly, but also in a very formal language accessible mostly to computer scientists. It seems more written for the proceedings of a computer science conference than for a life-science journal. An effort to give motivations behind formal constructions should be made.
See below under "Validity of findings"
1) The authors propose PCFG as structurally informed probabilistic models for protein sequences. I do not understand the motivations for this choice, since contact maps of proteins have many characteristics which do not seem compatible with the construction rules of CFG: crossing contacts like in alpha helices (e.g. (i,i+4) and (i+1,i+5)) or parallel beta sheets (e.g. (i,j), (i+1,j+1), …, (i+l,j+l)), and residues with multiple contacts to distinct regions of the protein instead of one-to-one contacts. So the target might be small fragments like the ones shown in Fig. 2 and analyzed in the paper (by the way, Fig. 2 is basically unreadable, a little graphical effort might be helpful there). But wouldn’t the effort of developing PCFG be large for fragments of very specific structure? I miss an overall motivation and justification of the model setup, and a serious discussion of the limitations of application. More general graphical models or Markov random fields (MRF) seem more adapted to the complex structure of protein contact maps.
2) In the definition of the PCFG used, it remains unclear why V_T is not directly identified with \Sigma. Probably the use of three symbols for V_T and four for V_N makes the model more flexible (more parameters) than PCFG used for RNA, but why values 3 and 4 for these cardinalities? The justification seems to be the computational complexity of the problem, but is there any argument that fewer symbols would be worse in performance? Also profile HMM do not have this kind of multiplicity, since each match state has a single amino-acid emission matrix. The precise model settings thus remain somewhat unmotivated.
3) Even if the number of elements of V_T, V_N remains restricted, the PCFG framework currently does not seem to be applicable to sequences longer than 20-30 amino acids, and alignments have to be gapless. This seems quite a restriction. MRF are currently used for hundreds of amino acids in multiple-sequence alignments with gaps.
4) In the tests on protein fragments, very few positive sequences (24-160) are used. Why so few? The fact that performance decreases with npos is counter-intuitive: One would expect that model parameters are more precisely estimated when samples are larger. Larger samples would also, in principle, make average-precision results less noisy.
5) The entire protein space is approximated by 829 single sequences of length 300-500, constructed from old data in 2006. UniProt contains more than 100 million sequences, Pfam lists more than 16,000 protein families, the majority of them containing structurally resolved examples in the PDB. So I would guess that the negative set is an extremely rough approximation of protein sequence space, but the resulting sample is already three orders of magnitude larger than the corresponding npos values.
I am afraid that I miss some basic point, but even after reading the manuscript several times I continue to miss it. I would therefore suggest that the authors substantially revise their manuscript to make their work more accessible and the motivations and limitations, in particular for biological applications, clearer.
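The crossing-contact objection in point 1) can be made concrete with a small check. Two contacts (i, j) and (k, l) cross when i < k < j < l, whereas nested or disjoint contacts do not, and the latter are the pattern noted above as fitting PCFG naturally. The following Python sketch is illustrative only: the function name and the toy contact lists are assumptions made for this example and are not taken from the manuscript.

```python
from itertools import combinations


def crossing_pairs(contacts):
    """Return every pair of contacts ((i, j), (k, l)) with i < k < j < l.

    `contacts` is an iterable of residue-index pairs; order within a pair
    does not matter, and duplicates are ignored.
    """
    normalized = sorted({(min(a, b), max(a, b)) for a, b in contacts})
    crossings = []
    # After sorting, the first contact of each pair starts no later than the second.
    for (i, j), (k, l) in combinations(normalized, 2):
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings


if __name__ == "__main__":
    helix = [(1, 5), (2, 6)]                # (i, i+4) and (i+1, i+5)
    hairpin = [(1, 10), (2, 9), (3, 8)]     # antiparallel, fully nested
    parallel = [(1, 11), (2, 12), (3, 13)]  # (i, j), (i+1, j+1), ...
    for name, contacts in [("helix", helix), ("hairpin", hairpin), ("parallel", parallel)]:
        print(name, len(crossing_pairs(contacts)))
```

In these toy cases the nested hairpin has no crossing pairs, while the helical and parallel-sheet patterns do, which is the situation a single context-free derivation cannot pair up directly.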
All text and materials provided via this peer-review history page are made available under a Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.