Estimating probabilistic context-free grammars for proteins using contact map constraints

Interactions between amino acids that are close in the spatial structure, but not necessarily in the sequence, play important structural and functional roles in proteins. These non-local interactions ought to be taken into account when modeling collections of proteins. Yet the most popular representations of sets of related protein sequences remain the profile Hidden Markov Models. By modeling independently the distributions of the conserved columns from an underlying multiple sequence alignment of the proteins, these models are unable to capture dependencies between the protein residues. Non-local interactions can be represented by using more expressive grammatical models. However, learning such grammars is difficult. In this work, we propose to use information on protein contacts to facilitate the training of probabilistic context-free grammars representing families of protein sequences. We develop the theory behind the introduction of contact constraints in maximum-likelihood and contrastive estimation schemes and implement it in a machine learning framework for protein grammars. The proposed framework is tested on samples of protein motifs in comparison with learning without contact constraints. The evaluation shows high fidelity of grammatical descriptors to protein structures and improved precision in recognizing sequences. Finally, we present an example of using our method in a practical setting and demonstrate its potential beyond the current state of the art by creating a grammatical model of a meta-family of protein motifs. We conclude that the current piece of research is a significant step towards more flexible and accurate modeling of collections of protein sequences. The software package is made available to the community.

work, we establish a framework for learning probabilistic context-free grammars for protein sequences from syntactic trees partially constrained using amino acid contacts obtained from wet experiments or computational predictions, whose reliability has substantially increased recently.
Within the framework, we implement the maximum-likelihood and contrastive estimators of parameters for simple yet practical grammars.Tested on samples of protein motifs, grammars developed within the framework showed improved precision in recognition and higher fidelity to protein structures.The framework is applicable to other biomolecular languages and beyond wherever knowledge of non-local dependencies is available.Keywords: probabilistic contextfree grammar, syntactic tree, structural constraints, protein sequence, protein contact map, maximum-likelihood estimator, contrastive estimation 1 Introduction

Grammatical modeling of proteins
The essential biopolymers of life, nucleic acids and proteins, share the basic characteristic of the languages: infinite number of sequences can be expressed with a finite number of monomers.In the case of proteins, merely 20 amino acids species (letters) build millions of sequences (words or sentences) folded in thousands of different spatial structures playing various functions in living organisms (semantics).Physically, the protein sequence is a chain of amino acids linked by peptide bonds.The physico-chemical properties of amino acids and their interactions across different parts of the sequence defines its spatial structure, which in turn determines biological function to great extent.Similarly to words of the natural language, protein sequences may be ambiguous (the same amino acid sequence folds into different structures depending on the environment), and often include non-local dependencies and recursive structures [Searls, 2013].
Not surprisingly the concept of protein language dates back to at least 1960s [Pawlak, 1965], and since early applied works in 1980s [Brendel andBusse, 1984, Jimenez-Montao, 1984] formal grammatical models have gradually gained importance in bioinformatics [Searls, 2002, 2013, Coste, 2016].Most notably, Hidden Markov Models (HMM), which are weakly equivalent to probabilistic regular grammars, became the main tool of protein sequence analysis.Profile HMM are commonly used for defining protein families [Sonnhammer et al., 1998, Finn et al., 2015] and for searching similar sequences [Eddy, 1998, 2011, Soeding, 2005, Remmert et al., 2012]; and more expressive HMM are developed [Coste andKerbellec, 2006, Bretaudeau et al., 2012].Yet, their explanatory power is limited since, as regular level models, they cannot capture non-local interactions, which occur between amino acids distant in sequence but close in the spatial structure of the protein.Many of these interactions have a character of nested, branched and crossing dependencies, which in terms of grammatical modelling requires context-free (CF) and context-sensitive (CS) level of expressiveness [Searls, 2013].However, grammatical models beyond regular levels have been rather scarcely applied to protein analysis (a comprehensive list of references can be found in [Dyrka et al., 2013].This is in contrast to RNA modeling, where CF grammatical frameworks are well developed and power some of the most successful tools [Sakakibara et al., 1993, Eddy and Durbin, 1994, Knudsen and Hein, 1999, Sükösd et al., 2012]. One difficulty with modeling proteins is that interactions between amino acids are often less specific and more collective in comparison to RNA.Moreover, the larger alphabet made of 20 amino acid species instead of just 4 bases in nucleic acids, combined with high computational complexity of CF and CS grammars, impedes inference, which may lead to solutions which does not outperform significantly HMMs [Dyrka andNebel, 2009, Dyrka et al., 2013].Yet, some studies hinted that CF level of expressiveness brought an added value in protein modeling when grammars fully benefiting from CF nesting and branching rules were compared in the same framework to grammars effectively limited to linear (regular) rules [Dyrka, 2007, Dyrka et al., 2013].Good preliminary results were also obtained on learning sub-classes of CF grammars to model protein families, showing the interest of taking into account long distance correlations in comparison to regular models [Coste et al., 2012[Coste et al., , 2014].
An important advantage of CF and CS grammars is that parse trees they produce are human readable descriptors.In RNA modeling, the shape of parse trees can be used for secondary structure prediction [Dowell and Eddy, 2004].In protein modeling, it was suggested that the shape of parse trees corresponds to protein spatial structure [Dyrka and Nebel, 2009], and that they can also convey biologically relevant information [Sciacca et al., 2011, Dyrka et al., 2013].

Grammar estimation with structural constraints
In this piece of research the focus is on learning probabilistic context-free grammars (PCFG) [Booth, 1969].Learning PCFG consists in estimating the unfixed parameters of the grammar with the aim of concentrating probability mass from the entire space of possible sequences and their syntactic trees to the target population, typically represented by a sample.The problem is often confined to assigning probabilities to fixed production rules of a generic underlying non-probabilistic CFG [Lari and Young, 1990].Typically the goal is to estimate the parameters to get a grammar maximizing the likelihood of the (positive) sample, while, depending on the target application, other approaches also exists.For example, the contrastive estimation aims at obtaining grammars discriminating target population from its neighbourhood [Smith and Eisner, 2005].
The training sample can be made of a set of sequences or a set of syntactic trees.In the former case, all derivations for each sentence are considered valid.Given the underlying non-probabilistic CFG, probabilities of rules can be estimated from sentences in the classical Expectation Maximization framework (e.g. the Inside-Outside algorithm [Baker, 1979, Lari andYoung, 1990]), however, the approach is not guaranteed to find the globally optimal solution [Carroll and Charniak, 1992].
Heuristic methods applied for learning PCFG from positive sequences include also iterative biclustering of bigrams [Tu and Honavar, 2008], and genetic algorithms using a learnable set of rules [Kammeyer and Belew, 1996, Keller and Lutz, 1998, 2005] or a fixed covering set of rules [Tariman, 2004, Dyrka andNebel, 2009].
Much more information about the language is conveyed in the syntactic trees.If available, a set of trees (a treebank) can be directly used to learn a PCFG [Charniak, 1996].Usability of structural information is highlighted with the result showing that a large class of non-probabilistic CFG can be learnt using unlabeled syntactic trees (called also skeletons) of the training samples [Sakakibara, 1992].Algorithms for learning probabilistic CF languages, which exploits structural information in syntactic trees, have been proposed [Sakakibara et al., 1993, Eddy and Durbin, 1994, Carrasco et al., 2001, Cohen et al., 2014].An interesting middle way between plain sequences and syntactic trees are partially bracketed sequences, which constrain the shape of the syntactic trees (the skeletons) but not node labels.The approach was demonstrated to be highly effective in learning natural languages [Pereira and Schabes, 1992].It was also applied to integrating uncertain information on pairing of nucleotides of RNA [Knudsen, 2005].In this approach the modified bottom-up parser penalizes probability on derivations inconsistent with available information on nucleotide pairing in such way that the amount of the penalty is adjusted according to confidence of the structural information.

Protein contact constraints
To our knowledge constrained sets of syntactic trees have never been applied for estimating PCFG for proteins.In this research we propose to use spatial contacts between amino acids distant in sequence as a source of constraints.Indeed, an interaction between amino acids, which forms a dependency, usually requires a contact between them, defined as spatial proximity.Until recently, extensive contact maps were only available for proteins with experimentally solved structures, while individual interactions could be determined through mutation-based wet experiments.
Currently, reasonably reliable contact maps can also be obtained computationally from large collective alignments of evolutionary related sequences.The rationale for the contact prediction is that if amino acids at a pair of positions in the alignment interact then a mutation at one position of the pair often requires a compensatory mutation at the other position in order to maintain the interaction intact.Since only proteins maintaining interactions vital for function successfully endured the natural selection, an observable correlation in amino acid variability at a pair of positions is expected to indicate interaction.However, standard correlations are transitive and therefore cannot be immediately used as interaction predictors.The break-through was achieved recently by Direct Coupling Analysis (DCA) [Weigt et al., 2009], which disentangles direct from indirect correlations by inferring a model on the alignment which can give information on the interaction strengths of the pairs.There are different DCA methods based on how the model, which is usually a type of the Markov Random Field, is obtained [Morcos et al., 2011, Jones et al., 2012, Ekeberg et al., 2013, Kamisetty et al., 2013, Seemayer et al., 2014, Baldassi et al., 2014].The state-of-the-art DCA-based meta-algorithms achieve mean precision in the range 42-74% for top L predicted contacts and 69-98% for top L/10 predicted contacts, where L is the protein length [Wang et al., 2017].Precision is usually lower for shorter sequences and especially for smaller alignments, however a few top hits may still provide relevant information [Daskalov et al., 2015].

Structure of the document
The rest of the document is organized as follows.Section 2 introduces the main contribution of this work.First, a novel PCFG-CM framework for learning PCFG with the Contact-Map constraints is established, for which the maximum-likelihood and contrastive estimators are defined (section 2.1).
Second, a special instance of the problem, both simple and practical, is considered: suitable forms of the grammar and the contact constraints are defined, a variant of the bottom-up chart parser is proposed, and effective calculations of one of the contrastive estimators for the proposed form of grammar are given (section 2.2).The setup of experimental evaluation of the PCFG-CM approach for the special instance is described in section 2.3.Section 3 presents sample data and results of evaluation.Eventually, section 4 concludes the document with discussion of the results, limitations and perspectives for future work.

Basic notations
Let Σ be a non-empty finite set of atomic symbol (representing for instance amino acid species).
The set of all finite strings over this alphabet is denoted by Σ * .Let |x| denote the length of a string x.The set of all strings of length n is denotes by Unlabeled syntactic tree An unlabeled syntactic tree (UST) u for x is an ordered rooted tree such that the leaf nodes are labeled by x, which is denoted as yield(u) = x, and the non-leaf nodes are unlabeled.Let U * denotes the set of all USTs that yield a sequence in Σ * , let U n = {u ∈ U * : yield(u) ∈ Σ n }, where n is a positive integer, and let Σ is defined as above, V is a finite set of non-terminal symbols (also called variables) disjoint from Σ, v 0 ∈ V is a special start symbol, and R is a finite set of rules rewriting from variables into strings of variables and/or terminals R = {r which defines an ordered parse tree y starting from the root node labeled by v 0 .In each step, by applying a rule r i : v j → α 1 . . .α k , tree y is extended by adding edges from the already existing left-most node labeled v j to newly added nodes labeled α 1 to α k .Therefore there is a one-to-one correspondence between derivation r and parse tree y.Derivation r is complete if all leaf nodes of the corresponding (complete) parse tree y are labeled by symbols in Σ. Sets Y * , Y n and Y x are defined as for the USTs.For a given parse tree y, u(y) denotes the unlabeled syntactic tree obtained by removing the non-leaf labels on y.Given a UST u, let Y G (u) be the set of all parse trees for grammar G such that u(y setting for each rule v k → α its probability to be chosen to rewrite v k with respect to other rules The probability of parse tree y using the probability measure induced by G is given by the probability of the corresponding derivation r = r 1 . . .r l : G is said to be consistent when it defines probability distribution over Y * : and the probability of UST u ∈ U x given G is: Since Y x and Y G (u) define each a partition of Y * for x ∈ Σ * and for u ∈ U * , a consistent grammar G defines also a probability distribution over Σ * and U * .

Contact constraints
Most proteins sequences fold into complex spatial structures.Two amino acids at positions i and j in the sequence x are said to be in contact if distance between their coordinates in spatial structure d(i, j) is below a given threshold τ .A full contact map for a protein of length n is a binary symmetric matrix m full = (m i,j ) n×n such that m i,j = [d(i, j) < τ ], where [x] is the Iverson bracket.Usually only a subset of the contacts is considered (cf section 1.3).A (partial) contact map for a protein of length n is a binary symmetric matrix m = (m i,j ) n×n such that m i,j = 1 =⇒ d(i, j) < τ .Let d u (i, j) is the length of the path from i-th to j-th leaf in UST u for x.Given a threshold δ, UST u is said to be consistent with a contact map m of length n if m i,j = 1 =⇒ d u (i, j) < δ.
For a contact map m of length n, let U m n denotes the subset of U n consistent with m, and U m x denotes the subset of U x consistent with m.Note that U m x = U m n ∩ U x .Analogous notations apply to parse trees.

Estimation
Learning grammar G = Σ, V, v 0 , R, θ can be seen as estimating the unfixed parameters of G with the aim of concentrating probability mass from the entire space of unlabeled syntactic trees U * to the set of unlabeled syntactic trees for the target population U target .In practice, only a sample of the target population can be used for learning, hence estimation is performed on U sample ⊆ U target .
Note that even in the most general case the set of terminal symbols Σ is implicitly determined by the sample; moreover the start symbol v 0 is typically also fixed.A common special case confines learning grammar G to estimating θ for a fixed quadruple of non-probabilistic parameters Σ, V, v 0 , R (which fully determine the non-probabilistic grammar G underlying G).Given inferred grammar G * and a query set of unlabeled syntactic trees U query , probability prob(U query | G * ) is an estimator of the likelihood that U query belongs to population U target .
Maximum likelihood grammar Let X be a sample set of sequences in Σ * , and let M be a set of corresponding contact matrices.The sample set S = [XM] consists of a set of tuples (x, m), where x ∈ X and m ∈ M. Let U M X be the corresponding set of compatible USTs: Grammar G that concentrates probability mass on U M X can be estimated using the classical Bayesian approach: .
Noting that prob(U M X ) does not influence the result and, in the lack of prior knowledge, assuming prob(G) uniformly distributed among all G, the solution is then given by the maximum likelihood formula: Assuming independence of U m x s: In the absence of contact constraints the maximization problem becomes equivalent to the standard problem of estimating grammar G given the sample X: where m = 0 denotes a square null matrix of size equal to the length of the corresponding sequence, and Contrastive estimation Often it is reasonable to expect that U query comes from a neighbourhood of the target population N (U target ) ⊂ U * .In such cases it is practical to perform contrastive estimation [Smith and Eisner, 2005], which aims at concentrating probability mass distributed by the grammar from the neighbourhood of the of sample N (U sample ) to the sample itself U sample , such that: .
Consider two interesting neighbourhoods.First, assume that contact map m is known and conserved in the target population and hence in the sample: U m X = {U m x : x ∈ X}.This implies the same length n of all sequences.Then U m n is a reasonable neighbourhood of the target population, so Second, assume that sequence x is known to be yielded by the target population and the goal is to maximize likelihood that shapes of parse trees generated with G are consistent with contact map m.
Then U X is a reasonable neighbourhood of the sample U M X , so .

Definitions
Let G = Σ, V, v 0 , R, θ be a probabilistic context-free grammar such that Subsets R a , R b and R c are referred to as lexical, branching, and contact rules, respectively.Joint subset R b ∪ R c is referred to as structural rules.
Let m be a contact matrix compatible with the context-free grammar, i.e. no pair of positions in contact overlaps nor crosses boundaries of other pairs in contact (though pairs can be nested one in another): where ⊕ denotes the exclusive disjunction, and positions in contact are separated from each other by at least 2: Let distance threshold in tree δ = 4. Then a complete parse tree y generated by G is consistent with m only if for all m i,j = 1 derivation is performed with a string of production rules where

Parsing
Given an input sequence x of length n and a grammar G, prob ) by a slightly modified probabilistic Cocke-Kasami-Younger bottom-up chart parser [Cocke, 1969, Kasami, 1965, Younger, 1967].Indeed, productions in R a ⊎R b conforms to the Chomsky Normal Form [Chomsky, 1959], while it is easy to see that productions in R c requires , which effectively sums up probabilities of all possible parse trees Y x .In the first step, probabilities of assigning lexical non-terminals V T for each terminal in the sequence x are stored in the bottom matrix P 1 = P[1, :, :].Then, the table P is iteratively filled upwards with probabilities . New extended version of the algorithm (Fig. 1) computes prob(Y m x | G), i.e. it considers only parse trees Y m x which are consistent with m.To this goal it uses an additional table C of dimensions (m)/2×n×|V T |.After completing P 1 (lines 10-12), probabilities of assigning lexical non-terminals V T at positions involved in contacts are moved from P 1 to C (lines 13-21) such that each matrix C p = C[p, :, :] corresponds to p-th contact in m.In the subsequent steps C can only be used to complete productions in R c ; moreover both lexical non-terminals have to come either from P 1 or C, they can never be mixed (lines 35-40).The computational complexity of the extended algorithm is still O(n 3 ) as processing of productions in R c has to be multiplied by iterating over the number of contact pairs in m, which is O(n) since the cross-serial dependencies are not allowed.

Calculating
This section shows effective computing prob(U m n | G), which is denominator for the contrastive estimation of G CE(m) (cf.section 2.1.3).Given a sequence x of length n, a corresponding matrix m 01: function parse_cky_cm(x, m, Ra, Rb, Rc, Vt, Vn, v0) 02: # input: 03: # x -sequence, m -contact map 04: # Ra -lexical, Rb -branching, Rc -contact rules 05: # Vt -set of lexical, Vn -set of non Given grammar G, any complete derivation r is a composition r = ṙ • r, where ṙ ∈ (R a ) * and r ∈ (R b ∪ R c ) * .Let y be a parse tree corresponding to derivation r, and let ỹ be an incomplete parse tree corresponding to derivation r.Note that for any y corresponding to r = ṙ • r there exists one and only one ỹ corresponding to r.Let Ỹm x denote the set of such incomplete trees ỹ.Note that labels of the leaf nodes of ỹ are lexical non-terminals ∀(i) α i,i ∈ V T , and that ṙ represents the unique left-most derivation yield(ỹ) * ⇒ x.Thus, Note that value of the expression will not change if the second summation is over ỹ Combining with observation that prob(ỹ | G) does not depend on x, the expression can be therefore rewritten as: where a s ∈ Σ.Since G is proper then ∀(v ∈ V T ) |Σ| s=1 θ(v → a s ) = 1 and therefore the entire formula evaluates to 1, which can be easily shown by iterative regrouping.This leads to the final formula: Technically, ỹ∈ Ỹm n prob(ỹ | G) can be readily calculated by the bottom-up chart parser by setting

Learning
The present PCFG-CM approach was evaluated in practice for grammatical models G and Ḡ = G \ R c (the same grammar but without the contact rules) using an on-site framework for learning rule probabilities [Dyrka andNebel, 2009, Dyrka et al., 2013].Given an underlying CFG G, the framework estimates rule probabilities θ for the corresponding PCFG G = G, θ from the positive sample using a genetic algorithm in the Pittsburgh flavour, where each individual represents a whole grammar.Unlike previous applications of the framework in which probabilities of the lexical rules were fixed according to representative physicochemical properties of amino acids, in this research probabilities of all rules were subject to evolution.The objective functions were implemented for estimators GML , GCE(X) , and GCE(m) .Besides, the setup of the genetic algorithm closely followed that of [Dyrka and Nebel, 2009].
The input non-probabilistic grammar G consisted of an alphabet of twenty terminal symbols representing amino acid species grammar in [Dyrka and Nebel, 2009].The number of non-terminal symbols was limited to a few in order to keep reasonable the number of parameters to be optimized by the genetic algorithm.
Combinations of symbols in rules were not constrained beyond general definition of the model G in order to avoid interference with contact-map constraints, for the sake of transparent evaluation of the PCFG-CM.

Performance measures
Performance of grammars was evaluated using a variant of the k-fold Cross-Validation scheme in which k − 2 parts are used for training, 1 part is used for validation and parameter selection, and 1 part is used for the final testing and reporting results.Negative set was not used in the training phase.
In order to avoid composition bias, proteins in the test sample were scored against the null model (encoded as a unigram), which assumed global average frequencies of amino acids, no contact information, and the sequence length of the protein.The amino acid frequencies were obtained using the online ProtScale tool for the UniProtKB/Swiss-Prot database [Gasteiger et al., 2005]).
Discriminative performance Grammars were assessed on the basis of the average precision (AP) in the recall-precision curve (RPC).The advantage of RPC over the more common Receiver Operating Characteristic (ROC) is robustness to unbalanced samples where negative data is much more numerous than positive data [Davis and Goadrich, 2006].AP approximates the area under RPC.
Descriptive performance Intuitively, a decent explanatory grammar generates parse trees consistent with spatial structure of the protein.Perhaps the most straightforward approach to assess descriptive performance is to use the UST of the maximum likelihood parse tree as a predictor of spatial contacts between positions in sequence, parametrized by the cutoff δ on path length between the leaves.The natural threshold for grammar G, which is δ = 4 (the shortest distance between terminals generated by R b rules), was used for calculating the precision of contact prediction.In addition, AP of RPC, which sums up over all possible cutoffs, was computed to allow comparison with grammars without pairing rules.Eventually, the recall of the contact prediction at the threshold δ = 4 measured with regard to the partial contact map used in the training was used to assess the learning process.

Materials
Probabilistic grammars were estimated for three samples of protein fragments based on functionally relevant gapless motifs [Sigrist et al., 2002, Bailey andElkan, 1994].Within each sample, all sequences shared the same length, which avoided sequence length effects on grammar scores (which could be resolved by appropriate the null model).For each sample, one experimentally solved spatial structure in the Protein Data Bank (PDB) [Berman et al., 2000] was selected as a representative.
Three samples included amino acid sequence of two small ligand binding sites (already analysed in [Dyrka and Nebel, 2009]) and one functional amyloid HET-s (Table 3.1): • CaMn: a Calcium and Manganese binding site from the legume lectins [Sharon and Lis, 1990] collected according to the PROSITE PS00307 pattern [Sigrist et al., 2013] true positive and false negative hits.Boundaries of the motif were extended to cover the entire binding site, similarly to [Dyrka and Nebel, 2009].The motif folds into a stem-like structure with multiple contacts, many of them forming nested dependencies, which stabilize anti-parallel beta-sheet made of two ends of the motif (Fig. 2a based on pdb:2zbj [de Oliveira et al., 2008]); • NAP : the Nicotinamide Adenine dinucleotide Phosphate binding site fragment from an aldo/keto reductase family [Bohren et al., 1989] collected according to the PS00063 pattern true positive and false negative hits (four least consistent sequences were excluded).The motif is only a part of the binding site of the relatively large ligand.The intra-motif contacts seems to be insufficient for defining the fold, which depends also on interactions with amino acids outside the motif (Fig. 2b based on pdb:1mrq [Couture et al., 2003]); • HET-s: the HET-s-related motifs r1 and r2 involved in the prion-like signal transduction in fungi identified in a recent study [Daskalov et al., 2015].The largest subset of motifs with length of 21 amino acids was used to avoid length effects on grammar scores.When interacting with a related motif r0 from a cooperating protein, motifs r1 and r2 adopt beta-hairpin-like folds which stack together.While stacking of multiple motifs from several proteins is essential for stability of the structure, interactions between hydrophobic amino acids within single hairpin are also important.In addition, correlation analysis revealed strong dependency between positions 17 and 21 [Daskalov et al., 2015] (Fig. 2c based on [van Melckebeke et al., 2010]).Diversity of sequences ranged from the most homogenous CaMn to the most diverse HET-s, which consisted of 5 subfamilies [Daskalov et al., 2015].
Negative samples were designed to roughly approximate the entire space of protein sequences.
They were based on the negative set from [Dyrka and Nebel, 2009], which consisted of 829 single chain sequences of 300-500 residues retrieved from the Protein Data Bank [Berman et al., 2000] at identity of 30% (accessed on 12th December 2006).For each positive sample, the corresponding negative sample was obtained by cutting the basic negative set into subsequences of the length of the positive sequences.
All samples were made non-redundant at level of sequence similarity around 70%. Contact pairings were assigned manually and collectively to all sequences in the set based on a selected available spatial structure of a representative positive sequence in the PDB database (Fig. 2).

Performance
Probabilistic grammars with the contact rules G were learned through estimation of probabilities of rules θ for non-probabilistic CFG G using input samples made of sequences coupled with the contact map U m X , or using sequences alone U X .Probabilistic grammars without the contact rules Ḡ were learned using only the input samples made of sequences U X , since they cannot generate parse trees consistent with contact maps at the distance threshold δ = 4.Note that since there is the one-to-one correspondence between input sample set S = [XM] and sample of UST sets U M X , notations developed for the sets of USTs are used to denote the input samples.
>2kj3:A260-280 TTNSVETVVGKGESRVLIGNE >2zbj:A4-30 IVAVELDSYPNTDIGDPNYPHIGIDIK Figure 2: Schematic representation of structure of the sample motifs.Context-free-compatible contact pairings selected in this study are marked with dashes and slashes.Order of amino acids in sequence and its coordinates in protein are given below the structure.Notes: 1) in CaMn, only 4 out of 7 real hydrogen bond-related contacts in the stem-like part were included in the contact map for the sample for the sake of simplicity; 2) in HET-s, e.g. a pair V5 and I18 conforms to definition of contact, however it crosses another contact between L17 and E21.

Discriminative power
For evaluation of discriminative power of the PCFG-CM approach, rule probabilities were estimated using the maximum-likelihood estimator (denoted ML) and the contrastive estimator with regard to the contact map (denoted CE(m)).Discriminative performance of the resulting probabilistic grammars on U X and U m X is presented in Tab. 2 in terms of the average precision.The baseline here is the average precision of grammars estimated without contact constraints, Ḡm=0 ML and Gm=0 ML , tested on sequences alone U X , which ranged from 0.43-0.46for HET-s to 0.94-0.96for CaMn.The scores show negative correlation with diversity of the samples and limited effect of adding contact rules (though the latter may result from worse learning of increased number of parameters with added rules).Gm=0 ML performed much worse when tested on the samples with the contact map U m X , which indicates that preference for parses consistent with m is at best limited when training without constraints.
For all three samples, the highest AP (0.91-0.98) achieved grammars obtained using the contrastive estimation with regard to the contact map GCE(m) tested on the samples with the map U m X .The improvement relative to the baseline was most pronounced for HET-s, yet still statistically significant (p < 0.05) for NAP.As expected, the contrastively estimated grammars performed poorly on sequences alone U X except for the CaMn sample.
The maximum-likelihood grammars estimated with the contact information GML tested on U m X performed worse than the contrastively estimated grammars but comparably or significantly better (HET-s) than the baseline.The average precision of GML was consistently lower when tested without the map on sequences alone, yet still considerable (from 0.60 for HET-s to 0.95 for CaMn).It is notable that in the HET-s case, GML achieved better AP on U X than Gm=0 ML .Notably high AP for CaMn with Gm=0 ML tested on U m X and with GCE(m) tested on U X can be for modeling with the contact rules.

Descriptive power
For evaluation of descriptive power of the PCFG-CM approach, rule probabilities were estimated using the maximum-likelihood estimator (denoted ML) and the contrastive estimator with regard to sequences (denoted CE(X)).Descriptive value of the most probable parse trees generated using the resulting probabilistic grammars for test sequences without contact information U X is presented in Tab. 3. Efficiency of the learning was measured on the basis of the recall at δ = 4 with regard to the context-free compatible contact map used in the training.Consistency of the most likely parse tree with the protein structure was measured on the basis of the precision of the contact prediction at δ = 4 with regard to all contacts in the reference spatial structure with separation in sequence of at least 3.Both measures are not suitable for assessing grammars without contact rules Ḡ.Therefore, average precision over all thresholds δ was used as secondary measure for consistency of the most likely trees with the protein structure.Note that the AP scores achievable for a context-free parse tree are reduced by overlapping pairings.
The baseline here are the results for grammars with the contact rules estimated without contact constraints Gm=0 ML .The most likely parse trees generated using these grammars conveyed practically no information about contacts for NAP and HET-s (recall w.r.t m close to zero) and limited information about contacts for CaMn (moderate recall of 0.45).Increase of the recall to 0.79-0.98obtained for the most likely parse trees generated using grammars GML and GCE(X) testifies efficiency of the learning process with contact constraints.
Importantly, consistency of the most likely parse trees with the protein structure measured by the precision followed a similar pattern and increased from 0.13 for HET-s, 0.14 for NAP, and 0.69 for CaMn when learning on U X , to respectively 0.52-0.57,0.64, and 0.84-0.87,when learning on U m X .Accordingly, evaluation in terms of the average precision over distance thresholds indicated that distances in the most likely parse trees better reflected the protein structure if grammars were trained with the contact constraints on U m X .
4 Discussion and conclusions

Analysis of computational results
Computational validation of discriminatory power showed that additional knowledge present in the partial contact map can be effectively incorporated into the probabilistic grammatical framework through the concept of syntactic tree consistent with the contact map.The most effective way of training descriptors for a given sample was the contrastive estimation with reference to the contact map.This approach is only possible when a single contact map that fits all sequences in the target population can be used with the trained grammar.The maximum-likelihood estimators were effective when contacts were relevant to structure of the sequence (HET-s, CaMn).This is expected, as use of the contact rules is likely to be optimal for deriving a pair of amino acids in contact if they are actually correlated.Interestingly, in the case of HET-s, the maximum-likelihood grammar GML trained with the contact constraints compared favourably with the maximum-likelihood grammar ML trained on sequences alone even when tested on sequences alone (AP 0.60 versus 0.43).In other words, GML was more optimal with regard to the sample of sequences than Gm=0 ML .This indicates that if contacts are relevant for the structure of sequence, the PCFG-CM approach can improve robustness of learning to local optima.
Computational validation of descriptive power showed that the most likely parse trees, derived for inputs defined only by sequences, reproduced vast majority of contacts (recall of at least 0.79 at δ = 4) enforced by the contact-map constrained training input.Moreover, precision of contact prediction at δ = 4 and sequence separation 3+ was above 0.50, up to 0.87.This translated to the overall overlap with the full contact maps in range 0.27-0.39(not shown), since only a fraction of contacts can be represented in the parse tree of the context-free grammar, and even not all of them were enforced in the training or were optimally parsed with the contact rules.The benefit of contrastive estimation with reference to sequences was limited in comparison to maximum likelihood grammars.However, it should be noted that the shape of the most likely parse tree, which was used in the evaluation, does not necessarily reflects the most likely shape of parse tree.Unfortunately, the latter cannot be efficiently computed [Dowell and Eddy, 2004].
Reasonable performance of grammars estimated with contact constraints GML on sequences alone (AP from 0.60 to 0.95) is encouraging as it gives a hint of performance of the PCFG-CM approach in its potential most general application to model very diverse data sets where each training sequence is associated with a different contact map.In this case, contact maps cannot be used for recognizing unknown sequences.So far conclusive results for this kind of application are not yet available.

Limitations and perspectives
The computational experiments mainly served assessing intuitions, which led to development of the PCFG-CM approach.Full scale practical application to bioinformatic problems such as sequence search would certainly require several enhancements.For example, accurate accounting for various sequence length would likely require a more elaborated null model.Moreover, to increase the number of non-terminal symbols, the learning framework have to be improved.This includes more efficient estimation of probabilities of a large number of rules and/or added capability of inferring rules during learning [Unold, 2005, 2012, Coste et al., 2012, 2014].In addition, preliminary testing (not shown) suggests that scoring inputs with the product of probabilities using grammars with lexical rule probabilities fixed according to representative physicochemical properties of amino acids [Dyrka and Nebel, 2009], and the appropriately adjusted null model, has more discriminative power than the current approach.Extension of the PCFG-CM framework to account for uncertain contact information as in [Knudsen, 2005] seems to be straightforward through introducing the concept of fuzzy sets of syntactic trees.These application-related developments are left for future work.
Though tested in the learning setting consisting in optimizing only rule probabilities, the estimators defined in the present PCFG-CM framework can be used in more general learning schemes inferring also the grammar structure.Indeed, such schemes may even more benefit from constraining the larger search space.It is also interesting to consider extending the framework beyond context-free grammars as contacts in proteins are often overlapping and thus context-sensitive.In this case, however, the one-to-one correspondence between the parse tree and derivation breaks, therefore it may be advisable to redefine the grammatical counterpart of the spatial distance in terms of derivation steps in order to take advantage from higher expressiveness.

Contributions
The PCFG-CM framework was proposed by WD and elaborated by WD, FC and JT.Implementation of the special instance, computational experiments and analysis of results were carried out by WD.The paper was written by WD, FC and JT.M. Weigt, R. White, H. Szurmant, J. Hoch, and T. Hwa.Identification of direct residue contacts in protein-protein interaction by message passing.Proceedings of the National Academy of Sciences, 106:67-72, 2009.D. H. Younger.Recognition and parsing of context-free languages in time n3.Information and Control, 10(2):189 -208, 1967. ISSN 0019-9958.

Figure 1 :
Figure 1: Pseudocode of the modified CKY parser which consisted of all possible allowed combinations of symbols, hence |R a | = 60, |R b | = 196, |R c | = 144.The set of non-contact rules was identical to the standard

Table 2 :
Discriminative performance of grammars in terms of AP

Table 3 :
Descriptive quality of the most likely parse trees derived from sequences only, in terms of recall for δ = 4 w.r.t the known contact map m, and precision for δ = 4 (and AP over thresholds δ) w.r.t the full contact map of the reference pdb structure for sequence separation 3+.Note that lengths of the shortest paths between leaves in the most likely parse trees of grammars Ḡ equal 5, which makes measures based on δ = 4 unutile.