Fast computational mutation-response scanning of proteins
- Published
- Accepted
- Received
- Academic Editor
- Joseph Gillespie
- Subject Areas
- Bioinformatics, Biophysics, Computational Biology, Molecular Biology
- Keywords
- Protein, Mutational response, Compensatory mutations
- Copyright
- © 2021 Echave
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
- Cite this article
- 2021. Fast computational mutation-response scanning of proteins. PeerJ 9:e11330 https://doi.org/10.7717/peerj.11330
Abstract
Studying the effect of perturbations on protein structure is a basic approach in protein research. Important problems, such as predicting pathological mutations and understanding patterns of structural evolution, have been addressed by computational simulations that model mutations using forces and predict the resulting deformations. In single mutation-response scanning simulations, a sensitivity matrix is obtained by averaging deformations over point mutations. In double mutation-response scanning simulations, a compensation matrix is obtained by minimizing deformations over pairs of mutations. These very useful simulation-based methods may be too slow to deal with large proteins, protein complexes, or large protein databases. To address this issue, I derived analytical closed formulas to calculate the sensitivity and compensation matrices directly, without simulations. Here, I present these derivations and show that the resulting analytical methods are much faster than their simulation counterparts.
Introduction
Protein function is fundamentally related to protein structure. For this reason, insight into protein function can be gained by studying the structural deformations caused by perturbations. This is at the basis of general experimental and theoretical approaches to study proteins. An experimental example is Deep Mutational Scanning, which allows studying the effects of large numbers of mutations (Fowler & Fields, 2014; Livesey & Marsh, 2020). Theoretically, various computational perturbation-response methods have been developed and used to study the effects of ligand binding and mutations (Yilmaz & Atilgan, 2000; Ikeguchi et al., 2005; Zheng & Brooks, 2005; Echave, 2008; Atilgan & Atilgan, 2009).
Ligand binding can be modelled using forces applied to the protein residues involved in binding (Ikeguchi et al., 2005; Atilgan & Atilgan, 2009). This has been used to study various interesting problems. The most straightforward is predicting the conformational change induced by the binding of a ligand, when the binding site is known (Ikeguchi et al., 2005; Atilgan & Atilgan, 2009; Tamura & Hayashi, 2015). A related application is the prediction of ligand-binding sites related to known or desired deformations (Atilgan et al., 2010; Jalalypour et al., 2020). Another important application is the identification of allosteric sites and allosteric communication networks (General et al., 2014; Alfayate et al., 2019; Lake et al., 2020).
Mutations can also be modelled as forces and predicting the resulting responses Echave (2008). Mutation-response computations have been used for various problems. One example is the analysis and prediction of pathological mutations (Nevin Gerek, Kumar & Banu Ozkan, 2013; Tiberti et al., 2018). Another major application is the study of patterns of protein evolutionary divergence (Echave, 2008; Echave & Fernández, 2010; Nevin Gerek, Kumar & Banu Ozkan, 2013; Marcos & Echave, 2020).
In this paper, I focus on mutation-response methods. I consider two cases, mutation-reponse scanning and double mutation-response scanning. In mutation-response scanning, protein sites are scanned over, for each site many random mutations (modelled as forces) are introduced, the resulting deformations are calculated, and deformations are averaged over to obtain a sensitivity matrix, S (Echave, 2008; General et al., 2014) (Element Sij of S measures the mean structural deformation of site i due to mutations at site j.) In double mutation-response scanning, pairs of sites are scanned over, random mutations are introduced, the resulting deformations are calculated, and the minimum deformations are used to calculate a compensation matrix, D (Tiberti et al., 2018) (Element Dij of D measures the degree to which mutating site i can be compensated by mutating site j.) Because they are based on averaging and maximizing over several simulated mutations, I will call the previous methods simulation-based Mutation-Response Scanning (sMRS) and simulation-based Double Mutation-Response Scanning (sDMRS).
The previous simulation-based methods are not very computationally costly for small to medium proteins. However, the computational cost of sMRS and sDMRS simulations increases with increasing protein size. Therefore, calculations may become prohibitive for very large systems (e.g., supra-molecular complexes, like a ribosome or a virus capsid) or large sets of proteins (e.g., scanning the whole human proteome to detect potential pathological mutations). To alleviate this problem, faster methods are needed.
The purpose of the present paper is to present faster alternatives to sMRS and sDMRS. This article presents two analytical methods, aMRS and aDMRS, that allow, respectively, the calculation of S and D using closed-formed analytical formulas, without performing simulations. In the following sections, I describe the simulation methods, derive the analytical alternatives, and assess the analytical methods by comparison with their simulation-based counterparts.
Methods
In the following sections, I derive the formalism of Mutation Response Scanning (MRS) and Double Mutation Response Scanning (DMRS).
Covariance matrix
At finite temperature the protein fluctuates, sampling an ensemble of conformations. Let a specific backbone conformation be specified by the position vector r = (x1, y1, z1, … xN, yN, zN)T, where (xi, yi, zi) are the Cartesian coordinates of the alpha carbon (Cα) of site i, N is the number of sites, and super-index T denotes matrix or vector transposition. The native ensemble can be characterized by the native structure, r0 =〈r〉, and by the covariance matrix:
(1) where〈⋯〉is the average over conformations.
The covariance matrix is determined by the protein’s energy landscape. For simplicity, in this work I use the energy function of the Anisotropic Network Model (ANM) (Atilgan et al., 2001). This model represents the protein as a network of amino acids connected by harmonic springs. Specifically, each residue is represented by a single node placed at its Cα, and pairs of nodes that are within a cut-off distance R0 are connected with springs of force-constant k. The ANM energy function is
(2) where rx is the position vector of node x, r0x its equilibrium position, k is the spring force constant, and the sum runs over all contacts ij.
The covariance matrix can be derived from Eq. (2). First, a second-order Taylor expansion of (2) leads to
(3) where is the Hessian matrix. Then, assuming a Boltzmann distribution of conformations ρ(r) = e−V(r)/kBT with V(r) given by (3), it follows that
(4) where kB is Boltzmann’s constant, T the absolute temperature, and K−1 is the Hessian’s pseudo-inverse (K is not invertible because it has 6 zero eigenvalues corresponding to rotations and translations). Given a protein of known native structure r0, and parameters R0 and k, K is calculated differentiating (2), then C is obtained using (4).
Linear response approximation
The covariance matrix determines the conformational shift that results from applying a force to one or more protein atoms. An arbitrary force can be represented by a vector f with one component for each of the coordinates that represent the protein’s conformation. For small f, the structural response can be calculated using the Linear Response Approximation (LRA) (Ikeguchi et al., 2005; Echave, 2008):
(5)
Equation (5) allows the prediction of the effect of any given force f with the sole knowledge of C.
Mutation-response scanning
The aim of Mutation-Response Scanning (MRS) is to analyse how protein structure responds to point mutations. In the methods that I consider here, given a protein, mutations are modelled using forces, the resulting structural responses are calculated using the Linear Response Approximation, and these responses are averaged over mutations to calculate a sensitivity matrix S that quantifies the mutation-response patterns.
Mutations as forces
Point mutations can be modelled by forcing the contacts of the mutated site (Echave, 2008). Let j be the site to mutate, C(j) be the set of contacts of j, and jl the contact between j and l. Then, a mutation is modelled by applying a force
(6) where f(jl) is the force applied to contact jl. Let f(jl) be a scalar and ejl a unit vector directed from j to l. Then, f(jl) consists of a force f(jl)ejl applied to l, plus a reaction force −f(jl)ejl applied to j, and no force applied to other sites.
A random mutation at site j is modelled by picking independent random numbers f(jl) and building f(jl) and f(j) (Eq. (6)). Following previous work (Echave, 2008; Echave & Fernández, 2010; Marcos & Echave, 2020), I use
(7) Thus, the contact forces are picked from independent identical normal distributions.
Sensitivity matrix, S
What is the effect on a site i of mutating a site j? Consider a random mutation at site j, represented by a force f(j). Then, from (5), the structural deformation due to this mutation is given by
(8)
Δr0(j) can be written:
(9) where Δr0i(j) is the 3 × 1 column vector that contains the change in Cartesian coordinates of site i caused by mutation f(j) applied to site j. Therefore, the magnitude of the effect of the mutation on the structure of site i may be quantified by the Euclidean norm ||Δr0i(j)2||.
The sensitivity matrix S is the matrix with elements
(10) where i is the response site, j the mutated site, and〈⋯〉stands for averaging over mutations. Sij represents the structural response of site i averaged over mutations at site j. Mutation-response scanning is the calculation of the sensitivity matrix S defined by 10.
Simulation-based mutation-response scanning
The sensitivity matrix S can be obtained using the simulation-based Mutation-Response Scanning method, sMRS. Given a protein’s pdb file, this numerical method proceeds as follows.
-
Set parameters. Set parameters k and R0 of the ANM model, parameter σ used to generate forces (Eq. (7)), and a desired number of mutations to apply to each site, M.
-
Calculate the covariance matrix. Read protein coordinates from the pdb file, for all pairs of sites calculate Cα − Cα distances, compare them with R0 to define contacts, then calculate the elastic network’s matrix K using (2) and (3). Finally, invert this matrix to calculate C using (4).
-
Generate mutational forces. For each site j, generate μ = 1 ⋯ M mutational force vectors f(j, μ) using (6) and (7).
-
Calculate mutational deformations. For each mutational force f(j,μ), calculate the resulting response Δr0i(j, μ).
-
Calculate the sensitivity matrix. Average the deformations Δr0i(j, μ) over mutations μ to obtain element Sij of the sensitivity matrix S, according to (10).
Analytical formula for the sensitivity matrix
In this section, I derive an analytical formula that allows the direct calculation of the sensitivity matrix, S, without performing simulations.
The first step is to consider the deformation caused by forcing a single contact. Let f(jl) be a force applied along contact jl, composed by a force f(jl)ejl applied to l and a reaction force −f(jl)ejl applied to j. Replacing f(jl) into (5) and using (9), leads to
(11) where Δr0i(jl) is the structural shift of site i caused by f(jl) and Cxy is the 3 × 3 block of C corresponding to the covariance between sites x and y.
Second, the deformation resulting from mutating a site is the sum of the deformations caused by forcing its contacts. From (6), (8), and (9), it follows that
(12) where Δr0i(j) is the shift of i due to mutating j and the sum runs over all contacts of j. Replacing (11) into (12), leads to
(13)
Finally, an analytical formula for the direct calculation of the sensitivity matrix may be derived. Replacing (13) into (10), leads to
(14) where〈⋯〉stands for averaging over mutations at j. Since f(jl) ∼ N(0, σ2) are independent identically distributed random variables (“Mutations as forces”), it follows that
(15) where δxy is the Kronecker delta, which is 1 for x = y and 0 otherwise. Therefore, replacing (15) into (14), leads to
(16) This equation allows the calculation of the sensitivity matrix.
Analytical mutation-response scanning
The analytical Mutation-Response Scanning method, aMRS calculates the sensitivity matrix S using the analytical formula (16). Given a protein’s pdb file, this method proceeds as follows.
-
Set parameters. Set the parameters k and R0 of the ANM model, and the parameter σ that defines the distribution of forces (Eq. (7)).
-
Calculate the covariance matrix. Read protein coordinates from the pdb file, for all pairs of sites calculate Cα − Cα distances, compare them with R0 to define contacts, then calculate the elastic network’s matrix K using (2) and (3). Finally, invert this matrix to calculate C using (4).
-
Calculate the sensitivity matrix. Calculate the elements Sij of the sensitivity matrix S using (16).
Double mutation-response scanning
The aim of Double Mutation-Response Scanning (DMRS) is to analyse how protein structure responds to pairs of point mutations. Just as for the MRS methods described above, the DMRS methods that I consider in this section model mutations using forces and calculate structural responses using the Linear Response Approximation. These responses are used to calculate a compensation matrix D that quantifies the degree of structural compensation between pairs of mutations.
Compensation matrix
In this subsection, I define the compensation matrix that DMRS aims to calculate. Let Δr0(iμ) be the structural response to a mutation μ at site i, and Δr0(jν) be the structural response to a mutation ν at j. The deformation due to introducing both mutations is given by
(17) and the magnitude of this deformation is given by
(18)
The first two terms are positive, but the third term may be positive or negative. When the third term is negative, the mutations will compensate each other. Given a first mutation iμ, the maximum compensation due to a second mutation at j is obtained when Δr0(iμ)TΔr0(jν) is minimum. Therefore, the degree of compensation may be quantified by . For mutations modelled as forces, this is equal to minus the maximum, because if a force maximizes the dot-product, the opposite force, which is as likely, minimizes it. Therefore, to keeps things positive, it is convenient to define the compensating power of j by . With the help of this equation, I define a compensation matrix, D, with elements Dij given by
(19) where〈⋯〉μ is the average over μ. Dij is a positive number that quantifies the degree to which mutating j can compensate the structural effect of mutating i.
Forces for double mutation-response scanning
The choice of forces used to model mutations in “Mutations as forces” is not appropriate for calculating the compensation matrix because the maximum involved is ill defined. The value of Δr0(iμ)TΔr0(jν) is proportional to the lengths of force vectors f(iμ) and f(jμ). Defined as described in “Mutations as forces”, the lengths of these vectors may become arbitrarily large, making the maximum in (19) infinite. To fix this, I apply the additional constraint
(20) where σ2 is the parameter used to define contact forces (see Eq. (7)) and CN(x) is the number of contacts of site x. In practice, this is achieved by picking the forces as before, then renormalizing them. The norm of these forces is finite and the maximum of (19) is well defined.
Simulation-based double mutation-response scanning
The compensation matrix may be obtained using the method simulation-based Double Mutation-Response Scanning, sDMRS, which proceeds as follows.
-
Set parameters. Set parameters k and R0 of the ANM model, parameter σ used to generate forces (Eq. (7)), and a desired number of mutations to apply to each site, M.
-
Calculate the covariance matrix. Read protein coordinates from the pdb file, for all pairs of sites calculate Cα − Cα distances, compare them with R0 to define contacts, then calculate the elastic network’s matrix K using (2) and (3). Finally, invert this matrix to calculate C using (4).
-
Generate mutational forces. For each site i, generate μ = 1 ⋯ M mutational force vectors f(iμ) using (6), (7), and (20).
-
Calculate mutational deformations. For each mutational force f(iμ), calculate the resulting response Δr0(iμ).
-
Calculate the compensation matrix. For each pair (iμ,jν), calculate Δr0(iμ)TΔr0(jν), maximize over ν, and average over μ to obtain the elements of the compensation matrix D, according to (19).
Analytical formula for the compensation matrix
In this section, I derive an analytical formula that allows the direct calculation of the compensation matrix, D, without performing simulations.
The first step is to consider the overlap between two deformations, Δr0(i)TΔr0(j). Consider two mutations, at sites i and j, represented by forces f(i) and f(j), respectively. From (6) and (8), it follows that
(21) where Δr0(x) is the protein’s deformation due to mutating site x, Cx is the 3 N × 3 block of C with the 3 columns corresponding to site x, and f(xy) is the scalar force applied to contact xy. From (21), the overlap between two deformations is given by
(22)
For simplicity of notation, it is convenient to rewrite this equation in matrix form:
(23) where f(i) is a column vector whose elements are the CN(i) contact forces f(ik), f(j) is the column vector with CN(j) elements f(jl), and Aij is a matrix of size CN(i) × CN(j) with elements
(24)
At this point it is easy to derive a formula for the compensation matrix. The maximum of , subject to the constraint f(j)2 = σ2 CN(j) (Eq. (20)) can be shown to be (25)
Then, replacing (25) into (19), and using (15), leads to:
(26) where Tr is the trace operator. This equation allows the calculation of the compensation matrix.
Analytical double mutation-response scanning
The analytical Double Mutation-Response Scanning method, aDMRS, calculates the compensation matrix D using the analytical formula (26). Given a protein’s pdb file, this method proceeds as follows.
-
Set parameters. Set the parameters k and R0 of the ANM model, and the parameter σ that defines the distribution of forces (Eq. (7)).
-
Calculate the covariance matrix. Read protein coordinates from the pdb file, for all pairs of sites calculate Cα − Cα distances, compare them with R0 to define contacts, then calculate the elastic network’s matrix K using (2) and (3). Finally, invert this matrix to calculate C using (4).
-
Calculate the compensation matrix. Calculate the elements Dij of the compensation matrix D using (26).
Implementation
In the present work, sMRS (Simulation-based Mutation-Response Scanning), aMRS (Analytical Mutation-Response Scanning), sDMRS (Simulation-based Double Mutation-Response Scanning), and aDMRS (Analytical Double Mutation-Response Scanning) were implemented using the R language. As much as possible, the code was optimised by using the linear algebra functions of the BLAS and LAPACK packages. For implementation details see available code.
Parameters
The parameter values used in the present paper are R0 = 12.5 Å, k = 1/Å 2, and σ = 0.3/Å. With the chosen R0 value, previous work found good agreement between predicted and empirical structural deformations Marcos & Echave (2020). Regarding k, energy units are arbitrarily chosen so that k = 1/Å2. The precise values of k and σ do not affect the present results because they have a mere scaling effect on the sensitivity matrix and the compensation matrix (It can easily be proved that both matrices are proportional to ).
Dataset
Table 1 summarises the dataset used to assess the methods developed in this work. The structure files for the calculations were obtained from the Protein Data Bank for d2l8ma and d2acya, and from the Homstrad database for the other proteins (Stebbings & Mizuguchi, 2004). I use the 8 Homstrad proteins because mutation-response simulations were tested against empirical data for these proteins in a recent study (Marcos & Echave, 2020). I added the other two proteins, with which I am familiar from other studies, to complete the dataset: d2acya to have a second representative of the alpha & beta SCOP structural class and 2l8ma to add a large protein to the dataset.
domain | family | class | N |
---|---|---|---|
d1lcka1 | SH3 domain | All beta | 54 |
d1ntxa | Snake venom toxins | Small | 60 |
d1fxla2 | Canonical RNA-binding domain | Alfa & beta | 82 |
d1bxva | Plastocianine/Azurin-like | All beta | 91 |
d2acya | Acyl-phosphatase-like | Alpha & beta | 98 |
d1jiaa | Vertebrate Phospholipase A2 | All alpha | 122 |
d1hmta | Fatty acid binding protein-like | All beta | 131 |
d1a4fb | Globines | All alpha | 146 |
d1mcta | Eukaryotic proteases | All beta | 223 |
d2l8ma | Cytochrome P450 | All alpha | 405 |
Note:
Columns show, in order, protein domain id, family, and structural class according to the SCOP classification (Murzin et al., 1995), and protein length N.
Results
Mutation-response scanning
This section assesses the analytic Mutation-Response Scanning method (aMRS) by comparison with the simulation-based Mutation-Response Scanning method (sMRS). These methods were described in detail in “Methods”. Briefly, for a given protein, an sMRS simulation consists in subjecting each of the protein sites j to M mutations, calculating the resulting structural deformation of each site i, and averaging these deformations over mutations to obtain the elements Sij of a sensitivity matrix S (see “Simulation-based Mutation-Response Scanning”). The analytical method, aMRS, calculates S using the closed analytical expression Eq. (16), avoiding the need of simulations (see “Analytical Mutation-Response Scanning”). Methods are compared on the proteins of Table 1.
sMRS converges rapidly towards aMRS
I compare aMRS with sMRS for the proteins of Table 1. The point of this work is to assess whether the analytical method is faster than the simulation method. However, since the calculations performed with the simulation method depend on the number of mutations per site, M, before addressing computational cost, I consider the convergence of sMRS calculations.
Theoretically, sMRS and aMRS are equivalent ways of calculating the sensitivity matrix S. Specifically, in the limit of an infinite number of mutations per site, , the sMRS S should converge towards the aMRS S. To study this convergence, Fig. 1, compares simulated and analytical matrices for the example case of Phospholipase A2 (SCOP id d1jiaa) (Similar figures for the other proteins studied can be found in Supplemental_info.pdf). For the d1jiaa example, sMRS converges rapidly towards the aMRS matrix as M increases (Fig. 1C), so that the sMRS matrix calculated with M = 200 is very similar to the aMRS matrix (Fig. 1A and Fig. 1B).
For the other proteins the results are similar. Thus, for all cases the sMRS matrix converges rapidly towards the aMRS matrix (see grey lines of Fig. 1C). For M = 200, the correlation coefficient between sMRS and aMRS matrices is 1.00 for all proteins (Table 2). Thus, the sMRS sensitivity matrix converges rapidly with increasing M, so that with M = O(102) it is very similar to the aMRS matrix.
protein | N | tsMRS | taMRS | R | Ri | Rj |
---|---|---|---|---|---|---|
d1lcka1 | 54 | 6.26 | 0.03 | 1.00 | 1.00 | 0.99 |
d1ntxa | 60 | 7.18 | 0.05 | 1.00 | 1.00 | 1.00 |
d1fxla2 | 82 | 11.22 | 0.07 | 1.00 | 1.00 | 1.00 |
d1bxva | 91 | 12.44 | 0.07 | 1.00 | 1.00 | 0.97 |
d2acya | 98 | 18.08 | 0.07 | 1.00 | 1.00 | 0.98 |
d1jiaa | 122 | 18.77 | 0.12 | 1.00 | 1.00 | 0.99 |
d1hmta | 131 | 21.16 | 0.11 | 1.00 | 1.00 | 0.99 |
d1a4fb | 146 | 26.02 | 0.18 | 1.00 | 1.00 | 0.99 |
d1mcta | 223 | 54.08 | 0.37 | 1.00 | 1.00 | 0.99 |
d2l8ma | 405 | 180.91 | 1.47 | 1.00 | 1.00 | 0.99 |
Note:
N: protein length; tsMRS: CPU time of sMRS in seconds; taMRS: CPU time of aMRS in seconds. Convergence measures at M = 200 mutations per site: R: correlation coefficient between sMRS and aMRS sensitivity matrices; Ri: correlation between sensitivity profiles; Rj: correlation between influence profiles.
To further assess convergence, I consider sMRS and aMRS profiles. Site-dependent profiles are obtained by averaging the sensitivity matrix over rows or columns. Averaging over rows leads to an influence profile, with elements that measure the average influence of mutating j. Averaging over columns leads to a sensitivity profile, with elements that measure the sensitivity of site i with respect to mutations elsewhere.
Figure 2 compares sMRS and aMRS profiles for Phospholipase A2 (d1jiaa) (Similar figures for the other proteins studied can be found in Supplemental_info.pdf). Comparing influence profiles, we see that sMRS with M = 200 and aMRS profiles are very similar (Fig. 2A and Fig. 2B) and that sMRS influence profiles converge rapidly towards the corresponding aMRS profiles as M increases (Fig. 2C). Similarly, the sensitivity profile estimated by sMRS with M = 200 is also very similar to its aMRS counterpart (Figs. 2D and 2E) and the sMRS profile converges rapidly towards the aMRS profile as M increases (Fig. 2F).
Similar results are found for the other proteins studied. The convergence of influence profiles (grey lines of Fig. 2C) is somewhat slower than that of sensitivity profiles (grey lines of Fig. 2F), but in both cases there is good convergence. For M = 200, Pearson’s correlation between sMRS and aMRS influence profiles is in the range 0.97 ≤ R ≤ 1.00 and the correlation between sensitivity profiles is 1.00 for all proteins (Table 2). In summary, sMRS influence and sensitivity profiles converge rapidly, so that with M = O(102) they are very similar to their aMRS counterparts.
aMRS is much faster than sMRS
The purpose of this paper is to develop a faster mutation-response scanning method. To see whether aMRS is indeed faster than sMRS, Fig. 3 compares their computational cost. An sMRS calculation using a typical number of M = 200 mutations per site is much slower than an aMRS calculation (Fig. 3A). The computational cost, as measured by CPU time, scales with protein length as N1.5 for both sMRS and aMRS. As a result, tsMRS increases linearly with taMRS with a slope that is the speedup of aMRS vs. sMRS; For the M = 200 case, this speedup is tsDMRS/taDMRS ≈ 126 (Fig. 3B). Further, the speedup increases linearly with M: tsMRS/taMRS ∝ M (Fig. 3C). Thus, the analytical method provides a speedup of the order of the number of mutations per site, which is typically in the hundreds. In a word, aMRS is much faster than sMRS.
Double mutation-response scanning
This section assesses the analytical Double Mutation-Response Scanning method (aDMRS) by comparison with the simulation-based Double Mutation-Response Scanning method (sDMRS). These methods are alternative ways of calculating a compensation matrix D. This matrix is composed by elements Dij that measure the degree to which mutating site j may compensate the structural deformation due to a first mutation at site i (Eq. (19)). The simulation method, sDMRS, obtains this matrix numerically scanning over pairs of simulated mutations (see “Simulation-based Double Mutation-Response Scanning”). The analytical method, aDMRS, calculates the compensation values using a closed formula (Eq. (26)), avoiding the use of simulations (see “Analytical Double Mutation-Response Scanning”).
sDMRS converges slowly towards aDMRS
I compare aDMRS with aDMRS for the proteins of Table 1. As in “Mutation-Response Scanning”, before addressing computational cost, I consider the convergence of the simulation method with increasing M.
In principle, the simulation and analytical methods are equivalent. The compensation matrix D calculated with sDMRS with will be identical to the aDMRS matrix. However, in practice the sDMRS matrix depends on M. Figure 4 compares simulated and analytical compensation matrices for the example case of Phospholipase A2 (SCOP id d1jiaa) (Similar figures for the other proteins studied can be found in Supplemental_info.pdf). First, note that the compensation matrix obtained with sDMRS with M = 200 looks similar to the aDMRS matrix (Fig. 4A). More quantitatively, a scatter plot of sDMRS vs. aDMRS matrix elements shows good correlation, but there is a visible scattering of points around the linear fit (Fig. 4C). The similarity between sDMRS and aDMRS matrices can be measured by the correlation coefficient, which in this case is R = 0.95. Figure 4C shows that as M increases, the sDMRS matrix converges rapidly at first towards the aDMRS matrix, but convergence slows down with further increases of M. Thus, for Phospholipase A2, sDMRS with O(102) mutations per site produces a compensation matrix that is in good agreement with, but not identical to, the aDMRS matrix.
A similar situation is found for the other proteins of the dataset. Convergence quickly slows down as M increases (see grey lines of Fig. 4C ). For M = 200, the correlation between sDMRS and aDMRS matrices falls within the range 0.87 ≤ R ≤ 0.97 (Table 3). Thus, the sDMRS compensation matrix converges slowly towards the aDMRS matrix, so that for M = O(102) the simulated matrix is in moderate to good agreement with the analytical matrix. The degree of convergence is not clearly related to protein properties such as structural class or protein size, thus convergence should be tested whenever the simulation method is used.
protein | N | tsDMRS | taDMRS | R | Ri | Rj |
---|---|---|---|---|---|---|
d1lcka1 | 54 | 22.24 | 0.21 | 0.87 | 0.76 | 0.78 |
d1ntxa | 60 | 27.70 | 0.17 | 0.97 | 0.97 | 0.99 |
d1fxla2 | 82 | 59.58 | 0.43 | 0.93 | 0.89 | 0.94 |
d1bxva | 91 | 77.46 | 0.60 | 0.92 | 0.55 | 0.69 |
d2acya | 98 | 116.38 | 0.76 | 0.90 | 0.56 | 0.71 |
d1jiaa | 122 | 167.89 | 1.24 | 0.95 | 0.85 | 0.94 |
d1hmta | 131 | 203.65 | 1.30 | 0.97 | 0.77 | 0.93 |
d1a4fb | 146 | 274.29 | 2.13 | 0.92 | 0.70 | 0.83 |
d1mcta | 223 | 1,034.36 | 11.95 | 0.92 | 0.64 | 0.74 |
d2l8ma | 405 | 12,995.91 | 56.53 | 0.92 | 0.61 | 0.77 |
Note:
N: protein length; tsDMRS: CPU time of sDMRS in seconds; taDMRS: CPU time of aDMRS in seconds. Convergence measures at M = 200 mutations per site: R: correlation coefficient between sDMRS and aDMRS compensation matrices D; Ri: correlation between Di profiles; Rj: correlation between Dj profiles.
I further assess convergence by considering site-dependent compensation profiles. Averaging D over rows, I obtain a Dj profile that measures the average compensation power of sites j. Averaging over columns, I obtain a Di profile that measures how likely to be compensated mutations at i are. Figure 5 compares sDMRS and aDMRS profiles for Phospholipase A2 (d1jiaa). The M = 200 sDMRS profiles are visually similar to aDMRS profiles (Fig. 5A and Fig. 5D). The similarity is not very high, however: points are quite scattered around the linear fit in sDMRS vs. aDMRS plots (Fig. 5B and Fig. 5E). The convergence of sDMRS profiles towards their aDMRS counterparts is very slow (Fig. 5C and Fig. 5F).
Similar results are found for the other proteins studied. Profiles generally improve very slowly with increasing M (see grey lines of Fig. 5C and Fig. 5F). For M = 200, the correlation coefficient between sDMRS and aDMRS Di profiles falls in the range 0.55 ≤ R ≤ 0.97 and between Dj profiles it falls in the range 0.69 ≤ R ≤ 0.99 (Table 3). In summary, sDMRS profiles converge very slowly with increasing M, so that for M = O(102), they are often poorly converged. In addition, There are no obvious determinants of convergence: R is not clearly determined by either protein size or structural class. Therefore, whenever the simulation method is used, convergence should be tested.
aDMRS is much faster than sDMRS
To see whether aDMRS is faster than aDMRS, Fig. 6 compares their computational cost. sDMRS with M = 200 mutations per site is much slower than aDMRS (Fig. 6A). The computational cost, as measured by CPU time, scales with protein length as N3 for both sDMRS and aDMRS. As a result, tsDMRS increases linearly with taDMRS with a slope that is the speedup of aDMRS vs. sDMRS. For the M = 200 case, tsDMRS/taDMRS ≈ 137 (Fig. 6B). The speedup increases non-linearly with M (Fig. 6C). This dependence can be understood from the sDMRS procedure schematised in “Simulation-based Double Mutation-Response Scanning”. The cost of generating the mutations (steps 3 and 4) increases linearly with M, while performing the average and maximization needed to calculate the compensation matrix (steps 5) scales as M2. Therefore, for large M the analytical method provides a speedup of O(M2), making aDMRS much faster than sDMRS.
Discussion
I have derived, implemented, and assessed two mutation-response scanning methods, aMRS and aDMRS, which are analytical alternatives to the simulation methods sMRS and sDMRS, respectively. All methods were implemented using R with optimized BLAS and LAPACK libraries. None of the methods posed major implementation difficulties.
The methods were assessed on a dataset of 10 proteins of varying lengths. First, I consider the convergence of simulation methods. In the limit if infinite mutations per site (M), simulation and analytical methods should give the same results. In practice, the degree of convergence of the simulation methods depends on M. sMRS converges rapidly towards aMRS, so that with a typical M = O(102) the sDMRS sensitivity matrix and its marginal profiles are almost identical to those calculated with aMRS (Fig. 1C, Fig. 2C, Fig. 2F, Table 2). On the other hand, sDMRS converges slowly, so that even with M = O(102) sDMRS convergence is not guaranteed (Fig. 4C, Fig. 5C, Fig. 5F, Table 3). sDMRS converges more slowly than sMRS because it is more difficult to find extreme values (calculation of the compensation matrix involves maximization over pairs of mutations) than averages (sensitivity matrix elements are averages over mutations). In general, when using simulation-based methods convergence should always be assessed. In contrast, since the analytical methods do not depend on M, there is no need to study convergence, and possible convergence issues are altogether avoided.
Beyond convergence, since the purpose of this work was to develop faster methods, the key finding is that the analytical methods are much faster than the simulation methods. For a typical case of M = 200 mutations per site, aMRS is 126 × faster than sMRS and aDMRS is 137 × faster than sDMRS. While the computational cost of sMRS is relatively modest and increases rather slowly in proportion to N1.5 M, sDMRS is much more computationally expensive and its cost rises steeply in proportion to N3M2. The speedup of analytical methods is of O(M) for single-mutation scans and O(M2) for double-mutation scans. This speedup may be most important for large proteins. For instance, for the 405-sites-long Cytochrome P450, an sMRS calculation takes 3 CPU min vs. 1.5 s of the alternative aMRS calculation (Table 2). On the other hand, an sDMRS calculation takes 3.6 h vs. 1 min of the alternative aDMRS calculation (Table 3). Therefore, there is a large speedup for both single and double mutation-response scans, that may be most useful for the later case.
To further compare the mutation-response scanning methods considered here, I discuss some of their main limitations. All methods are based on the Linear Response Approximation formula Δr0 = Cf. Therefore, the main limitations are the validity of LRA, the quality of C, and how well mutations can be modelled by the force f. Regarding the first limitation, LRA will be valid if both perturbations (f) and their responses (Δr0) are small. Thus LRA should be valid for most mutations, failing only in the rare cases in which specific mutations induce very large conformational changes. Second, calculating C with a simple elastic network model, as done here, might impose additional limitations. However, this could be alleviated by calculating C using more sophisticated means, such as MD simulations, if necessary. More fundamentally, the main limitation is the very assumption that C characterizes the conformational ensemble, which will be the case for proteins with a single native structure, but may fail for proteins that have two or more stable conformations. The final limitation depends on whether mutations can be adequately modelled using forces (f). While it is possible that this fails for the prediction of specific mutations, mutations-as-forces models have been proved successful in many previous studies that depend on summary statistics such averages or maxima (Echave, 2008; Echave & Fernández, 2010; Tiberti et al., 2018; Marcos & Echave, 2020). For the present work, it should be noted that the limitations mentioned are common to the simulation methods and their analytical alternatives. The analytical approach adds no limitation to the list.
Given that limitations exist, it is worthwhile to discuss why this work has not validated the methods by comparison with empirical data. The main reason is that the aim of this work is not to develop mutation-response methods in better agreement with experiment, but to develop faster methods. This is why the assessment was performed by comparing between simulation and analytical approaches, rather than validating such approaches against empirical data. Validating mutation-response scanning itself is beyond the scope of this work. A second reason is that taking the validity of mutation-response scanning as a given is reasonable. For 8 of the proteins of Table 1, the mutation-response model of the present paper has been recently validated by comparison with empirical structural sensitivity profiles Marcos & Echave (2020). More generally, the validity of perturbation-response methods follows from their extensive successful use in a variety of applications for at least 15 years, as mentioned in “Introduction”.
The main conclusion of this work is that the analytical methods should be chosen over the simulation methods because they are faster and, in addition, they have no convergence issues. Therefore, the analytical methods should be useful for a wide range of potential applications, such as predicting evolutionary divergence of protein structures (Echave & Fernández, 2010; Marcos & Echave, 2020), detecting and interpreting pathological mutations (Nevin Gerek, Kumar & Banu Ozkan, 2013; Raimondi et al., 2018; Verkhivker, 2019), and detecting compensating mutations and rescue sites (Tiberti et al., 2018). The speedup afforded by the analytical methods would be especially helpful for treating otherwise intractable large proteins, protein complexes, and large protein databases.
To finish, I mention two possible lines of further development. A first line is to derive analytical expressions for the deformations caused by external forces applied to single sites, as in Perturbation-Response Scanning (PRS) (Atilgan & Atilgan, 2009; General et al., 2014) and Double Force Scanning (DFS) (Tiberti et al., 2018). This will be useful for applications related to ligand-binding induced deformations (Atilgan et al., 2010; General et al., 2014). Beyond deformations, a second line of development is to derive analytical alternatives to simulation-based methods that calculate effects of mutations on protein motions (Hamacher, 2008; Zheng & Tekpinar, 2009; Zheng & Thirumalai, 2009; Echave, 2012; Hamacher, 2008). This would be important for studies of the role of protein dynamics in function and evolution (Echave, 2012; Micheletti, 2013; Ponzoni & Bahar, 2018; Zhang et al., 2019; Zhang & Su, 2019; Wingert et al., 2021).