Fast computational mutation-response scanning of proteins

Studying the effect of perturbations on protein structure is a basic approach in protein research. Important problems, such as predicting pathological mutations and understanding patterns of structural evolution, have been addressed by computational simulations that model mutations using forces and predict the resulting deformations. In single mutation-response scanning simulations, a sensitivity matrix is obtained by averaging deformations over point mutations. In double mutation-response scanning simulations, a compensation matrix is obtained by minimizing deformations over pairs of mutations. These very useful simulation-based methods may be too slow to deal with large proteins, protein complexes, or large protein databases. To address this issue, I derived analytical closed formulas to calculate the sensitivity and compensation matrices directly, without simulations. Here, I present these derivations and show that the resulting analytical methods are much faster than their simulation counterparts.


INTRODUCTION
Protein function is fundamentally related to protein structure. For this reason, insight into protein function can be gained by studying the structural deformations caused by perturbations. This is at the basis of general experimental and theoretical approaches to study proteins. An experimental example is Deep Mutational Scanning, which allows studying the effects of large numbers of mutations (Fowler & Fields, 2014;Livesey & Marsh, 2020). Theoretically, various computational perturbation-response methods have been developed and used to study the effects of ligand binding and mutations (Yilmaz & Atilgan, 2000;Ikeguchi et al., 2005;Zheng & Brooks, 2005;Echave, 2008;Atilgan & Atilgan, 2009).
Ligand binding can be modelled using forces applied to the protein residues involved in binding (Ikeguchi et al., 2005;Atilgan & Atilgan, 2009). This has been used to study various interesting problems. The most straightforward is predicting the conformational change induced by the binding of a ligand, when the binding site is known (Ikeguchi et al., 2005;Atilgan & Atilgan, 2009;Tamura & Hayashi, 2015). A related application is the prediction of ligand-binding sites related to known or desired deformations (Atilgan et al., 2010;Jalalypour et al., 2020). Another important application is the identification of The native ensemble can be characterized by the native structure, r 0 =〈r〉 , and by the covariance matrix: C hðr À r 0 Þðr À r 0 Þ T i (1) where〈⋯〉is the average over conformations.
The covariance matrix is determined by the protein's energy landscape. For simplicity, in this work I use the energy function of the Anisotropic Network Model (ANM) (Atilgan et al., 2001). This model represents the protein as a network of amino acids connected by harmonic springs. Specifically, each residue is represented by a single node placed at its C a , and pairs of nodes that are within a cut-off distance R 0 are connected with springs of force-constant k. The ANM energy function is where r x is the position vector of node x, r 0 x its equilibrium position, k is the spring force constant, and the sum runs over all contacts ij.
The covariance matrix can be derived from Eq. (2). First, a second-order Taylor expansion of (2) leads to VðrÞ % 1 2 ðr À r 0 Þ T Kðr À r 0 Þ where K ¼ d 2 V=dr 2 ð Þ r¼r 0 is the Hessian matrix. Then, assuming a Boltzmann distribution of conformations ρ(r) = e −V(r)/k B T with V(r) given by (3), it follows that where k B is Boltzmann's constant, T the absolute temperature, and K −1 is the Hessian's pseudo-inverse (K is not invertible because it has 6 zero eigenvalues corresponding to rotations and translations). Given a protein of known native structure r 0 , and parameters R 0 and k, K is calculated differentiating (2), then C is obtained using (4).

Linear response approximation
The covariance matrix determines the conformational shift that results from applying a force to one or more protein atoms. An arbitrary force can be represented by a vector f with one component for each of the coordinates that represent the protein's conformation. For small f, the structural response can be calculated using the Linear Response Approximation (LRA) (Ikeguchi et al., 2005;Echave, 2008): Equation (5) allows the prediction of the effect of any given force f with the sole knowledge of C.

Mutation-response scanning
The aim of Mutation-Response Scanning (MRS) is to analyse how protein structure responds to point mutations. In the methods that I consider here, given a protein, mutations are modelled using forces, the resulting structural responses are calculated using the Linear Response Approximation, and these responses are averaged over mutations to calculate a sensitivity matrix S that quantifies the mutation-response patterns.

Mutations as forces
Point mutations can be modelled by forcing the contacts of the mutated site (Echave, 2008). Let j be the site to mutate, C(j) be the set of contacts of j, and jl the contact between j and l. Then, a mutation is modelled by applying a force where f(jl) is the force applied to contact jl. Let f(jl) be a scalar and e jl a unit vector directed from j to l. Then, f(jl) consists of a force f(jl)e jl applied to l, plus a reaction force −f(jl)e jl applied to j, and no force applied to other sites. A random mutation at site j is modelled by picking independent random numbers f(jl) and building f(jl) and f(j) (Eq. (6)). Following previous work (Echave, 2008;Echave & Fernández, 2010;Marcos & Echave, 2020), I use f ðjlÞ $ Nð0; r 2 Þ Thus, the contact forces are picked from independent identical normal distributions.

Sensitivity matrix, S
What is the effect on a site i of mutating a site j? Consider a random mutation at site j, represented by a force f(j). Then, from (5), the structural deformation due to this mutation is given by Δr 0 (j) can be written: where Δr 0 i (j) is the 3 × 1 column vector that contains the change in Cartesian coordinates of site i caused by mutation f(j) applied to site j. Therefore, the magnitude of the effect of the mutation on the structure of site i may be quantified by the Euclidean norm ||Δr 0 i (j) 2 ||. The sensitivity matrix S is the matrix with elements where i is the response site, j the mutated site, and〈⋯〉stands for averaging over mutations. S ij represents the structural response of site i averaged over mutations at site j. Mutation-response scanning is the calculation of the sensitivity matrix S defined by 10.

Simulation-based mutation-response scanning
The sensitivity matrix S can be obtained using the simulation-based Mutation-Response Scanning method, sMRS. Given a protein's pdb file, this numerical method proceeds as follows.
1. Set parameters. Set parameters k and R 0 of the ANM model, parameter σ used to generate forces (Eq. (7)), and a desired number of mutations to apply to each site, M.
2. Calculate the covariance matrix. Read protein coordinates from the pdb file, for all pairs of sites calculate C a − C a distances, compare them with R 0 to define contacts, then calculate the elastic network's matrix K using (2) and (3). Finally, invert this matrix to calculate C using (4).

Calculate mutational deformations.
For each mutational force f(j,µ), calculate the resulting response Δr 0 i (j, µ). 5. Calculate the sensitivity matrix. Average the deformations Δr 0 i (j, µ) over mutations µ to obtain element S ij of the sensitivity matrix S, according to (10).

Analytical formula for the sensitivity matrix
In this section, I derive an analytical formula that allows the direct calculation of the sensitivity matrix, S, without performing simulations.
The first step is to consider the deformation caused by forcing a single contact. Let f(jl) be a force applied along contact jl, composed by a force f(jl)e jl applied to l and a reaction force −f(jl)e jl applied to j. Replacing f(jl) into (5) and using (9), leads to where Δr 0 i (jl) is the structural shift of site i caused by f(jl) and C xy is the 3 × 3 block of C corresponding to the covariance between sites x and y.
Second, the deformation resulting from mutating a site is the sum of the deformations caused by forcing its contacts. From (6), (8), and (9), it follows that where Δr 0 i (j) is the shift of i due to mutating j and the sum runs over all contacts of j. Replacing (11) into (12), leads to Finally, an analytical formula for the direct calculation of the sensitivity matrix may be derived. Replacing (13) into (10), leads to Dr i ðjkÞ T Dr i ðjlÞ where〈⋯〉stands for averaging over mutations at j. Since f(jl) ∼ N(0, σ 2 ) are independent identically distributed random variables ("Mutations as forces"), it follows that where δ xy is the Kronecker delta, which is 1 for x = y and 0 otherwise. Therefore, replacing (15) into (14), leads to This equation allows the calculation of the sensitivity matrix.

Analytical mutation-response scanning
The analytical Mutation-Response Scanning method, aMRS calculates the sensitivity matrix S using the analytical formula (16). Given a protein's pdb file, this method proceeds as follows.
1. Set parameters. Set the parameters k and R 0 of the ANM model, and the parameter σ that defines the distribution of forces (Eq. (7)).
2. Calculate the covariance matrix. Read protein coordinates from the pdb file, for all pairs of sites calculate C a − C a distances, compare them with R 0 to define contacts, then calculate the elastic network's matrix K using (2) and (3). Finally, invert this matrix to calculate C using (4).
3. Calculate the sensitivity matrix. Calculate the elements S ij of the sensitivity matrix S using (16).

Double mutation-response scanning
The aim of Double Mutation-Response Scanning (DMRS) is to analyse how protein structure responds to pairs of point mutations. Just as for the MRS methods described above, the DMRS methods that I consider in this section model mutations using forces and calculate structural responses using the Linear Response Approximation. These responses are used to calculate a compensation matrix D that quantifies the degree of structural compensation between pairs of mutations.

Compensation matrix
In this subsection, I define the compensation matrix that DMRS aims to calculate. Let Δr 0 (iµ) be the structural response to a mutation µ at site i, and Δr 0 (jν) be the structural response to a mutation ν at j. The deformation due to introducing both mutations is given by (17) and the magnitude of this deformation is given by The first two terms are positive, but the third term may be positive or negative. When the third term is negative, the mutations will compensate each other. Given a first mutation iµ, the maximum compensation due to a second mutation at j is obtained when Δr 0 (iµ) T Δr 0 (jν) is minimum. Therefore, the degree of compensation may be quantified by min m Dr 0 ðilÞ T Dr 0 ðjmÞ. For mutations modelled as forces, this is equal to minus the maximum, because if a force maximizes the dot-product, the opposite force, which is as likely, minimizes it. Therefore, to keeps things positive, it is convenient to define the compensating power of j by max m ½Dr 0 ðilÞ T Dr 0 ðjmÞ 2 . With the help of this equation, I define a compensation matrix, D, with elements D ij given by where〈⋯〉 µ is the average over µ. D ij is a positive number that quantifies the degree to which mutating j can compensate the structural effect of mutating i.

Forces for double mutation-response scanning
The choice of forces used to model mutations in "Mutations as forces" is not appropriate for calculating the compensation matrix because the maximum involved is ill defined. The value of Δr 0 (iµ) T Δr 0 (jν) is proportional to the lengths of force vectors f(iµ) and f(jµ). Defined as described in "Mutations as forces", the lengths of these vectors may become arbitrarily large, making the maximum in (19) infinite. To fix this, I apply the additional constraint where σ 2 is the parameter used to define contact forces (see Eq. (7)) and CN(x) is the number of contacts of site x. In practice, this is achieved by picking the forces as before, then renormalizing them. The norm of these forces is finite and the maximum of (19) is well defined.

Simulation-based double mutation-response scanning
The compensation matrix may be obtained using the method simulation-based Double Mutation-Response Scanning, sDMRS, which proceeds as follows.
1. Set parameters. Set parameters k and R 0 of the ANM model, parameter σ used to generate forces (Eq. (7)), and a desired number of mutations to apply to each site, M.

2.
Calculate the covariance matrix. Read protein coordinates from the pdb file, for all pairs of sites calculate C a − C a distances, compare them with R 0 to define contacts, then calculate the elastic network's matrix K using (2) and (3). Finally, invert this matrix to calculate C using (4).

Analytical formula for the compensation matrix
In this section, I derive an analytical formula that allows the direct calculation of the compensation matrix, D, without performing simulations. The first step is to consider the overlap between two deformations, Δr 0 (i) T Δr 0 (j). Consider two mutations, at sites i and j, represented by forces f(i) and f(j), respectively. From (6) and (8), it follows that where Δr 0 (x) is the protein's deformation due to mutating site x, C x is the 3 N × 3 block of C with the 3 columns corresponding to site x, and f(xy) is the scalar force applied to contact xy. From (21), the overlap between two deformations is given by For simplicity of notation, it is convenient to rewrite this equation in matrix form: where f(i) is a column vector whose elements are the At this point it is easy to derive a formula for the compensation matrix. The maximum of Dr 0 ðiÞ T Dr 0 ðjÞ h i 2 , subject to the constraint f(j) 2 = σ 2 CN(j) (Eq. (20)) can be shown to be max Dr 0 ðiÞ T Dr 0 ðjÞ Then, replacing (25) into (19), and using (15), leads to: where Tr is the trace operator. This equation allows the calculation of the compensation matrix.

Analytical double mutation-response scanning
The analytical Double Mutation-Response Scanning method, aDMRS, calculates the compensation matrix D using the analytical formula (26). Given a protein's pdb file, this method proceeds as follows.
1. Set parameters. Set the parameters k and R 0 of the ANM model, and the parameter σ that defines the distribution of forces (Eq. (7)).
2. Calculate the covariance matrix. Read protein coordinates from the pdb file, for all pairs of sites calculate C a − C a distances, compare them with R 0 to define contacts, then calculate the elastic network's matrix K using (2) and (3). Finally, invert this matrix to calculate C using (4).
3. Calculate the compensation matrix. Calculate the elements D ij of the compensation matrix D using (26).

Implementation
In the present work, sMRS (Simulation-based Mutation-Response Scanning), aMRS (Analytical Mutation-Response Scanning), sDMRS (Simulation-based Double Mutation-Response Scanning), and aDMRS (Analytical Double Mutation-Response Scanning) were implemented using the R language. As much as possible, the code was optimised by using the linear algebra functions of the BLAS and LAPACK packages. For implementation details see available code.

Parameters
The parameter values used in the present paper are R 0 = 12.5 Å, k = 1/Å 2 , and σ = 0.3/Å. With the chosen R 0 value, previous work found good agreement between predicted and empirical structural deformations Marcos & Echave (2020). Regarding k, energy units are arbitrarily chosen so that k = 1/Å 2 . The precise values of k and σ do not affect the present results because they have a mere scaling effect on the sensitivity matrix and the compensation matrix (It can easily be proved that both matrices are proportional to r 2 k 2 ).
Dataset Table 1 summarises the dataset used to assess the methods developed in this work. The structure files for the calculations were obtained from the Protein Data Bank for d2l8ma and d2acya, and from the Homstrad database for the other proteins (Stebbings & Mizuguchi, 2004). I use the 8 Homstrad proteins because mutation-response simulations were tested against empirical data for these proteins in a recent study (Marcos & Echave, 2020). I added the other two proteins, with which I am familiar from other studies, to complete the dataset: d2acya to have a second representative of the alpha & beta SCOP structural class and 2l8ma to add a large protein to the dataset.

Mutation-response scanning
This section assesses the analytic Mutation-Response Scanning method (aMRS) by comparison with the simulation-based Mutation-Response Scanning method (sMRS). These methods were described in detail in "Methods". Briefly, for a given protein, an sMRS simulation consists in subjecting each of the protein sites j to M mutations, calculating the resulting structural deformation of each site i, and averaging these deformations over mutations to obtain the elements S ij of a sensitivity matrix S (see "Simulation-based Mutation-Response Scanning"). The analytical method, aMRS, calculates S using the closed analytical expression Eq. (16), avoiding the need of simulations (see "Analytical Mutation-Response Scanning"). Methods are compared on the proteins of Table 1.
sMRS converges rapidly towards aMRS I compare aMRS with sMRS for the proteins of Table 1. The point of this work is to assess whether the analytical method is faster than the simulation method. However, since the calculations performed with the simulation method depend on the number of mutations per site, M, before addressing computational cost, I consider the convergence of sMRS calculations.
Theoretically, sMRS and aMRS are equivalent ways of calculating the sensitivity matrix S. Specifically, in the limit of an infinite number of mutations per site, M ! 1, the sMRS S should converge towards the aMRS S. To study this convergence, Fig. 1, compares simulated and analytical matrices for the example case of Phospholipase A2 (SCOP id d1jiaa) (Similar figures for the other proteins studied can be found in Supplemental_info.pdf). For the d1jiaa example, sMRS converges rapidly towards the aMRS matrix as M increases (Fig. 1C), so that the sMRS matrix calculated with M = 200 is very similar to the aMRS matrix ( Fig. 1A and Fig. 1B).
For the other proteins the results are similar. Thus, for all cases the sMRS matrix converges rapidly towards the aMRS matrix (see grey lines of Fig. 1C). For M = 200, the correlation coefficient between sMRS and aMRS matrices is 1.00 for all proteins (Table 2). Thus, the sMRS sensitivity matrix converges rapidly with increasing M, so that with M = O(10 2 ) it is very similar to the aMRS matrix.
To further assess convergence, I consider sMRS and aMRS profiles. Site-dependent profiles are obtained by averaging the sensitivity matrix over rows or columns. Averaging over rows leads to an influence profile, with elements S j 1=N P N i S ij that measure the average influence of mutating j. Averaging over columns leads to a sensitivity profile, with elements S i 1=N P N j S ij that measure the sensitivity of site i with respect to mutations elsewhere. Figure 2 compares sMRS and aMRS profiles for Phospholipase A2 (d1jiaa) (Similar figures for the other proteins studied can be found in Supplemental_info.pdf). Comparing influence profiles, we see that sMRS with M = 200 and aMRS profiles are very similar ( Fig. 2A and Fig. 2B) and that sMRS influence profiles converge rapidly towards the corresponding aMRS profiles as M increases (Fig. 2C). Similarly, the sensitivity profile estimated by sMRS with M = 200 is also very similar to its aMRS counterpart (Figs. 2D and  (Murzin et al., 1995), and protein length N. 2E) and the sMRS profile converges rapidly towards the aMRS profile as M increases (Fig. 2F). Similar results are found for the other proteins studied. The convergence of influence profiles (grey lines of Fig. 2C) is somewhat slower than that of sensitivity profiles (grey lines of Fig. 2F), but in both cases there is good convergence. For M = 200, Pearson's correlation between sMRS and aMRS influence profiles is in the range 0.97 ≤ R ≤ 1.00 and the correlation between sensitivity profiles is 1.00 for all proteins (Table 2). In summary, sMRS influence and sensitivity profiles converge rapidly, so that with M = O(10 2 ) they are very similar to their aMRS counterparts.

aMRS is much faster than sMRS
The purpose of this paper is to develop a faster mutation-response scanning method. To see whether aMRS is indeed faster than sMRS, Fig. 3 compares their computational cost. An sMRS calculation using a typical number of M = 200 mutations per site is much slower than an aMRS calculation (Fig. 3A). The computational cost, as measured by CPU time, scales with protein length as N 1.5 for both sMRS and aMRS. As a result, t sMRS increases linearly with t aMRS with a slope that is the speedup of aMRS vs. sMRS; For the M = 200 case, this speedup is t sDMRS /t aDMRS ≈ 126 (Fig. 3B). Further, the speedup increases linearly with M: t sMRS /t aMRS ∝ M (Fig. 3C). Thus, the analytical method provides a speedup of the order of the number of mutations per site, which is typically in the hundreds. In a word, aMRS is much faster than sMRS.

Double mutation-response scanning
This section assesses the analytical Double Mutation-Response Scanning method (aDMRS) by comparison with the simulation-based Double Mutation-Response Scanning method (sDMRS). These methods are alternative ways of calculating a compensation matrix D. This matrix is composed by elements D ij that measure the degree to which mutating site j may compensate the structural deformation due to a first mutation at site i (Eq. (19)). The simulation method, sDMRS, obtains this matrix numerically scanning over pairs of simulated mutations (see "Simulation-based Double Mutation-Response Scanning"). The analytical method, aDMRS, calculates the compensation values using a closed formula (Eq. (26)), avoiding the use of simulations (see "Analytical Double Mutation-Response Scanning").
sDMRS converges slowly towards aDMRS I compare aDMRS with aDMRS for the proteins of Table 1. As in "Mutation-Response Scanning", before addressing computational cost, I consider the convergence of the simulation method with increasing M.
In principle, the simulation and analytical methods are equivalent. The compensation matrix D calculated with sDMRS with M ! 1 will be identical to the aDMRS matrix. However, in practice the sDMRS matrix depends on M.  (Fig. 4A). More quantitatively, a scatter plot of sDMRS vs. aDMRS matrix elements shows good correlation, but there is a visible scattering of points around the linear fit (Fig. 4C). The similarity between sDMRS and aDMRS matrices can be measured by the correlation coefficient, which in this case is R = 0.95. Figure 4C shows that as M increases, the sDMRS matrix converges rapidly at first towards the aDMRS matrix, but convergence slows down with further increases of M. Thus, for Phospholipase A2, sDMRS with O(10 2 ) mutations per site produces a compensation matrix that is in good agreement with, but not identical to, the aDMRS matrix. A similar situation is found for the other proteins of the dataset. Convergence quickly slows down as M increases (see grey lines of Fig. 4C ). For M = 200, the correlation between sDMRS and aDMRS matrices falls within the range 0.87 ≤ R ≤ 0.97 (Table 3). Thus, the sDMRS compensation matrix converges slowly towards the aDMRS matrix, so that for M = O(10 2 ) the simulated matrix is in moderate to good agreement with the analytical   matrix. The degree of convergence is not clearly related to protein properties such as structural class or protein size, thus convergence should be tested whenever the simulation method is used. I further assess convergence by considering site-dependent compensation profiles. Averaging D over rows, I obtain a D j profile that measures the average compensation power of sites j. Averaging over columns, I obtain a D i profile that measures how likely to be compensated mutations at i are. Figure 5 compares sDMRS and aDMRS profiles for Phospholipase A2 (d1jiaa). The M = 200 sDMRS profiles are visually similar to aDMRS profiles ( Fig. 5A and Fig. 5D). The similarity is not very high, however: points are quite scattered around the linear fit in sDMRS vs. aDMRS plots (Fig. 5B and Fig. 5E). The convergence of sDMRS profiles towards their aDMRS counterparts is very slow (Fig. 5C and Fig. 5F).
Similar results are found for the other proteins studied. Profiles generally improve very slowly with increasing M (see grey lines of Fig. 5C and Fig. 5F). For M = 200, the (F) convergence of the sDMRS D i profile towards the aDMRS profile. In C and F, the d1jiaa case is shown with black lines and points, and the other 9 proteins studied are shown with grey lines. Profiles were calculated with normalised matrices (matrix average is 1), they are in logarithmic scale, and R is the Pearson correlation coefficient between the log-transformed sDMRS and aDMRS profiles. Full-size  DOI: 10.7717/peerj.11330/fig-5 correlation coefficient between sDMRS and aDMRS D i profiles falls in the range 0.55 ≤ R ≤ 0.97 and between D j profiles it falls in the range 0.69 ≤ R ≤ 0.99 (Table 3). In summary, sDMRS profiles converge very slowly with increasing M, so that for M = O(10 2 ), they are often poorly converged. In addition, There are no obvious determinants of convergence: R is not clearly determined by either protein size or structural class. Therefore, whenever the simulation method is used, convergence should be tested.

aDMRS is much faster than sDMRS
To see whether aDMRS is faster than aDMRS, Fig. 6 compares their computational cost. sDMRS with M = 200 mutations per site is much slower than aDMRS (Fig. 6A). The computational cost, as measured by CPU time, scales with protein length as N 3 for both sDMRS and aDMRS. As a result, t sDMRS increases linearly with t aDMRS with a slope that is the speedup of aDMRS vs. sDMRS. For the M = 200 case, t sDMRS /t aDMRS ≈ 137 (Fig. 6B). The speedup increases non-linearly with M (Fig. 6C). This dependence can be understood from the sDMRS procedure schematised in "Simulation-based Double Mutation-Response Scanning". The cost of generating the mutations (steps 3 and 4) increases linearly with M, while performing the average and maximization needed to calculate the compensation matrix (steps 5) scales as M 2 . Therefore, for large M the analytical method provides a speedup of O(M 2 ), making aDMRS much faster than sDMRS.

DISCUSSION
I have derived, implemented, and assessed two mutation-response scanning methods, aMRS and aDMRS, which are analytical alternatives to the simulation methods sMRS and sDMRS, respectively. All methods were implemented using R with optimized BLAS and LAPACK libraries. None of the methods posed major implementation difficulties. The methods were assessed on a dataset of 10 proteins of varying lengths. First, I consider the convergence of simulation methods. In the limit if infinite mutations per site (M), simulation and analytical methods should give the same results. In practice, the degree of convergence of the simulation methods depends on M. sMRS converges rapidly towards aMRS, so that with a typical M = O(10 2 ) the sDMRS sensitivity matrix and its marginal profiles are almost identical to those calculated with aMRS (Fig. 1C, Fig. 2C, Fig. 2F, Table 2). On the other hand, sDMRS converges slowly, so that even with M = O(10 2 ) sDMRS convergence is not guaranteed (Fig. 4C, Fig. 5C, Fig. 5F, Table 3). sDMRS converges more slowly than sMRS because it is more difficult to find extreme values (calculation of the compensation matrix involves maximization over pairs of mutations) than averages (sensitivity matrix elements are averages over mutations). In general, when using simulation-based methods convergence should always be assessed. In contrast, since the analytical methods do not depend on M, there is no need to study convergence, and possible convergence issues are altogether avoided.
Beyond convergence, since the purpose of this work was to develop faster methods, the key finding is that the analytical methods are much faster than the simulation methods. For a typical case of M = 200 mutations per site, aMRS is 126 × faster than sMRS and aDMRS is 137 × faster than sDMRS. While the computational cost of sMRS is relatively modest and increases rather slowly in proportion to N 1.5 M, sDMRS is much more computationally expensive and its cost rises steeply in proportion to N 3 M 2 . The speedup of analytical methods is of O(M) for single-mutation scans and O(M 2 ) for double-mutation scans. This speedup may be most important for large proteins. For instance, for the 405-siteslong Cytochrome P450, an sMRS calculation takes 3 CPU min vs. 1.5 s of the alternative aMRS calculation (Table 2). On the other hand, an sDMRS calculation takes 3.6 h vs. 1 min of the alternative aDMRS calculation (Table 3). Therefore, there is a large speedup for both single and double mutation-response scans, that may be most useful for the later case.
To further compare the mutation-response scanning methods considered here, I discuss some of their main limitations. All methods are based on the Linear Response Approximation formula Δr 0 = Cf. Therefore, the main limitations are the validity of LRA, the quality of C, and how well mutations can be modelled by the force f. Regarding the first limitation, LRA will be valid if both perturbations (f) and their responses (Δr 0 ) are small. Thus LRA should be valid for most mutations, failing only in the rare cases in which specific mutations induce very large conformational changes. Second, calculating C with a simple elastic network model, as done here, might impose additional limitations. However, this could be alleviated by calculating C using more sophisticated means, such as MD simulations, if necessary. More fundamentally, the main limitation is the very assumption that C characterizes the conformational ensemble, which will be the case for proteins with a single native structure, but may fail for proteins that have two or more stable conformations. The final limitation depends on whether mutations can be adequately modelled using forces (f). While it is possible that this fails for the prediction of specific mutations, mutations-as-forces models have been proved successful in many previous studies that depend on summary statistics such averages or maxima (Echave, 2008;Echave & Fernández, 2010;Tiberti et al., 2018;Marcos & Echave, 2020). For the present work, it should be noted that the limitations mentioned are common to the simulation methods and their analytical alternatives. The analytical approach adds no limitation to the list.
Given that limitations exist, it is worthwhile to discuss why this work has not validated the methods by comparison with empirical data. The main reason is that the aim of this work is not to develop mutation-response methods in better agreement with experiment, but to develop faster methods. This is why the assessment was performed by comparing between simulation and analytical approaches, rather than validating such approaches against empirical data. Validating mutation-response scanning itself is beyond the scope of this work. A second reason is that taking the validity of mutation-response scanning as a given is reasonable. For 8 of the proteins of Table 1, the mutation-response model of the present paper has been recently validated by comparison with empirical structural sensitivity profiles Marcos & Echave (2020). More generally, the validity of perturbation-response methods follows from their extensive successful use in a variety of applications for at least 15 years, as mentioned in "Introduction".
The main conclusion of this work is that the analytical methods should be chosen over the simulation methods because they are faster and, in addition, they have no convergence issues. Therefore, the analytical methods should be useful for a wide range of potential applications, such as predicting evolutionary divergence of protein structures (Echave & Fernández, 2010;Marcos & Echave, 2020), detecting and interpreting pathological mutations (Nevin Gerek, Kumar & Banu Ozkan, 2013;Raimondi et al., 2018;Verkhivker, 2019), and detecting compensating mutations and rescue sites (Tiberti et al., 2018). The speedup afforded by the analytical methods would be especially helpful for treating otherwise intractable large proteins, protein complexes, and large protein databases.
To finish, I mention two possible lines of further development. A first line is to derive analytical expressions for the deformations caused by external forces applied to single sites, as in Perturbation-Response Scanning (PRS) (Atilgan & Atilgan, 2009;General et al., 2014) and Double Force Scanning (DFS) (Tiberti et al., 2018). This will be useful for applications related to ligand-binding induced deformations (Atilgan et al., 2010;General et al., 2014). Beyond deformations, a second line of development is to derive analytical alternatives to simulation-based methods that calculate effects of mutations on protein motions (Hamacher, 2008;Zheng & Tekpinar, 2009;Zheng & Thirumalai, 2009;