Bayesian inference of protein structure from chemical shift data
- Published
- Accepted
- Received
- Academic Editor
- Ugo Bastolla
- Subject Areas
- Biochemistry, Bioinformatics, Computational Biology, Computational Science
- Keywords
- Markov chain Monte Carlo, NMR, Probabilistic models, Protein structure, Chemical shifts
- Copyright
- © 2015 Bratholm et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
- Cite this article
- 2015. Bayesian inference of protein structure from chemical shift data. PeerJ 3:e861 https://doi.org/10.7717/peerj.861
Abstract
Protein chemical shifts are routinely used to augment molecular mechanics force fields in protein structure simulations, with weights of the chemical shift restraints determined empirically. These weights, however, might not be an optimal descriptor of a given protein structure and predictive model, and a bias is introduced which might result in incorrect structures. In the inferential structure determination framework, both the unknown structure and the disagreement between experimental and back-calculated data are formulated as a joint probability distribution, thus utilizing the full information content of the data. Here, we present the formulation of such a probability distribution where the error in chemical shift prediction is described by either a Gaussian or Cauchy distribution. The methodology is demonstrated and compared to a set of empirically weighted potentials through Markov chain Monte Carlo simulations of three small proteins (ENHD, Protein G and the SMN Tudor Domain) using the PROFASI force field and the chemical shift predictor CamShift. Using a clustering-criterion for identifying the best structure, together with the addition of a solvent exposure scoring term, the simulations suggests that sampling both the structure and the uncertainties in chemical shift prediction leads more accurate structures compared to conventional methods using empirical determined weights. The Cauchy distribution, using either sampled uncertainties or predetermined weights, did, however, result in overall better convergence to the native fold, suggesting that both types of distribution might be useful in different aspects of the protein structure prediction.
Introduction
Protein structures can today routinely be simulated by methods such as molecular dynamics or Monte Carlo simulations, using molecular mechanics force fields (Shaw et al., 2010; Karplus & McCammon, 2002; Snow et al., 2002). However, this is not always a feasible method to determine a protein structure by itself. To elucidate the native protein structure efficiently, the force field energy can be augmented by restraints obtained from experiments. This immediately raises the question, how can this be done rigorously and efficiently? One pragmatic approach to this problem is to define a hybrid energy using a penalty function, which describes the agreement between experimental data and data calculated from a proposed protein structure, together with a physical energy (such as from a molecular mechanics force field) (Jack & Levitt, 1978). An optimal structure in this approach could then be determined for example by minimizing the hybrid energy function (1)${E}_{\mathrm{hybrid}}={w}_{\mathrm{data}}\phantom{\rule{0ex}{0ex}}{E}_{\mathrm{data}}+{E}_{\mathrm{physical}}.$ This approach, however, does not uniquely define neither the nature nor weight of E_{data}, and the resulting protein structure will depend on the choices of these.
Chemical shifts have been combined with physical energies in a multitude of ways, e.g., using weighted RMSD values or various types of harmonic constraints. Vendruscolo and co-workers implemented a ‘square-well soft harmonic potential’, with corresponding gradients, and were able to run a chemical shifts biased MD simulation where they successfully refined slightly denatured protein structures to a C_{α}-RMSD of down to 0.84 Å from the corresponding crystal structures (Robustelli et al., 2010). The groups of Bax and Baker added the chi-square agreement between SPARTA (Shen & Bax, 2007) predicted chemical shift values and experimental chemical shifts with an empirical weight of 0.25 to the ROSETTA all-atom energy (Shen et al., 2008; Rohl et al., 2004). The ProCS method (Christensen et al., 2013) uses an approach similar to that of Bax and Baker, but with empirical weights inferred from a number of quantum mechanical calculations on representative protein models. The CHESHIRE approach (Cavalli et al., 2007) utilizes the experimental chemical shifts to predict secondary structure and backbone dihedral angles. These in turn are used to score molecular fragments from a database of known structures together with the chi-square agreement between the measured chemical shifts and the chemical shifts of the fragment in the database. Additionally, the final refinement phase includes a combination of physical energy terms and a term describing the correlation between experimental and back-calculated chemical shifts. A different approach was used by Meiler & Baker (2003), where the contribution of the experimental chemical shifts were set relative to 1 or 0 depending on whether or not the difference to the PROSHIFT prediction (Meiler, 2003) exceeded a maximum tolerance. The reasoning for not using a quadratic potential was that the experimental NMR data was automatically assigned and a quadratic potential is more sensitive to assignment errors.
Clearly information from chemical shifts can be incorporated in a multitude of ways with parameters, shape and weights often tweaked by hand or estimated empirically. The inferential structure determination (ISD) principles introduced by Rieping, Habeck & Nilges (2005) defines a Bayesian formulation of Eq. (1), which has previously been used to determine protein structures based on NOE (Habeck, Rieping & Nilges, 2006; Olsson et al., 2011) and RDC restraints (Habeck, Nilges & Rieping, 2008). In the following section the equations of an ISD approach for combining the knowledge of experimental chemical shifts with a physical energy are presented.
Theory
In the ISD approach we seek the probability distribution of the structure X and a set of uncertainties θ, correlating experimental and predicted chemical shifts, given a set of experimentally measured chemical shifts d, i.e., the probability $p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)$. Using Bayes’ theorem, this probability can be written as (2)$p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)=\frac{p\left(\mathbf{d}\mid \mathbf{X},\mathbf{\theta}\right)p\left(\mathbf{X},\mathbf{\theta}\right)}{p\left(\mathbf{d}\right)}.$ $p\left(\mathbf{d}\right)$ merely serves as a normalization constant, which we need not evaluate.
We’re making the basic assumption, that the deviation between predicted and experimental chemical shifts, given as (3)$\Delta {\delta}_{i}={\delta}_{\mathbf{X},i}-{\delta}_{\mathrm{exp},i}$ approximately follows some distribution with a variance uniquely defined by the type of nuclei (C_{α}, C_{β} etc.). The relevant equations for a Gaussian distribution and a Cauchy distribution (a Student’s t-distribution with one degree of freedom), respectively, are presented in the next sections.
Gaussian distribution
According to the principle of maximum entropy (Jaynes, 1957), the least informative probability distribution is the one having maximal information entropy, which given a specified mean and variance is the Gaussian distribution (Cover & Thomas, 2012). Assuming that each measured experimental chemical shift δ_{exp,i} is conditional independent given the structure, the likelihood $p\left(\mathbf{d}|\mathbf{X},\mathbf{\theta}\right)$ is obtained as the product of the individual probabilities of all measured chemical shifts. With i iterating over all n_{j} measured chemical shifts of nuclei type j, this takes the form of: (4)$p\left(\mathbf{d}\mid \mathbf{X},\mathbf{\theta}\right)=\prod _{j}\prod _{i=1}^{{n}_{j}}p\left({\delta}_{\mathrm{exp},ij}\mid {\delta}_{\mathbf{X},ij},{\sigma}_{j}\right)=\prod _{j}\prod _{i=1}^{{n}_{j}}\frac{1}{{\sigma}_{j}\sqrt{2\pi}}exp\left(-\frac{\Delta \underset{ij}{\overset{2}{\delta}}}{2{\sigma}_{j}^{2}}\right)=\prod _{j}{\left(\frac{1}{{\sigma}_{j}\sqrt{2\pi}}\right)}^{{n}_{j}}exp\left(-\frac{{\chi}_{j}^{2}}{2{\sigma}_{j}^{2}}\right),$ where σ_{j} is the standard deviation in predicting chemical shifts of nuclei type j and ${\chi}_{j}^{2}=\sum _{i}^{{n}_{j}}\Delta \underset{ij}{\overset{2}{\delta}}$. The structure, X, and the uncertainties in the model, θ, are assumed independent and $p\left(\mathbf{X},\mathbf{\theta}\right)$ can be expanded into (5)$p\left(\mathbf{X},\mathbf{\theta}\right)=p\left(\mathbf{X}\right)p\left(\mathbf{\theta}\right)=p\left(\mathbf{X}\right)\prod _{j}p\left({\sigma}_{j}\right).$ The prior probability for the protein structure can be expressed by the Boltzmann distribution, that is: (6)$p\left(\mathbf{X}\right)=\frac{1}{Z\left(T\right)}exp\left(-\frac{E\left(\mathbf{X}\right)}{{k}_{\mathrm{B}}T}\right),$ where the physical energy E(X) could for example be approximated using a molecular mechanics force field. Note that in this case, the partition function Z(T) is a normalization constant and evaluation of this is not necessary. We have little prior knowledge about σ_{j} other than that it is a scale parameter. An uninformative choice of prior distribution is the Jeffreys prior (Jeffreys, 1946), which in this case is simply: (7)$p\left({\sigma}_{j}\right)\propto {\sigma}_{j}^{-1}.$ Combining these expressions, $p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)$ is thus proportional to (8)$p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)\propto p\left(\mathbf{d}\mid \mathbf{X},\mathbf{\theta}\right)p\left(\mathbf{X}\right)p\left(\mathbf{\theta}\right)\propto \prod _{j}\left[{\sigma}_{j}^{-{n}_{j}-1}exp\left(-\frac{{\chi}_{j}^{2}}{2{\sigma}_{j}^{2}}\right)\right]exp\left(-\frac{E\left(\mathbf{X}\right)}{{k}_{\mathrm{B}}T}\right).$ The resemblance to a hybrid energy such as in Eq. (1) is obtained by (neglecting all constant terms): (9)${E}_{\mathrm{hybrid}}\left(\mathbf{X},\mathbf{\theta}\right)=-{k}_{\mathrm{B}}Tln\left(p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)\right)={k}_{\mathrm{B}}T\sum _{j}\left(\left({n}_{j}+1\right)ln\left({\sigma}_{j}\right)+\frac{{\chi}_{j}^{2}}{2{\sigma}_{j}^{2}}\right)+E\left(\mathbf{X}\right).$ From this it is seen that the standard deviations are effectively describing the weight of the experimental data. The energy dependence of σ_{j} is depicted in Fig. 1.
Conjugate prior
As discussed below, sampling uncertainties for the Gaussian model using the Jeffrey’s prior leads to numerical problems. The problems arises if ${\chi}_{j}^{2}$ converges to zero, which leads to σ_{j} → 0. This can be seen from the maximum a posteriori estimator (MAP) of ${\sigma}_{j}^{2}$ according to Eq. (9): (10)${\sigma}_{j,\mathrm{MAP}}^{2}=\frac{{\chi}_{j}^{2}}{{n}_{j}+1}.$ We found that these problems could be avoided by using a weakly informative prior. The conjugate prior for the variance of the Gaussian distribution (${\sigma}_{j}^{2}$), when the mean is known, can be given by an Inverse-Gamma distribution: (11)$p\left({\sigma}_{j}^{2}\mid \alpha ,\beta \right)=\frac{{\beta}^{\alpha}}{\Gamma \left(\alpha \right)}{\left({\sigma}_{j}^{2}\right)}^{-\alpha -1}exp\left(-\frac{\beta}{{\sigma}_{j}^{2}}\right).$ $p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)$ is thus proportional to (12)$p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)\propto p\left(\mathbf{d}\mid \mathbf{X},\mathbf{\theta}\right)p\left(\mathbf{X}\right)p\left(\mathbf{\theta}\right)\propto \prod _{j}\left[{\sigma}_{j}^{-{n}_{j}-2\alpha -2}exp\left(-\frac{2\beta +{\chi}_{j}^{2}}{2{\sigma}_{j}^{2}}\right)\right]exp\left(-\frac{E\left(\mathbf{X}\right)}{{k}_{\mathrm{B}}T}\right).$ In contrast to Eq. (10), the maximum a posteriori estimator of ${\sigma}_{j}^{2}$ does not equal zero in the limit of ${\chi}_{j}^{2}\to 0$ with a non-zero choice of β: (13)${\sigma}_{j,\mathrm{MAP}}^{2}=\frac{2\beta +{\chi}_{j}^{2}\left(\mathbf{X}\right)}{2\alpha +2+{n}_{j}}.$ In all the simulations where σ_{j} was sampled we use Eq. (12) and α = β = 0.001 (Gelman, 2006) unless stated otherwise.
Marginal likelihood
Alternatively one can use the marginal likelihood where σ_{j} is integrated out: (14)$p\left(\mathbf{d}\mid \mathbf{X}\right)=\prod _{j}{\int}_{0}^{\infty}p\left(\mathbf{d}\mid \mathbf{X},{\mathbf{\sigma}}_{j}\right)p\left({\mathbf{\sigma}}_{j}\right)\mathrm{d}{\mathbf{\sigma}}_{j}\propto \prod _{j}{\left({\chi}_{j}^{2}\right)}^{\frac{-{n}_{j}}{2}}.$ This results in a hybrid energy of the form: (15)${E}_{\mathrm{hybrid}}\left(\mathbf{X}\right)=-{k}_{\mathrm{B}}Tln\left(p\left(\mathbf{X}\mid \mathbf{d}\right)\right)={k}_{\mathrm{B}}T\sum _{j}\left(\frac{{n}_{j}}{2}ln\left({\chi}_{j}^{2}\right)\right)+E\left(\mathbf{X}\right).$
Cauchy distribution
The Cauchy and Gaussian distribution are both special cases of the Student’s t-distribution, with degrees of freedom ν = 1 and ν = ∞ respectively. Compared to the Gaussian distribution, the Cauchy distribution has much heavier tails meaning that it will be less penalizing of single predictions far from the experimental values.
$p\left(\mathbf{d}|\mathbf{X},\mathbf{\theta}\right)$ is again obtained as the product of the individual probabilities of all measured chemical shifts, with scale parameters γ_{j} (equivalent to σ_{j} of the Gaussian distribution): (16)$p\left(\mathbf{d}\mid \mathbf{X},\mathbf{\theta}\right)=\prod _{j}\prod _{i=1}^{{n}_{j}}p\left({\delta}_{\mathrm{exp},ij}\mid {\delta}_{\mathbf{X},ij},{\gamma}_{j}\right)=\prod _{j}\left\{{\left(\pi {\gamma}_{j}\right)}^{-{n}_{j}}\prod _{i=1}^{{n}_{j}}{\left[1+{\left(\frac{\Delta {\delta}_{ij}}{{\gamma}_{j}}\right)}^{2}\right]}^{-1}\right\}.$ Note that the Cauchy distribution does not reduce into an expression that depends on the ${\chi}_{j}^{2}$ differences (in contrast to the Gaussian). The Jeffreys prior is the same as for the Gaussian distribution: (17)$p\left({\gamma}_{j}\right)\propto {\gamma}_{j}^{-1}.$ $p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)$ is thus proportional to (18)$p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)\propto \prod _{j}\left\{{\gamma}_{j}^{-\left({n}_{j}+1\right)}\prod _{i=1}^{{n}_{j}}{\left[1+{\left(\frac{\Delta {\delta}_{ij}}{{\gamma}_{j}}\right)}^{2}\right]}^{-1}\right\}exp\left(-\frac{E\left(\mathbf{X}\right)}{{k}_{\mathrm{B}}T}\right).$ The resemblance to a hybrid energy such as in Eq. (1) is obtained by (neglecting all constant terms): (19)${E}_{\mathrm{hybrid}}\left(\mathbf{X},\mathbf{\theta}\right)=-{k}_{\mathrm{B}}Tln\left(p\left(\mathbf{X},\mathbf{\theta}\mid \mathbf{d}\right)\right)={k}_{\mathrm{B}}T\sum _{j}\left\{\left(\left({n}_{j}+1\right)ln\left({\gamma}_{j}\right)+\sum _{i=1}^{{n}_{j}}ln\left[1+{\left(\frac{\Delta {\delta}_{ij}}{{\gamma}_{j}}\right)}^{2}\right]\right)\right\}+E\left(\mathbf{X}\right).$
Methodology
Computational methodology
Markov chain Monte Carlo simulations were carried out with PHAISTOS v1.0 (Boomsma et al., 2013) using either the multicanonical generalized ensemble via MUNINN (Ferkinghoff-Borg, 2002) or Metropolis–Hastings (Metropolis et al., 1953). Chemical shift predictions were performed with an implementation of CamShift (Kohlhoff et al., 2009) and the physical energy was approximated using the computational efficient PROFASI force field (Irbäck & Mohanty, 2006). The conformational degrees of freedom explored in the simulations were restricted to the backbone and side-chain dihedral angles (ϕ, ψ, χ) as well as the backbone bond angles. Backbone moves had torsion and bond angles biased by CS-Torus (Boomsma et al., 2014) and Engh-Huber statistics (Engh & Huber, 1991) respectively, which both introduces an implicit energy. Chemical shifts were only utilized by CS-Torus for biased sampling in reference simulations where no CamShift energy term was used. The simulations were performed on AMD Opteron 2.1 GHz CPU’s at ∼12 M steps/day or on Intel Xeon 3.07 GHz CPU’s at ∼18 M steps/day.
Convergence simulations
The Protein G convergence simulations were initialized from the experimental structure (PDB-id: 2OED). The simulations were run for 10 M MC steps at 300 K using Metropolis–Hastings. The physical move set was comprised of 50% local, uniform single side chain moves, 25% CRISP local moves (Bottaro et al., 2012) and 25% semilocal biased Gaussian step (BGS) backbone moves (Favrin, Irbäck & Sjunnesson, 2001).
Structure determination simulations
The structure determination simulations were each run on 32 threads for 100 M iterations. The temperature range explored with MUNINN were set to 273 K–500 K. The physical move set was comprised of 50% local, uniform single side chain moves, 40% CRISP backbone moves and 10% backbone-DBN pivot moves (Boomsma et al., 2008). In the simulations where the uncertainties were dynamically adjusted, an extra 10 M Monte Carlo steps were added which sampled a change in σ_{j} or γ_{j} as described below. Note that these moves are essentially computationally costless, since neither chemical shifts or force field energy terms need be recomputed.
Clustering of sampled structures
To make clustering feasible for the large amount of structures generated (320,000 structures for each combination of potential and protein), the sampled structures were converted to GIT vectors (Røgen & Fain, 2003) with PHAISTOS. The structures from each individual thread were subsequently divided into sets of 15 clusters with the Pleiades module of PHAISTOS (Harder et al., 2012) using K-means clustering (Lloyd, 1982). The choice of using 15 clusters is based on the suggestion of the Pleiades authors of creating 10–20 clusters. Since the clustering process is stochastic, it was performed 10 times for each thread and the optimal clustering according to the sum of squared errors were used for further analysis. From each of these clusters, a subset consisting of the 100 structures closest to the cluster centroid were selected for energy and RMSD evaluation and the median energy structures were chosen as cluster representatives. The GIT vectors can be created as output observables directly from the simulations, but in this case they were created from the simulation trajectories using the pdb2git application in PHAISTOS with the program GNU Parallel (Tange, 2011) used to parallelize the jobs. Re-weighting from the generalized ensemble to approximate the canonical ensemble were done automatically with Pleiades using the weighted k-means option.
Monte Carlo move in uncertainty parameter space
The ξ-move which re-samples the value of the uncertainties (i.e., σ or γ) was constructed by multiplying the previous value of ξ by a sampled constant centered around 1. Detailed balance is maintained by proposing a small change, ξ → ξ′, by: (20)${\xi}^{\prime}=\xi \cdot exp\left(\mathrm{rnom}\left({\sigma}_{\mu}\right)\right),$ where rnom(σ_{μ}) is a random number from a normal distribution with zero mean and standard deviation σ_{μ}. A value of σ_{μ} = 0.1 was found to yield a rapid and stable convergence for both the Gaussian and the Cauchy distribution.
Issues from unexplored degrees of freedom
It was observed that CamShift predictions of C_{β} chemical shifts for Isoleucine were consistently off by 3–8 ppm for the structures generated in the simulations performed with PHAISTOS. This was observed using both the CamShift implementation in PHAISTOS as well as with the standalone predictor. CamShift was trained on high quality X-ray structures where missing Hydrogens were added in accordance with the CHARMM22 topology file (Brooks et al., 2009). Letting the CamShift program optimize Hydrogen placement before prediction brought the accuracy of predicted Isoleucine C_{β} chemical shifts in range with the prediction for the remaining amino-acids. For reference, the RMSD for C_{β} chemical shift prediction of all amino-acids of a Chymotrypsin Inhibitor-II protein (CI2) structure were found to be 1.90 ppm including predictions for Isoleucine and 1.25 ppm if these predictions were excluded. As bond lengths and side-chain bond angles are not degrees of freedom in the simulations performed with PHAISTOS, the distances of β-Hydrogens and γ-Hydrogens relative to the C_{β} atoms are constant. Even though this affects the Camshift prediction, it is reasonable to assume that this can be compensated to some degree by small structural perturbations. However, this distance dependence of the C_{β} chemical shift prediction for Isoleucine is much larger than for the remaining amino acids (Kohlhoff et al., 2009) and as a result we chose to disable prediction for Isoleucine C_{β} chemical shifts in the simulations.
Results and Discussion
Problems with Gaussian weighting scheme when using a Jeffreys prior
Attempts to use predicted chemical shifts from CamShift while sampling σ using a Gaussian model (Eq. (9)) initially proved unsuccessful. Using any structure (compact or unfolded) as starting point for the Monte Carlo simulation, it was often observed that the χ^{2} agreement between predicted and experimental chemical shifts would converge to zero after only a few million iterations. Naturally this leads to σ → 0, which in turn essentially freezes the structure in the simulation, since any MC move that causes the slightest increase in chi-square will result in an enormous change in energy. If several types of chemical shifts were included in the simulation (possible chemical shift types from CamShift are H_{α}, C_{α}, H, N, C and C_{β}), the χ^{2} for one (random) of the included types would quickly converge to zero. One suspected reason was that the prior distribution was not well described by the more coarse grained PROFASI force field. CamShift calculations were therefore redone using the OPLS-AA/L force field (Kaminski & Friesner, 2001). This, however, led to identical results.
CamShift (and most likely other similar predictors) is able to make relatively large changes in prediction, from a small perturbation in the structure. Combined with sampling of σ, this can drive the simulation into an energy minimum with essentially zero error in the chemical shift prediction, even though the structure may or may not be anything like the native structure. We found the Cauchy distribution to be less sensitive to divergence of the scale parameter and to perform better as an uninformative model in our case. As an alternative to the Jeffreys prior, a weakly informative conjugate prior for the Gaussian model did not show these sampling issues.
Convergence of scale parameters
The convergence of the scale parameters for the Gaussian and Cauchy distributions (σ and γ respectively), with chemical shifts predictions by CamShift (Kohlhoff et al., 2009), were explored by starting a simulation with PHAISTOS (Boomsma et al., 2013) from the native structure of Protein G (PDB:2OED (Ulmer et al., 2003)). Experimental chemical shifts were obtained from Ref-DB (Zhang, Neal & Wishart, 2003) (RefDB:2575 (Orban, Alexander & Bryan, 1992)). For each model, a 10^{7} MC step simulation was performed: keeping the structure fixed, only sampling uncertainties (frozen), and a simulation where the atomic coordinates (X) was sampled as well (free). Tables 1 and 2 shows the mean of the sampled parameters from the last 10^{6} steps together with the maximum likelihood values obtained from the CamShift training set for reference.
C_{α} | H_{α} | N | H | C | C_{β} | |
---|---|---|---|---|---|---|
CamShift training set | 1.22 | 0.26 | 2.78 | 0.56 | 1.12 | 1.19 |
Frozen simulation^{a} | 1.13 | 0.26 | 3.53 | 0.52 | 1.06 | 1.21 |
Free simulation^{a} | 1.03 | 0.20 | 2.92 | 0.46 | 1.16 | 1.23 |
Notes:
C_{α} | H_{α} | N | H | C | C_{β} | |
---|---|---|---|---|---|---|
CamShift training set | 0.70 | 0.19 | 1.87 | 0.31 | 0.74 | 0.77 |
Frozen simulation^{a} | 0.62 | 0.17 | 1.90 | 0.32 | 0.64 | 0.69 |
Free simulation^{a} | 0.43 | 0.05 | 1.57 | 0.25 | 0.67 | 0.55 |
Notes:
Using a Gaussian distribution, the parameters in the ‘frozen’ simulation all converged within 0.1 ppm to the reported values from the CamShift training set, with the exception of the N nuclei which deviated by 0.75 ppm. The RMSDs presented in Table 1 for the CamShift training set were based on predictions on 7 proteins and, using a larger data set of 28 proteins, the average RMSD for the N nucleus increased from 2.78 ppm to 3.01 ppm (Kohlhoff et al., 2009). Thus, the slightly higher mean for N seems reasonable. Allowing the structure and weight parameters to be sampled simultaneously in the ‘free’ simulation overall lowered the RMSD of the prediction as expected, since the accepted structures in the Monte Carlo simulation will be biased by the correlation of predicted and experimental chemical shifts. However, the RMSD increased moderately for the C nucleus and slightly for C_{β}, indicating that the chemical shift prediction of C and C_{β} are less sensitive to changes in local structure than the four other nuclei.
In the simulations using a Cauchy distribution, the ‘frozen’ values were seen to be similar to the CamShift data set (within 0.1 ppm). When physical moves were introduced in the ‘free’ simulation, the sampled parameters were again found to be lowered, but remained within 0.3 ppm. Surprisingly, γ for H_{α} went from 0.17 ppm to 0.05 ppm, with similar values found when repeating the simulation. The χ^{2} error in the prediction of H_{α} chemical shifts were similar to that obtained with the Gaussian potential, indicating that the error in prediction for H_{α} atoms had several outliers. Since the Cauchy distribution is less sensitive to outlier values, these will have a lesser effect on the sampled parameters than for the Gaussian.
Comparison of weighting schemes in structure determination
A series of simulations starting from an unfolded state were performed on ENHD (PDB: 1ENH (Clarke et al., 1994), BMRB:15536 (Religa, 2008)), Protein G and the SMN Tudor Domain (PDB: 1MHN (Sprangers et al., 2003), RefDB:4899 (Selenko et al., 2001)) to compare how different weighting schemes performed for structure determination. The probabilistic schemes used included three Gaussian models: one using the maximum likelihood estimates of σ from the CamShift training set (Gaussian/fixed); one where the values of σ were sampled (Gaussian/sampled); and one using the marginalized distribution (Gaussian/marginalized). Similarly, two Cauchy models were tested: one using maximum likelihood values for γ from the CamShift training set (Cauchy/fixed), and one where the values for γ were sampled (Cauchy/sampled). As reference, the square well potential of Robustelli et al. (2010), which was made specifically for refinement with the CamShift model, were included in the simulations with different weights (Square well/α = 1, Square well/α = 5).
In all simulations, the generative predictive model CS-Torus (Boomsma et al., 2014) was used to sample backbone dihedral angles from a distribution biased by the amino-acid sequence. Chemical shifts can provide local information to the CS-Torus model to further improve the biased sampling, but this was not utilized in any simulations using CamShift predictions. Although including chemical shifts in the sampling would most likely improve the simulation results, we chose to keep the CamShift energy terms as the only bias from the experimental chemical shifts. To display the effect of using a non-local chemical shift predictor like CamShift instead of relying on local information alone in the sampling, simulations using chemical shifts in the CS-Torus model, rather than with CamShift prediction, were run as well.
Thirty-two folding simulations were run for each potential and protein for 100 M MC steps using the PROFASI (Irbäck & Mohanty, 2006) force field and a CamShift energy term. For each set of simulations, the sampled structures from each thread were subsequently split into clusters as described in the Methodology section, and cluster representatives were selected as the structures median in energy, from the 100 structures closest to the cluster centroid. Table 3 shows the number of threads sampling structures below 2 and 4 Å C_{α}-RMSDs to the native structures as well as the RMSDs for the cluster representative with the lowest PROFASI+ CamShift energy. The residue ranges used to calculate the RMSDs were 5–54 for ENHD, all residues for Protein G and 4–56 for the SMN Tudor Domain.
Threads (out of 32) sampling below 2Å (left) and 4Å (right) | Lowest-energy RMSD (Å) | ||||||||
---|---|---|---|---|---|---|---|---|---|
ENHD | Protein G | SMN | ENHD | Protein G | SMN | ||||
Gaussian/fixed | 32 | 32 | 0 | 7 | 29 | 30 | 3.67 | 3.11 | 3.11 |
Gaussian/sampled | 32 | 32 | 4 | 15 | 13 | 20 | 2.15 | 3.03 | 5.88 |
Gaussian/marginalized | 32 | 32 | 1 | 16 | 7 | 14 | 4.24 | 2.72 | 6.06 |
Cauchy/fixed | 32 | 32 | 9 | 25 | 15 | 21 | 1.94 | 1.15 | 2.58 |
Cauchy/sampled | 32 | 32 | 13 | 24 | 11 | 16 | 1.87 | 2.82 | 5.51 |
Square well/α = 1^{a} | 19 | 22 | 2 | 12 | 14 | 18 | 2.29 | 3.14 | 3.71 |
Square well/α = 5^{a} | 32 | 32 | 0 | 1 | 1 | 5 | 3.82 | 5.83 | 1.91 |
CS-Torus ^{b} | 4 | 27 | 8 | 25 | 0 | 0 | 19.2 | 3.01 | 8.33 |
Convergence of sampling
The data in Table 3 shows that for certain potentials and proteins, several threads failed to sample near-native structures. For ENHD all potentials but the CS-Torus model and square well/α = 1 potential sampled structures below 2 Å C_{α}-RMSD for all threads. While more than 20 threads sampled structures below 4 Å for both the CS-Torus and square well model, only 4 threads sampled structures below 2 Å for CS-Torus. For Protein G no threads for the Gaussian/fixed and square well/α = 5 potentials sampled structures below 2 Å. The square well/α = 1, Gaussian/marginalized and Gaussian/sampled potentials only sampled these near-native states with a few threads, while the Cauchy potentials and the CS-Torus model showed the fewest sampling issues.
Looking closer at the threads never sampling structures close to native for Protein G, it is found that the majority of these never progressed past a local energy-minimum with an alternative conformation where two β-strands have interchanged position (Fig. 2). Taking the median structure of the most dense cluster as representative for each thread, 27 of these show this incorrect fold for the Gaussian/fixed potential and 26 for the square well/α = 1 potential. The Cauchy distributions shows the opposite trend with 25 correct folds for both potentials, while the structures from the Gaussian/sampled and Gaussian/marginalized simulations had 14 and 11 correctly folded, respectively. For all of these potentials, the densest clusters of each thread have either this misfold or the correct structure. While the square well/α = 5 potential seem to find completely incorrect structures, the CS-Torus simulations finds the correct overall fold in 20 threads. The remaining CS-Torus threads are partly unfolded and none of them have the misfolded structure found in the simulations with CamShift energy terms. Finally, for the SMN Tudor Domain, the Gaussian/fixed model sampled structures below 2 Å for nearly all threads. The CS-Torus model and square well/α = 5 potential for 0 and 1 thread(s) respectively, while the remaining potentials sampled below 2 Å for around a third of the threads.
Ideally, the simulations with a given potential samples structures close to native consistently well for all proteins, which was not the case for the Gaussian/fixed model, square well/α = 5 potential, the CS-Torus reference model and to a lesser extent the Gaussian/sampled model. The two Cauchy potentials was most likely to sample low-RMSD structures across the three proteins. Due to limitations of the MUNINN implementation in PHAISTOS at the time the simulations were run, the multicanonical generalized ensembles from each thread cannot be re-weighted to approximate a single canonical ensemble, and clustering of structures must be done on a per-thread basis. Since cluster densities can’t readily be compared across threads, the structure clusters are evaluated from the force field and CamShift energy.
Lowest-energy clusters
Table 3 shows for each potential and protein the C_{α}-RMSDs to native for the lowest-energy structures found by clustering. There is no clear consensus of which potentials result in the most accurate structures overall based on the RMSD values. Visually (Figs. S1–S6) all but CS-Torus has the correct fold for ENHD, with the Gaussian/fixed, Gaussian/marginalized and square well/α = 5 structures being less compact than the crystal structure. For protein G, only the square well/α = 5 potential shows a slight misfold, and the overall somewhat high RMSDs is again due to slightly less compact structures, as well as a small displacement of beta-sheet positions for all but the CS-Torus and Cauchy/fixed models. Although the misfold shown in Fig. 2 was prevalent in the simulations in many threads, none of the lowest-energy structures have these interchanged β-strand positions. For the SMN Tudor Domain, the difference in RMSDs between the potentials is mainly due to the protein tails not being correctly placed in a compact structure.
As mentioned above, the obtained structures from the lowest-energy clusters are in general less compact than the crystal structures. This is a result of additional compactness terms being excluded in the simulations such that the effect of using different potentials for modelling the discrepancy between observed and predicted chemical shifts might be more clear. In nearly all of the simulations higher energy clusters exists that have lower RMSDs to the native structure, suggesting that near-native structures are sampled, but the compactness of the protein isn’t properly described by the force field. Evaluating sampled structures with energy terms not included in the Monte Carlo simulations is problematic, since the energy can fluctuate greatly with small changes in local structure. However when entire clusters of structures are evaluated this becomes less of a problem, especially when coarse grained energy terms is used in addition to the energies obtained from the simulations. The half-sphere exposure mixture model (HSEMM), implemented in PHAISTOS for modelling solvent exposure, is a variation of the multibody multinomial model (MuMu) (Johansson & Hamelryck, 2013) with the environment of residue i described by four features: the secondary structure according to CS-Torus, the backbone hydrogen bond network and the half sphere exposure up and down measure (Hamelryck, 2005). For every cluster, the energy from HSEMM was calculated and added to the total energy of the structures, with the hydrogen bond network feature integrated out to enforce the coarse grained characteristics of the model.
Lowest-re-scored-energy RMSD | |||
---|---|---|---|
ENHD | Protein G | SMN | |
Gaussian/fixed | 1.40 | 2.45 | 2.23 |
Gaussian/sampled | 1.03 | 1.29 | 1.24 |
Gaussian/marginalized | 1.11 | 1.00 | 3.81 |
Cauchy/fixed | 1.40 | 1.16 | 1.55 |
Cauchy/sampled | 1.86 | 0.86 | 2.50 |
Square well potential/α = 1^{a} | 1.15 | 1.37 | 3.05 |
Square well potential/α = 5^{a} | 0.96 | 4.35 | 1.91 |
CS-Torus ^{b} | 3.88 | 1.57 | 9.18 |
The results are summarized in Table 4 and show that the lowest-energy clusters re-scored with the solvent exposure term all have lower or similar RMSDs to the clusters evaluated with just the PROFASI+ CamShift energies. Sampling of the uncertainty when using the Gaussian distribution results in the structures closest to native, with RMSDs below 1.5 Å for all three proteins. For the Cauchy distribution, sampling the uncertainties does not seem to be an improvement over using predetermined weights, but both approaches gives better structures overall than the remaining potentials. Furthermore, it is clear that the non-local information provided by the CamShift model greatly improves structure sampling, as shown by the relatively poor performance of the simulations using only CS-Torus.
Conclusion
We present a probabilistic method for biasing protein structure simulations with experimentally measured chemical shifts, based on the inferential structure determination formalism (ISD) (Rieping, Habeck & Nilges, 2005). In this formalism, the weighting of experimental data can be determined entirely by the data itself, the predictive model and the physical force field.
Simulations were performed on three small proteins (ENHD, Protein G and SMN Tudor Domain) for a Gaussian and Cauchy-based probability distribution, using the chemical shift predictor CamShift (Kohlhoff et al., 2009). The ISD-determined uncertainties were found to correspond well to the empirically determined uncertainties in the CamShift predictions. Furthermore sampling the uncertainties as part of the protein structure determination simulations, lead to improved accuracy of the predicted structures when a Gaussian potential was used. Using a Cauchy potential with either sampled or fixed uncertainties did, however, show overall better convergence to the native fold, suggesting that the simulations are less likely to get stuck in local minima with these potentials. Additionally, the importance of capturing non-local information from experimental chemical shifts have been shown by comparing the use of the CamShift predictor to the local-only CS-Torus model.