Nucleic acids are central macromolecules for the storing, flow and regulation of genetic and epigenetic information in cellular organisms. RNA can adopt a wide variety of 3D structural conformations and this structural variability explains the multiplicity of roles that RNA performs on cells (Wan et al., 2011; Eddy, 2001). The classification of RNA backbone conformations into rotamers is a very useful way to delimit the conformational space of RNA structures. Rotamers are defined in terms of the backbone torsional angles namely α, β, γ, δ, ε and ζ (as shown in Fig. 1). This classification was proposed by Richardson et al. (2008), and has been achieved after the attempts of different research groups to find a consensus RNA backbone structural classification. There are 55 backbone rotamers, from which 46 are rotamers with well defined torsional angles distributions, and the remaining nine rotamers were proposed as wannabe rotamers. The ‘suite’ is the basic subunit used for rotamer classification. The suite is defined from sugar-to-sugar (or from the δ torsional angle of residue i-1 to the δ torsional angle of residue i), and it is contained within the dinucleotide (DN) subunit (see Fig. 1). 13C′ chemical shifts (CS) have been successfully used by our and other groups for proteins and glycans structural determination, validation and refinement (Shen & Bax, 2010; Martin et al., 2013; Frank et al., 2015; Garay & Vila, 2018). 1H CS have been successfully used by Sripakdeevong et al. (2014) for structure determination and prediction of noncanonical RNA motifs. Methods incorporating 13C CS for RNA structure determination, validation and refinement are also available (Frank, Stelzer & Bae, 2013; Frank, Law & Brooks, 2014; Brown, Summers & Johnson, 2015) but, to our knowledge, none of them include the explicit use of backbone rotamers. In this work, we study how to use 13C′ CS to classify RNA backbone conformations into rotamers with machine learning models. Overall, a complete understanding of the molecular basis of the biological processes in which RNA molecules are involved entails an accurate knowledge of their 3D structure. In this regard, it is well known that the computation of the 13C′ chemical shifts (CS) for RNA, at DFT-level of theory, is very sensitive to the backbone conformation of the molecule. Thus, among other potential application of our work is to build, for any possible combination of RNA backbone torsional-angles conformations, a detailed 13C CS look-up table. Hence, given a 13C CS value the corresponding set of RNA backbone torsional angles can be quickly determined, and vice versa, making the look-up table a very valuable tool with which determine, validate and refine RNA conformations.
In order to provide a clear understanding of the methodology implemented in this work, a flowchart with the overall work-flow is shown in Fig. 2. A theoretical dataset of RNA backbone rotamers with 13C′ CS values is necessary to train the machine learning models to classify RNA experimental suites into rotamers. In the following two sections we explain how we obtained both datasets.
Experimental 13C′ CS data for RNA molecules was retrieved from the BioMagResBank (BMRB; http://www.bmrb.wisc.edu) (Ulrich et al., 2008), along with their corresponding structures from the Protein Data Bank (PDB; https://www.rcsb.org/) (Berman et al., 2000). As it is fundamental to count on reliable experimental 13C′ CS values for an accurate structural analysis, data curation was carried out using 13Check_RNA (Icazatti et al., 2018) a Python module to correct RNA 13C′ CS systematic errors, recently developed in our group. The obtained dataset (see Table S1) contains 26 RNA structures with 13C′ CS for the five ribose carbon nuclei (C1′, C2′, C3′, C4′ and C5′), providing a total of 391 suite subunits and 391 sets of 13C′ CS. As there are at least 8 models in the NMR ensembles for each RNA molecule (up to 20 in some cases), the complete database contains 7,612 conformations. Given that we needed a one-to-one correspondence between the sets of CS and the rotamer suites, only the first structure from each NMR ensemble was used, considering that the first model listed in the PDB files is usually reported as the structure with the lowest energy scoring. This choice has a negligible average effect on the results of our analysis (see Figs. S10 and S11).
For every PDB entry, the 3D coordinates of the first model were extracted in order to compute the backbone torsional angles (δi−1, εi−1, ζi−1, αi, βi, γi, δi) of the suites. Then, these torsional angles were used to assign the RNA suites to their corresponding rotamer names. From the 46 original rotamers, only 38 are represented in the final experimental dataset.
In order to have a complete dataset with the 46 RNA backbone rotamers and their corresponding 13C′ CS, a theoretical dataset was also constructed. A template for each of the 16 possible combinations of DN (A, C, U and G) sequences was obtained from RNA structures found in the PDB. A Monte-Carlo conformational sampling was carried out by rotating the backbone torsional angles of the corresponding suite contained in each DN, while keeping the bond-lengths and bond-angles fixed (rigid geometry approximation). To perform such rotations, the torsional angle distributions for each of the 46 RNA backbone rotamer suites (Richardson et al., 2008) were used. A function which eliminates conformers with atom clashes was implemented as part of the routine. As a result, 10,340,852 conformations were generated. Quantum-theory level computation of CS is very time-consuming. Therefore, to reduce the number of calculations, a smaller number of conformations was selected. Aiming to keep most of the variability of the originally generated conformations, we computed the Shannon entropy S (see Eq. (1)) of the distribution of torsional angles. Here, Pi is the probability of the i conformation taken from a histogram with a bin size of 5 degrees. The entropy was computed for different subsets of conformations and sample sizes (from 5 to 100) (see Fig. 3). We decided to use the 80% of the maximum entropy as a cutoff, which implies a sample size of around 40 conformations per rotamer. As we also considered the 16 combinations of DN sequences, the total number of conformations computed at the DFT level of theory was 30,530. (1)
Details of the quantum-chemical calculations of the 13C′ shieldings
Previous to the DFT calculations of the obtained dataset, a test was performed over a subset of 41 rotamers of sequence AA. A similar approach as described below for mononucleotides was used, except that the templates were methyl-blocked DNs: Me − O3′i−2 − Ai−1 − Ai − O5′i+1 − Me. Subsequent comparison of the obtained 13C′ CS for these DNs with those obtained from the corresponding mononucleotides, gave the same result within 10−2 ppm while the total computation time was approximately half the total time for computing the complete DNs. Thus, the DN conformations from the final dataset were split in their corresponding mononucleotide subunits. Nucleotide subunits were treated as terminally-blocked mononucleotides with methyl groups (Me) in both termini (Me − O3′i−1 − Xi − O5′i+1 − Me). Phosphate groups of the backbone were treated as neutral, because we assume that all backbone charges are shielded during the quantum-chemical calculations. Results based on the analysis of 139 conformations of ubiquitin at pH 6.5 (Vila & Scheraga, 2008), indicate that use of neutral, rather than charged, aminoacids is a significantly better approximation of the observed 13Cα CS in solution for the acidic groups, and a slightly better representation, though significantly less expensive in computational time, for the basic groups. Considering that the phosphate group in RNA is close to the nucleus of interest (as it happens with the acidic groups) we can assume, without losing generality, that neutral rather than charged phosphate group is a better approximation for the computation of the 13C′ CS in the RNA suites. This approach was also adopted because under physiological conditions, the phosphate groups are completely ionized and neutralized by positive charges (Lehninger Nelson & Cox, 2000). A 6–311+G(2d,p) locally dense basis set (Chesnut & Moore, 1989) was used for calculation of backbone 13C′ CS and their nearest neighbor nuclei, at the DFT level of approximation (see Fig. 4 for details). The remaining nuclei were treated with a 3-21G basis set. The OB98 density functional was used because good results were previously observed for proteins and glycans in our group (Vila & Scheraga, 2009; Garay et al., 2014). All DFT computations were done using the Gaussian package (Frisch et al., 2004). Summarizing, the adopted strategies make the computed 13C′ CS from mononucleotides suitable for comparison with the 13C′ CS observed from complete RNA molecules.
|46 rotamers||22 families δi−1δiαγ||10 families δi−1δiα||10 families δi−1δiγ||7 families αγ||4 families δi−1δi||2 families A_noAi||2 families A*_noA*ii|
Families of rotamers
The original 46 RNA backbone rotamers were grouped in families based on their δi−1, δi, α and γ torsional angles values. Only these four (out of seven) backbone torsional angles in the suite subunit were chosen to group the rotamers because their distributions of observed values are bimodal (δi−1 and δi) and trimodal (γ and α), with clearly separated peaks (see Fig. 5). This selection allowed us to group rotamers based on the torsional angle values within the different peaks. As summarized in Table 1, four families were found when both δi−1 and δi torsional angles in the suite were used (see Table 2), seven families for the αγ combination, 10 families for δi−1δiα, and δi−1δiγ, and 22 families for δi−1δiαγ. In order to evaluate the classification performance of the RNA A–form helix conformations, the rotamers were also grouped as: (i) A_noA families, where the 46 rotamers were separated in A–form helix (1a) vs. no A–form helix rotamers, and (ii) A*_noA* families, where the 46 rotamers were separated in rotamers related to A–form helix (1a, 3d, 3b, 5d, 0a, 6b and 4b rotamers) vs. the remaining rotamers.
A series of machine learning methods were used to classify RNA suites as rotamers (or families of rotamers) based on their ribose 13C′ CS values. The following classification methods from the scikit-learn Python library (Pedregosa et al., 2011) were trained: K-Nearest Neighbors (NN), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM) and a class of neural network called Multi-Layer Perceptron (MLP). Different model parameters were tried out (see Table S3). A random sampling algorithm was also used as a control, where suites were classified randomly. The sequence of the suite was considered for classification, because we found that the performance increased compared to a sequence–independent classification (see Fig. S10). The classification performance was assessed with four measures: weighted accuracy, precision, recall and F1 score (Van Rijsbergen, 1979). The weighted accuracy was used in order to recalibrate the contribution of the different rotamers, because the observed frequency of the rotamers is highly uneven (see Fig. S1). The weights used in the weighted accuracy were obtained from a substitution matrix (ROSUM, for ROtamers SUbstitution Matrix). The definition of the ROSUM matrix was inspired by the BLOSUM matrix used for protein sequence alignment (Henikoff & Henikoff, 1992). The matrix is used to weight the match or no match, between the true rotamer and the predicted rotamer, as a function of the euclidean distance between rotamers (in the seven-dimensional space of the suite backbone torsional angles) and the observed frequency of each rotamer. The torsional angles values and the observed frequencies are extracted from the rotamers table (Richardson et al., 2008). A ROSUM matrix was obtained for each of the rotamer families described in the previous section. Further details on the construction of the ROSUM matrices are provided in Data Section S4. The precision and recall were used because they gave a general overview of the performance of the method. In particular, they allowed us to assess the fraction of classified items that were correctly identified and the sensitivity of the method. The F1 score was also used as a performance measure because it is the harmonic mean of precision and recall and as such, it gives more realistic measure of the classifier’s performance.
|δi−1δi families||46 rotamers||δ(i−1)||ε(i−1)||ζ(i−1)||α(i)||β(i)||γ(i)||δ(i)|
Experimental vs. theoretical
The classification models trained with theoretical data were used to classify the experimental suites. The result of the theoretical calculations (described in a previous section) are theoretical NMR isotropic shieldings (σ). The theoretical shieldings (σcomp) must be subtracted from a reference shielding value (σref) to be transformed into theoretical CS (δcomp) (see Eq. (2)) which can then be compared with the experimental CS (δexp). A simple reference value of σref = 185.00 ppm was used, which is very close to the theoretical isotropic shielding for TMS (σTMS,th) (Vila & Scheraga, 2009), and it is consistent with the reference value previously defined for proteins and glycans. Alternatively, a set of effective references were obtained as a function of: (i) the nitrogenous base sequence, (ii) the combinations of ribose puckering states in the four families of rotamers obtained from δi−1δi torsional angles distributions, (iii) the five carbon nuclei 13C′ CS mean values and (iv) a linear regression between theoretical and experimental ribose 13C′ CS values for a set of suites (see Table S2). (2)
Theoretical vs. theoretical
The classification models trained with theoretical data were also used to classify the theoretical suites. In this case, classification was assessed through a leave-one-out cross-validation (LOO-CV). In LOO-CV, the dataset is split into a test set and training set in a one-folded manner, which means that at every iteration a unique suite is taken apart from the dataset and the remaining suites are used for training. This process continues until every suite from the theoretical dataset is evaluated.
Experimental vs. experimental
A LOO-CV was also used to classify the experimental suites.
Results and Discussion
For experimental vs. theoretical classification (Fig. 6) the 46 rotamers can be classified by means of backbone 13C′ CS with a maximal F1 score of 0.34 (see Table S5). When the 46 rotamers are grouped in families based on their torsional angles distributions, the highest scores correspond to the use of δ(i−1) and δ(i) torsional angles, where all the classifiers gave maximal scores above 0.65. This result is in agreement with the fact that backbone 13C′ CS are highly sensitive to ribose puckering states (Giessner-Prettre & Pullman, 1987), since the δ torsional angle keeps a direct relation with the ribose puckering (Gelbin et al., 1996). The δi−1δiγ, δi−1δiα, δi−1δiαγ and αγ families also show improved scores over the classification of the 46 rotamers. The A*_noA* and A_noA families show low classification scores relative to their random choice classification scores, which means that backbone 13C′ CS cannot distinguish between A–form helix and no A–form helix rotamers. In general the use of more complex classifier models such as Neural Networks, Support Vector Machine, Decision Tree and Random Forest does not assure a better performance for the current task, thus the simpler Nearest Neighbor model can be chosen for classification into RNA rotamers. In both the theoretical vs. theoretical and the experimental vs. experimental classifications (see Figs. 7 and 8, respectively), the performances increase for every group of families, compared to the experimental vs. theoretical classification. In the theoretical vs theoretical classification the performance values are very close to 1.0 for δi−1δi families and A–form helix vs. no A–form helix rotamers (A_noA). In the theoretical vs. theoretical classification, the performance value ranges are particularly narrow, except for MLP and SVM classifiers.
The high scores obtained for the theoretical vs. theoretical classification indicates that 13C′ CS are in fact very sensitive to changes of the torsional angles, the only variable we changed for the construction of the theoretical dataset. At the same time the lower performance obtained in the experimental vs. theoretical classification, is signalling that the atomistic model used for the DFT computations is not good enough to reproduce the experimental observations.
One reason the theoretical vs. theoretical classification gives better results compared to both the experimental vs. experimental and the experimental vs. theoretical classifications, could be that the experimental database is very sparse and the theoretical dataset is instead dense, or in other words the coverage of the theoretical dataset is much better than the experimental one. To explore if this is in fact a reasonable explanation, we removed elements from the theoretical dataset to mimic the sparsity of the experimental dataset (see Fig. S13). We found that while the weighted accuracy decreased (on average 0.09 points) this is not enough to explain the lower performance of the experimental vs. theoretical (on average 0.31 points lower) or experimental vs. experimental (on average 0.16 points lower) classifications. In another experiment, noise on the order of the expected error (1.47 ppm) between experimental and theoretical 13C′ CS for those rotamers correctly classified, was added to the theoretical 13C′ CS and then a theoretical vs. theoretical + noise classification was performed (see Fig. S14). Both tests reinforce the idea discussed in the previous paragraph, i.e., we need a better model for the theoretical DFT computations. These experiments also provide indirect evidence indicating that the accuracy of the experimental vs. experimental classification will be improved as more RNA conformations are deposited in databases giving another incentive to determine and deposit RNA structures and 13C′ CS data.
In this work, we explored the use of RNA backbone 13C′ CS to classify backbone conformations into rotamers and families of rotamers. In general, our study led us to the following conclusions: (1) the classification of the rotamer families defined by the δ torsional angles (see Table 2), which are directly related to the ribose puckering states, gives the best performances, in line with the results previously described by other authors; (2) classification of A-form helix and no A-form helix rotamers using 13C′ CS is not better than a random classification; (3) the performance achieved using the simple Nearest-Neighbor method is on a par with more complex classifiers such as Neural Networks, Support Vector Machine, Decision Tree and Random Forest; (4) 13C′ CS values are able to sense changes in torsional angles, but they are also affected by other factors, thus future DFT computations of RNA 13C′ CS should use more complex models than the one used in this work; (5) experimental 13C′ CS can be useful to identify RNA rotamers, if the rotamers are re-grouped in smaller families as the 46 rotamers seems to be a too fine description for accurate discrimination in terms of 13C′ CS; (6) the usefulness of 13C′ CS for rotamers identification should improve as more RNA structures and experimental 13C′ CS become available.