Classification of RNA backbone conformations into rotamers using ¹³C′ chemical shifts: exploring how far we can go

Alejandro A. Icazatti; Juan M. Loyola; Igal Szleifer; Jorge A. Vila; Osvaldo A. Martin

doi:10.7717/peerj.7904

Alejandro A. Icazatti¹, Juan M. Loyola¹, Igal Szleifer²,³,⁴, Jorge A. Vila¹, Osvaldo A. Martin¹

¹ IMASL - CONICET, Universidad Nacional de San Luis, San Luis, Argentina

² Department of Biomedical Engineering, Northwestern University, Evanston, IL, United States of America

³ Chemistry of Life Processes Institute, Northwestern University, Evanston, IL, United States of America

⁴ Department of Chemistry, Northwestern University, Evanston, IL, United States of America

Published: 2019-10-21
Accepted: 2019-09-16
Received: 2019-06-03

Academic Editor: Claudia Muhle-Goll

Subject Areas: Biochemistry, Bioinformatics, Computational Biology
Keywords: RNA, Rotamers, Machine learning, Chemical shifts, DFT

Copyright: © 2019 Icazatti et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

Cite this article: Icazatti AA, Loyola JM, Szleifer I, Vila JA, Martin OA. 2019. Classification of RNA backbone conformations into rotamers using ¹³C′ chemical shifts: exploring how far we can go. PeerJ 7:e7904 https://doi.org/10.7717/peerj.7904

The authors have chosen to make the review history of this article public.

Abstract

The conformational space of the ribose-phosphate backbone is very complex, as it is defined in terms of six torsional angles. To help delimit the RNA backbone conformational preferences, 46 rotamers have been defined in terms of these torsional angles. In the present work, we use experimental and theoretical ribose ¹³C′ chemical shift data and machine learning methods to classify RNA backbone conformations into rotamers and families of rotamers. We show to what extent the experimental ¹³C′ chemical shifts can be used to identify rotamers, and we discuss some problems with the theoretical computation of ¹³C′ chemical shifts.

Introduction

Nucleic acids are central macromolecules for the storage, flow and regulation of genetic and epigenetic information in cellular organisms. RNA can adopt a wide variety of 3D structural conformations, and this structural variability explains the multiplicity of roles that RNA performs in cells (Wan et al., 2011; Eddy, 2001). The classification of RNA backbone conformations into rotamers is a very useful way to delimit the conformational space of RNA structures. Rotamers are defined in terms of the backbone torsional angles, namely α, β, γ, δ, ε and ζ (as shown in Fig. 1). This classification was proposed by Richardson et al. (2008), and was achieved after attempts by different research groups to find a consensus RNA backbone structural classification. There are 55 backbone rotamers, of which 46 have well-defined torsional angle distributions; the remaining nine were proposed as wannabe rotamers. The ‘suite’ is the basic subunit used for rotamer classification. The suite is defined from sugar to sugar (that is, from the δ torsional angle of residue i−1 to the δ torsional angle of residue i), and it is contained within the dinucleotide (DN) subunit (see Fig. 1).

¹³C′ chemical shifts (CS) have been successfully used by our group and others for the structural determination, validation and refinement of proteins and glycans (Shen & Bax, 2010; Martin et al., 2013; Frank et al., 2015; Garay, Vila & Martin, 2018). ¹H CS have been successfully used by Sripakdeevong et al. (2014) for structure determination and prediction of noncanonical RNA motifs. Methods incorporating ¹³C CS for RNA structure determination, validation and refinement are also available (Frank, Stelzer & Bae, 2013; Frank, Law & Brooks, 2014; Brown, Summers & Johnson, 2015) but, to our knowledge, none of them makes explicit use of backbone rotamers. In this work, we study how to use ¹³C′ CS to classify RNA backbone conformations into rotamers with machine learning models.

Overall, a complete understanding of the molecular basis of the biological processes in which RNA molecules are involved requires accurate knowledge of their 3D structure. In this regard, it is well known that the ¹³C′ CS computed for RNA at the DFT level of theory are very sensitive to the backbone conformation of the molecule. Thus, one potential application of our work, among others, is to build a detailed ¹³C CS look-up table covering any possible combination of RNA backbone torsional angles. Given a ¹³C CS value, the corresponding set of RNA backbone torsional angles could then be quickly determined, and vice versa, making the look-up table a very valuable tool with which to determine, validate and refine RNA conformations.

Figure 1: RNA DN template with sequence AA, obtained from a random PDB structure.
C, H, O, N and P nuclei are colored in green, white, magenta, blue and yellow, respectively. The RNA backbone torsional angles are labeled with Greek letters (α, β, γ, δ, ε, ζ). The suite (from δ_i−1 to δ_i), DN and nucleotide subunits are indicated.

DOI: 10.7717/peerj.7904/fig-1

Methods

To clarify the methodology implemented in this work, a flowchart of the overall workflow is shown in Fig. 2. A theoretical dataset of RNA backbone rotamers with ¹³C′ CS values is needed to train the machine learning models that classify experimental RNA suites into rotamers. The following two sections explain how the experimental and theoretical datasets were obtained.

Figure 2: Flowchart of the general work-flow followed in this work.
The experimental data retrieval process and the theoretical data generation steps are indicated inside the green and the blue boxes, respectively. The classification step using machine learning models is indicated with an orange box. The term *rotamers* can refer either to the original backbone rotamers or to the redefined rotamer families.

DOI: 10.7717/peerj.7904/fig-2

Experimental dataset

Experimental ¹³C′ CS data for RNA molecules were retrieved from the BioMagResBank (BMRB; http://www.bmrb.wisc.edu) (Ulrich et al., 2008), along with their corresponding structures from the Protein Data Bank (PDB; https://www.rcsb.org/) (Berman et al., 2000). Because reliable experimental ¹³C′ CS values are fundamental for an accurate structural analysis, data curation was carried out using 13Check_RNA (Icazatti et al., 2018), a Python module recently developed in our group to correct systematic errors in RNA ¹³C′ CS. The resulting dataset (see Table S1) contains 26 RNA structures with ¹³C′ CS for the five ribose carbon nuclei (C1′, C2′, C3′, C4′ and C5′), providing a total of 391 suite subunits and 391 sets of ¹³C′ CS. As there are at least 8 models in the NMR ensemble of each RNA molecule (up to 20 in some cases), the complete database contains 7,612 conformations. Given that we needed a one-to-one correspondence between the sets of CS and the rotamer suites, only the first structure from each NMR ensemble was used, since the first model listed in a PDB file is usually reported as the structure with the lowest energy score. This choice has a negligible average effect on the results of our analysis (see Figs. S10 and S11).

For every PDB entry, the 3D coordinates of the first model were extracted in order to compute the backbone torsional angles (δ_i−1, ε_i−1, ζ_i−1, α_i, β_i, γ_i, δ_i) of the suites. Then, these torsional angles were used to assign the RNA suites to their corresponding rotamer names. From the 46 original rotamers, only 38 are represented in the final experimental dataset.
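As an illustration of this step, the sketch below computes a single torsional angle from four atomic positions and lists the standard four-atom definitions of the backbone torsions. It is not the code used in this work: the names `BACKBONE_TORSIONS` and `dihedral` are hypothetical, the coordinates are assumed to have already been read from the first model of a PDB file, and angles are reported in the 0 to 360 degree range to match the values in the rotamer tables.

```python
import math

# Standard four-atom definitions of the backbone torsions. Each atom is given
# as (atom name, residue offset relative to the nucleotide owning the angle).
BACKBONE_TORSIONS = {
    "alpha":   (("O3'", -1), ("P", 0), ("O5'", 0), ("C5'", 0)),
    "beta":    (("P", 0), ("O5'", 0), ("C5'", 0), ("C4'", 0)),
    "gamma":   (("O5'", 0), ("C5'", 0), ("C4'", 0), ("C3'", 0)),
    "delta":   (("C5'", 0), ("C4'", 0), ("C3'", 0), ("O3'", 0)),
    "epsilon": (("C4'", 0), ("C3'", 0), ("O3'", 0), ("P", 1)),
    "zeta":    (("C3'", 0), ("O3'", 0), ("P", 1), ("O5'", 1)),
}


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle p0-p1-p2-p3 in degrees, mapped to [0, 360)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x)) % 360.0


# Toy coordinates (not from a real structure): a trans arrangement.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0)))
```

A suite would then be described by the seven angles δ_i−1, ε_i−1, ζ_i−1, α_i, β_i, γ_i and δ_i, obtained by applying `dihedral` to the atoms listed for each torsion.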

Theoretical dataset

In order to have a complete dataset with the 46 RNA backbone rotamers and their corresponding ¹³C′ CS, a theoretical dataset was also constructed. A template for each of the 16 possible DN sequences (combinations of A, C, U and G) was obtained from RNA structures found in the PDB. A Monte Carlo conformational sampling was carried out by rotating the backbone torsional angles of the suite contained in each DN, while keeping the bond lengths and bond angles fixed (rigid geometry approximation). To perform such rotations, the torsional angle distributions for each of the 46 RNA backbone rotamer suites (Richardson et al., 2008) were used. A function that eliminates conformers with atom clashes was implemented as part of the routine. As a result, 10,340,852 conformations were generated. Quantum-level computation of CS is very time-consuming. Therefore, to reduce the number of calculations, a smaller number of conformations was selected. Aiming to keep most of the variability of the originally generated conformations, we computed the Shannon entropy S of the distribution of torsional angles (Eq. (1)):

(1) $S = -\sum_{i} P_{i} \ln P_{i}$

Here, P_i is the probability of the i-th conformation, taken from a histogram with a bin size of 5 degrees. The entropy was computed for different subsets of conformations and sample sizes (from 5 to 100) (see Fig. 3). We decided to use 80% of the maximum entropy as a cutoff, which implies a sample size of around 40 conformations per rotamer. As we also considered the 16 DN sequences, the total number of conformations computed at the DFT level of theory was 30,530.
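The entropy criterion can be sketched as follows. This is a minimal illustration rather than the authors' routine: it treats a single torsional angle (the text does not specify how the angles of a suite are combined into one histogram), and the function names `shannon_entropy` and `entropy_percentage` are hypothetical. The 5-degree bin size follows the text.

```python
import math
from collections import Counter


def shannon_entropy(angles_deg, bin_size=5.0):
    """S = -sum_i P_i ln P_i over a histogram of angles with fixed-width bins."""
    # Map each angle to [0, 360) and then to its bin index.
    counts = Counter(int((a % 360.0) // bin_size) for a in angles_deg)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def entropy_percentage(sample, population, bin_size=5.0):
    """Entropy of a sample as a percentage of the entropy of the full set."""
    return 100.0 * shannon_entropy(sample, bin_size) / shannon_entropy(population, bin_size)


# Toy angles in degrees (not real data): the sample covers 2 of the 4
# equally populated bins present in the population.
population = [1.0, 7.0, 12.0, 18.0]
sample = [1.0, 7.0]
print(entropy_percentage(sample, population))
```

Under this sketch, one would draw random subsets of increasing size for a given rotamer and DN sequence and keep the smallest size whose entropy percentage reaches the chosen cutoff (80% in the text).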

Figure 3: Percentage of entropy of the sample against sample size for a given DN sequence and rotamer (here, UU and 1a, respectively).
The red line and the blue bars represent the mean and the range of percentage of entropy for a given sample size, respectively.

DOI: 10.7717/peerj.7904/fig-3

Details of the quantum-chemical calculations of the ¹³C′ shieldings

Prior to the DFT calculations on the final dataset, a test was performed on a subset of 41 rotamers of sequence AA. An approach similar to the one described below for mononucleotides was used, except that the templates were methyl-blocked DNs: Me − O3′_i−2 − A_i−1 − A_i − O5′_i+1 − Me. The ¹³C′ CS obtained for these DNs agreed with those obtained from the corresponding mononucleotides to within 10⁻² ppm, while the total computation time for the mononucleotides was approximately half of that required for the complete DNs. Thus, the DN conformations from the final dataset were split into their corresponding mononucleotide subunits. Nucleotide subunits were treated as terminally blocked mononucleotides with methyl groups (Me) at both termini (Me − O3′_i−1 − X_i − O5′_i+1 − Me). Phosphate groups of the backbone were treated as neutral, because we assume that all backbone charges are shielded during the quantum-chemical calculations. Results based on the analysis of 139 conformations of ubiquitin at pH 6.5 (Vila & Scheraga, 2008) indicate that the use of neutral, rather than charged, amino acids is a significantly better approximation of the observed ¹³C^α CS in solution for the acidic groups, and a slightly better representation, though significantly less expensive in computational time, for the basic groups. Considering that the phosphate group in RNA is close to the nucleus of interest (as is the case for the acidic groups), we assume that a neutral rather than charged phosphate group is a better approximation for the computation of the ¹³C′ CS in the RNA suites. This approach was also adopted because, under physiological conditions, the phosphate groups are completely ionized and neutralized by positive charges (Lehninger, Nelson & Cox, 2000).

A 6-311+G(2d,p) locally dense basis set (Chesnut & Moore, 1989) was used for the backbone ¹³C′ nuclei and their nearest-neighbor nuclei, at the DFT level of approximation (see Fig. 4 for details). The remaining nuclei were treated with a 3-21G basis set. The OB98 density functional was used because good results were previously observed for proteins and glycans in our group (Vila & Scheraga, 2009; Garay et al., 2014). All DFT computations were done using the Gaussian package (Frisch et al., 2004). In summary, the adopted strategies make the ¹³C′ CS computed from mononucleotides suitable for comparison with the ¹³C′ CS observed for complete RNA molecules.

Figure 4: Example of a methyl blocked mononucleotide used for DFT calculations.
The locally-dense basis-set approach is indicated by the different colors: the nuclei in magenta were treated with the extended 6-311+G(2d,p) basis set and the nuclei in green were treated with the smaller 3-21G basis set.

DOI: 10.7717/peerj.7904/fig-4

Figure 5: Distribution plots for the six RNA backbone torsional angles α, β, γ, δ, ε and ζ in (A), (B), (C), (D), (E) and (F), respectively.
Torsional angles values were obtained from the RNA09 database used in Murray LW. 2007. RNA Backbone Rotamers and Chiropraxis. Doctoral Dissertation, Dept. of Biochemistry, Duke University, Durham, NC, USA.

DOI: 10.7717/peerj.7904/fig-5

Table 1: The 46 RNA backbone rotamers were arranged in 22, 10, 10, 7 and 4 families of rotamers based on the observed distributions of δ_i−1δ_iαγ, δ_i−1δ_iα, δ_i−1δ_iγ, αγ and δ_i−1δ_i torsional angle values, respectively.

Additionally, the 46 rotamers were separated into RNA A-form helix vs. no A-form helix rotamers in two ways: (i) the A-form helix rotamer 1a vs. the remaining no A-form helix rotamers (A_noA families) and (ii) rotamers related to the A-form helix (i.e., 1a, 3d, 3b, 5d, 0a, 6d, 4b) vs. the remaining rotamers (A*_noA* families).

46 rotamers	22 families δ_i−1δ_iαγ	10 families δ_i−1δ_iα	10 families δ_i−1δ_iγ	7 families αγ	4 families δ_i−1δ_i	2 families A_noA (i)	2 families A*_noA* (ii)
&a	e	a	a	e	a	b	b
#a	q	c	c	e	c	b	b
0a	q	c	c	e	c	b	a
0b	t	d	d	e	d	b	b
0i	o	g	g	b	c	b	b
1[	l	b	b	e	b	b	b
1a	e	a	a	e	a	a	a
1b	l	b	b	e	b	b	b
1c	d	e	e	d	a	b	b
1e	f	e	e	f	a	b	b
1f	d	e	e	d	a	b	b
1g	c	a	a	c	a	b	b
1L	e	a	a	e	a	b	b
1m	e	a	a	e	a	b	b
1o	m	i	i	g	b	b	b
1t	k	f	f	d	b	b	b
1z	j	b	b	c	b	b	b
2[	t	d	d	e	d	b	b
2a	q	c	c	e	c	b	b
2h	r	g	g	f	c	b	b
2o	v	j	j	g	d	b	b
3a	e	a	a	e	a	b	b
3b	l	b	b	e	b	b	a
3d	a	a	a	a	a	b	a
4a	q	c	c	e	c	b	b
4b	t	d	d	e	d	b	a
4d	n	c	c	a	c	b	b
4g	p	c	c	c	c	b	b
4n	o	g	g	b	c	b	b
4p	s	d	d	a	d	b	b
4s	u	h	h	f	d	b	b
5d	a	a	a	a	a	b	a
5j	b	e	e	b	a	b	b
5q	h	f	f	b	b	b	b
5z	j	b	b	c	b	b	b
6d	n	c	c	a	c	b	a
6g	p	c	c	c	c	b	b
6j	o	g	g	b	c	b	b
6n	o	g	g	b	c	b	b
6p	s	d	d	a	d	b	b
7a	e	a	a	e	a	b	b
7d	a	a	a	a	a	b	b
7p	g	b	b	a	b	b	b
7r	i	i	i	c	b	b	b
8d	n	c	c	a	c	b	b
9a	e	a	a	e	a	b	b

DOI: 10.7717/peerj.7904/table-1

Families of rotamers

The original 46 RNA backbone rotamers were grouped into families based on their δ_i−1, δ_i, α and γ torsional angle values. Only these four (out of seven) backbone torsional angles in the suite subunit were chosen to group the rotamers because the distributions of their observed values are bimodal (δ_i−1 and δ_i) or trimodal (γ and α), with clearly separated peaks (see Fig. 5). This selection allowed us to group rotamers according to the peak in which each torsional angle value falls. As summarized in Table 1, four families were found when both δ_i−1 and δ_i torsional angles in the suite were used (see Table 2), seven families for the αγ combination, 10 families each for δ_i−1δ_iα and δ_i−1δ_iγ, and 22 families for δ_i−1δ_iαγ. In order to evaluate the classification performance for RNA A-form helix conformations, the rotamers were also grouped as: (i) A_noA families, where the 46 rotamers were separated into A-form helix (1a) vs. no A-form helix rotamers, and (ii) A*_noA* families, where the 46 rotamers were separated into rotamers related to the A-form helix (1a, 3d, 3b, 5d, 0a, 6d and 4b, as marked in Table 1) vs. the remaining rotamers.
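The grouping by peaks can be illustrated for the simplest case, the four δ_i−1δ_i families. In the sketch below, the 115-degree threshold separating the two δ peaks is an assumption (roughly midway between the representative values near 83 and 146 degrees in Table 2), and the function names are hypothetical; the family labels a to d follow Table 2.

```python
# Assumed cutoff between the two delta peaks (not taken from the paper).
DELTA_THRESHOLD = 115.0

# Family labels as in Table 2: (delta_{i-1} peak, delta_i peak) -> family.
DELTA_FAMILIES = {
    ("low", "low"): "a",
    ("low", "high"): "b",
    ("high", "low"): "c",
    ("high", "high"): "d",
}


def delta_peak(delta_deg, threshold=DELTA_THRESHOLD):
    """Assign a delta value to the lower or the higher peak of its distribution."""
    return "low" if delta_deg < threshold else "high"


def delta_family(delta_prev, delta_curr):
    """Family of a suite from its delta_{i-1} and delta_i angles (degrees)."""
    return DELTA_FAMILIES[(delta_peak(delta_prev), delta_peak(delta_curr))]


# Representative rotamers from Table 2: 1a, 1b, 2a and 2[.
for name, d_prev, d_curr in [("1a", 81, 81), ("1b", 84, 145),
                             ("2a", 145, 84), ("2[", 146, 148)]:
    print(name, delta_family(d_prev, d_curr))
```

The larger families (δ_i−1δ_iα, δ_i−1δ_iγ, αγ and δ_i−1δ_iαγ) would follow the same idea, with three peaks instead of two for α and γ.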

Classification

A series of machine learning methods were used to classify RNA suites as rotamers (or families of rotamers) based on their ribose ¹³C′ CS values. The following classification methods from the scikit-learn Python library (Pedregosa et al., 2011) were trained: K-Nearest Neighbors (NN), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM) and a class of neural network called Multi-Layer Perceptron (MLP). Different model parameters were tried out (see Table S3). A random sampling algorithm, in which suites were classified randomly, was also used as a control. The sequence of the suite was considered for classification, because we found that the performance increased compared to a sequence-independent classification (see Fig. S10). The classification performance was assessed with four measures: weighted accuracy, precision, recall and F₁ score (Van Rijsbergen, 1979). The weighted accuracy was used in order to recalibrate the contribution of the different rotamers, because the observed frequency of the rotamers is highly uneven (see Fig. S1). The weights used in the weighted accuracy were obtained from a substitution matrix (ROSUM, for ROtamers SUbstitution Matrix). The definition of the ROSUM matrix was inspired by the BLOSUM matrix used for protein sequence alignment (Henikoff & Henikoff, 1992). The matrix is used to weight the match or mismatch between the true rotamer and the predicted rotamer, as a function of the Euclidean distance between rotamers (in the seven-dimensional space of the suite backbone torsional angles) and the observed frequency of each rotamer. The torsional angle values and the observed frequencies were extracted from the rotamer table (Richardson et al., 2008). A ROSUM matrix was obtained for each of the rotamer families described in the previous section. Further details on the construction of the ROSUM matrices are provided in Data Section S4.

Precision and recall were used because they give a general overview of the performance of the method. In particular, they allowed us to assess the fraction of classified items that were correctly identified (precision) and the sensitivity of the method (recall). The F₁ score was also used as a performance measure because it is the harmonic mean of precision and recall and, as such, it gives a more realistic measure of the classifier’s performance.
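To make the three unweighted measures concrete, the following sketch computes precision, recall and F₁ for one class from lists of true and predicted labels. It is a plain-Python illustration with a hypothetical function name and toy labels; it does not reproduce the ROSUM-weighted accuracy, whose construction is given only in the supplementary material.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 score for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (not real results): four suites, two rotamer labels.
y_true = ["1a", "1a", "1b", "1b"]
y_pred = ["1a", "1b", "1b", "1b"]
print(precision_recall_f1(y_true, y_pred, "1a"))
print(precision_recall_f1(y_true, y_pred, "1b"))
```

Because the harmonic mean is dominated by the smaller of its two arguments, a classifier cannot reach a high F₁ by maximizing precision at the expense of recall, or the reverse.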

Table 2: Mean torsional angle values (in degrees) of the representative (i.e., most frequent) rotamers from the four δ_i−1δ_i families.

Values were extracted from the rotamer table of Richardson et al. (2008).

δ_i−1δ_i families	46 rotamers	δ_(i−1)	ε_(i−1)	ζ_(i−1)	α_(i)	β_(i)	γ_(i)	δ_(i)
a	1a	81	212	289	295	174	54	81
b	1b	84	215	289	300	177	58	145
c	2a	145	260	289	288	193	53	84
d	2[	146	259	291	292	210	54	148

DOI: 10.7717/peerj.7904/table-2

Experimental vs. theoretical

The classification models trained with theoretical data were used to classify the experimental suites. The results of the theoretical calculations (described in a previous section) are theoretical NMR isotropic shieldings (σ). The theoretical shieldings (σ_comp) must be subtracted from a reference shielding value (σ_ref) to be transformed into theoretical CS (δ_comp), which can then be compared with the experimental CS (δ_exp) (Eq. (2)):

(2) $\delta_{comp} = \sigma_{ref} - \sigma_{comp}$

A simple reference value of σ_ref = 185.00 ppm was used, which is very close to the theoretical isotropic shielding for TMS (σ_TMS,th) (Vila & Scheraga, 2009) and is consistent with the reference value previously defined for proteins and glycans. Alternatively, a set of effective references was obtained as a function of: (i) the nitrogenous base sequence, (ii) the combinations of ribose puckering states in the four families of rotamers obtained from the δ_i−1δ_i torsional angle distributions, (iii) the mean ¹³C′ CS values of the five carbon nuclei and (iv) a linear regression between theoretical and experimental ribose ¹³C′ CS values for a set of suites (see Table S2).
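The conversion from shielding to chemical shift amounts to a single subtraction, as the short sketch below shows. The reference value of 185.00 ppm is the one quoted in the text; the function name and the shielding values in the example are hypothetical and serve only to show the arithmetic.

```python
SIGMA_REF = 185.00  # ppm, reference shielding quoted in the text


def shielding_to_shift(sigma_comp, sigma_ref=SIGMA_REF):
    """Convert a computed isotropic shielding (ppm) into a chemical shift (ppm)."""
    return sigma_ref - sigma_comp


# Made-up shieldings for the five ribose carbons of one nucleotide.
shieldings = {"C1'": 92.0, "C2'": 110.0, "C3'": 112.5, "C4'": 102.0, "C5'": 120.0}
shifts = {nucleus: shielding_to_shift(s) for nucleus, s in shieldings.items()}
print(shifts)
```

An effective reference of the kind mentioned above would simply replace `SIGMA_REF` with a value chosen per sequence, per puckering family or per nucleus.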

Theoretical vs. theoretical

The classification models trained with theoretical data were also used to classify the theoretical suites. In this case, classification was assessed through leave-one-out cross-validation (LOO-CV). In LOO-CV, at every iteration a single suite is held out of the dataset as the test set and the remaining suites are used for training. This process continues until every suite in the theoretical dataset has been evaluated.
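The LOO-CV loop can be sketched with a one-nearest-neighbor classifier written in plain Python. This is an illustration under simplifying assumptions, not the pipeline used in this work: the function names are hypothetical, the feature vectors are toy numbers standing in for sets of ribose ¹³C′ CS, and plain accuracy is reported instead of the weighted accuracy and F₁ score used in the paper.

```python
import math


def nearest_label(query, train_X, train_y):
    """Label of the training vector closest to the query (Euclidean distance)."""
    best = min(range(len(train_X)), key=lambda j: math.dist(query, train_X[j]))
    return train_y[best]


def loo_cv_accuracy(X, y):
    """Leave-one-out cross-validation: hold out each item once, train on the rest."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_label(X[i], train_X, train_y) == y[i]
    return hits / len(X)


# Toy two-dimensional feature vectors (ppm-like numbers, not real data).
X = [[92.0, 75.1], [92.3, 75.4], [88.0, 77.9], [88.2, 78.3]]
y = ["a", "a", "b", "b"]
print(loo_cv_accuracy(X, y))
```

With the scikit-learn estimators named in the Classification section, the same loop would call each model's fit and predict methods on the training and held-out data in place of `nearest_label`.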

Experimental vs. experimental

LOO-CV was also used to classify the experimental suites, with the models trained on the experimental data.

Results and Discussion

For the experimental vs. theoretical classification (Fig. 6), the 46 rotamers can be classified by means of backbone ¹³C′ CS with a maximal F₁ score of 0.34 (see Table S5). When the 46 rotamers are grouped into families based on their torsional angle distributions, the highest scores correspond to the use of the δ_(i−1) and δ_(i) torsional angles, for which all the classifiers gave maximal scores above 0.65. This result is in agreement with the fact that backbone ¹³C′ CS are highly sensitive to ribose puckering states (Giessner-Prettre & Pullman, 1987), since the δ torsional angle is directly related to the ribose puckering (Gelbin et al., 1996). The δ_i−1δ_iγ, δ_i−1δ_iα, δ_i−1δ_iαγ and αγ families also show improved scores over the classification of the 46 rotamers. The A*_noA* and A_noA families show low classification scores relative to their random-choice classification scores, which means that backbone ¹³C′ CS cannot distinguish between A-form helix and no A-form helix rotamers. In general, the use of more complex classifier models such as Neural Networks, Support Vector Machine, Decision Tree and Random Forest does not guarantee a better performance for the current task; thus, the simpler Nearest Neighbor model can be chosen for classification into RNA rotamers.

In both the theoretical vs. theoretical and the experimental vs. experimental classifications (see Figs. 7 and 8, respectively), the performance increases for every group of families compared to the experimental vs. theoretical classification. In the theoretical vs. theoretical classification, the performance values are very close to 1.0 for the δ_i−1δ_i families and for A-form helix vs. no A-form helix rotamers (A_noA), and the performance value ranges are particularly narrow, except for the MLP and SVM classifiers.

Figure 6: Box-plots with the weighted accuracy and F₁ score for the experimental vs. theoretical classification of rotamers and families of rotamers, using Nearest Neighbor (NN), Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) classifiers.
A random-choice (RAND) algorithm was used as a baseline reference. Classification results for the 46, δ_i−1δ_iαγ, δ_i−1δ_iα, δ_i−1δ_iγ, αγ, δ_i−1δ_i, A*_noA* and A_noA rotamer families are shown in (A), (B), (C), (D), (E), (F), (G) and (H), respectively. The highest values of weighted accuracy and F₁ score for the experimental vs. theoretical classification, along with the parameters of the classifiers, are provided in Tables S4 and Fig. S1. Precision and recall are shown in Fig. S12.

DOI: 10.7717/peerj.7904/fig-6

Figure 7: Box-plots with the weighted accuracy and F₁ score for the theoretical vs. theoretical classification of rotamers and families of rotamers, using Nearest Neighbor (NN), Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) classifiers.
A random-choice (RAND) algorithm was used as a baseline reference. Classification results for the 46, δ_i−1δ_iαγ, δ_i−1δ_iα, δ_i−1δ_iγ, αγ, δ_i−1δ_i, A*_noA* and A_noA rotamer families are shown in (A), (B), (C), (D), (E), (F), (G) and (H), respectively.

DOI: 10.7717/peerj.7904/fig-7

Figure 8: Box-plots with the weighted accuracy and F₁ score for the experimental vs. experimental classification of rotamers and families of rotamers, using Nearest Neighbor (NN), Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) classifiers.
A random-choice (RAND) algorithm was used as a baseline reference. Classification results for the 46, δ_i−1δ_iαγ, δ_i−1δ_iα, δ_i−1δ_iγ, αγ, δ_i−1δ_i, A*_noA* and A_noA rotamer families are shown in (A), (B), (C), (D), (E), (F), (G) and (H), respectively.

DOI: 10.7717/peerj.7904/fig-8

The high scores obtained for the theoretical vs. theoretical classification indicate that ¹³C′ CS are in fact very sensitive to changes in the torsional angles, the only variable we changed in the construction of the theoretical dataset. At the same time, the lower performance obtained in the experimental vs. theoretical classification signals that the atomistic model used for the DFT computations is not good enough to reproduce the experimental observations.

One reason why the theoretical vs. theoretical classification gives better results than both the experimental vs. experimental and the experimental vs. theoretical classifications could be that the experimental database is very sparse whereas the theoretical dataset is dense; in other words, the coverage of the theoretical dataset is much better than that of the experimental one. To explore whether this is a reasonable explanation, we removed elements from the theoretical dataset to mimic the sparsity of the experimental dataset (see Fig. S13). We found that, while the weighted accuracy decreased (on average by 0.09 points), this is not enough to explain the lower performance of the experimental vs. theoretical (on average 0.31 points lower) or experimental vs. experimental (on average 0.16 points lower) classifications. In another experiment, noise on the order of the expected error (1.47 ppm) between experimental and theoretical ¹³C′ CS for the correctly classified rotamers was added to the theoretical ¹³C′ CS, and a theoretical vs. theoretical + noise classification was then performed (see Fig. S14). Both tests reinforce the idea discussed in the previous paragraph, i.e., that we need a better model for the theoretical DFT computations. These experiments also provide indirect evidence that the accuracy of the experimental vs. experimental classification will improve as more RNA conformations are deposited in databases, giving another incentive to determine and deposit RNA structures and ¹³C′ CS data.

Conclusion

In this work, we explored the use of RNA backbone ¹³C′ CS to classify backbone conformations into rotamers and families of rotamers. In general, our study led us to the following conclusions: (1) the classification of the rotamer families defined by the δ torsional angles (see Table 2), which are directly related to the ribose puckering states, gives the best performance, in line with the results previously described by other authors; (2) classification of A-form helix and no A-form helix rotamers using ¹³C′ CS is not better than a random classification; (3) the performance achieved using the simple Nearest Neighbor method is on a par with that of more complex classifiers such as Neural Networks, Support Vector Machine, Decision Tree and Random Forest; (4) ¹³C′ CS values are able to sense changes in torsional angles, but they are also affected by other factors, thus future DFT computations of RNA ¹³C′ CS should use more complex models than the one used in this work; (5) experimental ¹³C′ CS can be useful to identify RNA rotamers if the rotamers are regrouped into a smaller number of families, as the 46 rotamers seem to be too fine a description for accurate discrimination in terms of ¹³C′ CS; (6) the usefulness of ¹³C′ CS for rotamer identification should improve as more RNA structures and experimental ¹³C′ CS become available.

Supplemental Information

Supplementary Data

DOI: 10.7717/peerj.7904/supp-1

[1] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. 2000. The protein data bank. Nucleic Acids Research 28(1):235-242

[2] Brown JD, Summers MF, Johnson BA. 2015. Prediction of hydrogen and carbon chemical shifts from RNA using database mining and support vector regression. Journal of Biomolecular NMR 63(1):39-52

[3] Chesnut DB, Moore KD. 1989. Locally dense basis sets for chemical shift calculations. Journal of Computational Chemistry 10(5):648-659

[4] Eddy SR. 2001. Noncoding RNA genes and the modern RNA world. Nature Reviews Genetics 2(12):919-929

[5] Frank AT, Law SM, Ahlstrom LS, Brooks CL. 2015. Predicting protein backbone chemical shifts from Cα coordinates: extracting high resolution experimental observables from low resolution models. Journal of Chemical Theory and Computation 11(1):325-331

[6] Frank AT, Law SM, Brooks CL. 2014. A simple and fast approach for predicting H-1 and C-13 chemical shifts: toward chemical shift-guided simulations of RNA. Journal of Physical Chemistry B 118(42):12168-12175

[7] Frank AT, Stelzer AC, Bae S-h. 2013. Prediction of RNA ¹H and ¹³C chemical shifts: a structure based approach. The Journal of Physical Chemistry B 117(43):13497-13506

[8] Frisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA, Cheeseman JR, Montgomery Jr JA, Vreven T, Kudin KN, Burant JC, Millam JM, Iyengar SS, Tomasi J, Barone V, Mennucci B, Cossi M, Scalmani G, Rega N, Petersson GA, Nakatsuji H, Hada M, Ehara M, Toyota K, Fukuda R, Hasegawa J, Ishida M, Nakajima T, Honda Y, Kitao O, Nakai H, Klene M, Li X, Knox JE, Hratchian HP, Cross JB, Bakken V, Adamo C, Jaramillo J, Gomperts R, Stratmann RE, Yazyev O, Austin AJ, Cammi R, Pomelli C, Ochterski JW, Ayala PY, Morokuma K, Voth GA, Salvador P, Dannenberg JJ, Zakrzewski VG, Dapprich S, Daniels AD, Strain MC, Farkas O, Malick DK, Rabuck AD, Raghavachari K, Foresman JB, Ortiz JV, Cui Q, Baboul AG, Clifford S, Cioslowski J, Stefanov BB, Liu G, Liashenko A, Piskorz P, Komaromi I, Martin RL, Fox DJ, Keith T, Al-Laham MA, Peng CY, Nanayakkara A, Challacombe M, Gill PMW, Johnson B, Chen W, Wong MW, Gonzalez C, Pople JA. 2004. Gaussian 03, Revision C.02. Wallingford CT: Gaussian, Inc.

[9] Garay PG, Martin OA, Scheraga HA, Vila JA. 2014. Factors affecting the computation of the ¹³C shielding in disaccharides. Journal of Computational Chemistry 35(25):1854-1864

[10] Garay PG, Vila JA, Martin OA. 2018. CheSweet: an application to predict glycan’s chemical shifts. The Journal of Open Source Software 3(21):488

[11] Gelbin A, Schneider B, Clowney L, Hsieh SH, Olson WK, Berman HM. 1996. Geometric parameters in nucleic acids: sugar and phosphate constituents. Journal of the American Chemical Society 118(3):519-529

[12] Giessner-Prettre C, Pullman B. 1987. Quantum mechanical calculations of NMR chemical shifts in nucleic acids. Quarterly Reviews of Biophysics 20(3–4):113-172

[13] Henikoff S, Henikoff JG. 1992. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 89(22):10915-10919

[14] Icazatti AA, Martin OA, Villegas M, Szleifer I, Vila JA. 2018. 13Check_RNA: a tool to evaluate ¹³C chemical shift assignments of RNA. Bioinformatics 34(23):4124-4126

[15] Lehninger AL, Nelson DL, Cox MM. 2000. Lehninger principles of biochemistry. New York: Worth Publishers.

[16] Martin OA, Arnautova YA, Icazatti AA, Scheraga HA, Vila JA. 2013. Physics-based method to validate and repair flaws in protein structures. Proceedings of the National Academy of Sciences of the United States of America 110(42):16826-16831

[17] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12:2825-2830

[18] Richardson JS, Schneider B, Murray LW, Kapral GJ, Immormino RM, Headd JJ, Richardson DC, Ham D, Hershkovits E, Williams LD, Keating KS, Pyle AM, Micallef D, Westbrook J, Berman HM. 2008. RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium contribution). RNA 14(3):465-481

[19] Shen Y, Bax A. 2010. SPARTA+: a modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. Journal of Biomolecular NMR 48(1):13-22

[20] Sripakdeevong P, Cevec M, Chang AT, Erat MC, Ziegeler M, Zhao Q, Fox GE, Gao X, Kennedy SD, Kierzek R, Nikonowicz EP, Schwalbe H, Sigel RKO, Turner DH, Das R. 2014. Structure determination of noncanonical RNA motifs guided by ¹H NMR chemical shifts. Nature Methods 11(4):413-416

[21] Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z, Nakatani E, Schulte CF, Tolmie DE, Wenger RK, Yao H, Markley JL. 2008. BioMagResBank. Nucleic Acids Research 36(Suppl. 1):402-408

[22] Van Rijsbergen C. 1979. Information Retrieval. Newton: Butterworth-Heinemann.

[23] Vila JA, Scheraga HA. 2008. Factors affecting the use of ¹³Cα chemical shifts to determine, refine, and validate protein structures. Proteins: Structure, Function, and Bioinformatics 71(2):641-654

[24] Vila JA, Scheraga HA. 2009. Assessing the accuracy of protein structures by quantum mechanical computations of ¹³Cα chemical shifts. Accounts of Chemical Research 42(10):1545-1553

[25] Wan Y, Kertesz M, Spitale RC, Segal E, Chang HY. 2011. Understanding the transcriptome through RNA structure. Nature Reviews Genetics 12(9):641-655

46 rotamers	22 families δ_i−1δ_iαγ	10 families δ_i−1δ_iα	10 families δ_i−1δ_iγ	7 families αγ	4 families δ_i−1δ_i	2 families A_noAⁱ	2 families A_noAⁱⁱ
&a	e	a	a	e	a	b	b
#a	q	c	c	e	c	b	b
0a	q	c	c	e	c	b	a
0b	t	d	d	e	d	b	b
0i	o	g	g	b	c	b	b
1[	l	b	b	e	b	b	b
1a	e	a	a	e	a	a	a
1b	l	b	b	e	b	b	b
1c	d	e	e	d	a	b	b
1e	f	e	e	f	a	b	b
1f	d	e	e	d	a	b	b
1g	c	a	a	c	a	b	b
1L	e	a	a	e	a	b	b
1m	e	a	a	e	a	b	b
1o	m	i	i	g	b	b	b
1t	k	f	f	d	b	b	b
1z	j	b	b	c	b	b	b
2[	t	d	d	e	d	b	b
2a	q	c	c	e	c	b	b
2h	r	g	g	f	c	b	b
2o	v	j	j	g	d	b	b
3a	e	a	a	e	a	b	b
3b	l	b	b	e	b	b	a
3d	a	a	a	a	a	b	a
4a	q	c	c	e	c	b	b
4b	t	d	d	e	d	b	a
4d	n	c	c	a	c	b	b
4g	p	c	c	c	c	b	b
4n	o	g	g	b	c	b	b
4p	s	d	d	a	d	b	b
4s	u	h	h	f	d	b	b
5d	a	a	a	a	a	b	a
5j	b	e	e	b	a	b	b
5q	h	f	f	b	b	b	b
5z	j	b	b	c	b	b	b
6d	n	c	c	a	c	b	a
6g	p	c	c	c	c	b	b
6j	o	g	g	b	c	b	b
6n	o	g	g	b	c	b	b
6p	s	d	d	a	d	b	b
7a	e	a	a	e	a	b	b
7d	a	a	a	a	a	b	b
7p	g	b	b	a	b	b	b
7r	i	i	i	c	b	b	b
8d	n	c	c	a	c	b	b
9a	e	a	a	e	a	b	b

δ_i−1δ_i families	46 rotamers	δ_(i−1)	ε_(i−1)	ζ_(i−1)	α_(i)	β_(i)	γ_(i)	δ_(i)
a	1a	81	212	289	295	174	54	81
b	1b	84	215	289	300	177	58	145
c	2a	145	260	289	288	193	53	84
d	2[	146	259	291	292	210	54	148
