Some of the 20 proteinogenic amino acids are non-polar and hydrophobic, others are negatively or positively charged. The cysteine side chain can form covalent bonds, in the form of disulfide bridges, with other cysteines. The folding of a protein chain is therefore determined by the disulfide bonds between its cysteines and by the sets of weak non-covalent bonds that form between the other amino acids of the chain (Anfinsen, 1973). Another factor that drives folding is the tendency of hydrophobic molecules, including the non-polar side chains of amino acids, to cluster in the interior of the molecule, which allows these side chains to avoid contact with the aqueous environment of the cytoplasm. Due to these many factors, even small changes in the DNA sequence can render a protein useless or completely change its three-dimensional structure (Alexander et al., 2007). Since the discovery of the standard genetic code (SGC; Nirenberg & Matthaei, 1961), there has been an ongoing discussion on the evolution of this code, especially because of its near universality (Vetsigian, Woese & Goldenfeld, 2006). This strong conservation inspired many theories based on adaptive, historical and chemical arguments, all assuming that the SGC is optimized: it might reflect the expansion of a more primitive code towards the inclusion of more amino acids, or it could be a consequence of direct chemical interactions between RNA and amino acids. Alternatively, stereochemical, co-evolutionary and error-minimization mechanisms might have acted in concert to assign the 20 proteinogenic amino acids to their present positions in the SGC (Knight, Freeland & Landweber, 1999).
Haig & Hurst (1991) found evidence that the SGC minimizes the effects of mistranslations, since the impact of a base substitution is much stronger for almost all of 10,000 simulated alternative triplet block codes. Their results were based on the assumption that all single-point substitutions are equally likely. Freeland & Hurst (1998) extended this work by taking into account the imbalance between transitions and transversions. They showed that the relative efficiency of the genetic code even increases at any reasonable transition/transversion bias. In addition, when they incorporated this bias into the model of translational errors, only one in every million randomly generated codes proved to be more efficient than the natural code, again suggesting that the code might be the product of selection. Recently, Itzkovitz, Hodis & Segal (2010) proposed that the SGC is also optimized in another respect, i.e., the insertion of overlapping information within protein-coding regions. This feature strongly correlates with the robustness of the natural code against frameshift mutations. Such mutations can have very severe effects on an amino acid sequence, because a single shift impacts all triplets downstream (Drummond & Wilke, 2008). On the other hand, these non-coding blocks inside coding sequences appear to be particularly necessary in higher animals (such as mammals) to allow for complex gene regulation processes. In addition to the robustness of the coding sequences, the molecular dynamics of DNA also seem to depend on the degeneracy of the DNA code (Babbitt et al., 2014). Thus, we might expect the use of suitable codons to play an important role outside the coding regions as well. In particular, these last three arguments suggest that the DNA code is optimized for much more than just robustness against point mutations.
Therefore, in this work we dedicate ourselves to the so-called adaptive theory, which suggests that the pattern of codon assignments is an adaptation optimizing a certain set of functions, such as the minimization of errors caused by mistranslations (Wong, 1975). We reproduced the simulations of Freeland & Hurst (1998) and extended this approach to the problem of frameshift mutations. Even though we generated a completely new set of 1,000,000 codes out of about 10^18 possible codes, we obtained essentially the same results as Freeland and Hurst. It also turned out that if the natural code evolved to be robust against mistranslations, it seems to be just as optimized against frameshift mutations. Finally, this factor seems to be more than just a side effect of point mutation stability: codes that successfully outperform the SGC for one of the features proposed in this work do not automatically reach this level for any of the other features, but the SGC does.
Data and Methods
Generating one million random codes
Haig & Hurst (1991) generated alternative encodings by permuting the 20 (amino acid) labels of the 21 codon sets of the original code. Twenty of these sets each contain all codons coding for one of the 20 proteinogenic amino acids; the remaining set contains the three stop codons. A new code is thus created by randomly assigning each of the 20 amino acids to one of the codon sets while leaving the stop codons unchanged. This shuffling keeps the basic organization of the natural code intact (i.e., its level of redundancy).
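As an illustrative sketch, this shuffling can be written in a few lines of Python (all names, such as `random_code`, are ours and not part of the original implementation; the 64-letter table string is the standard compact representation of the SGC):

```python
import random

BASES = "UCAG"
# Standard genetic code as a 64-letter string, codons ordered U, C, A, G
# at each position; '*' marks the three stop codons.
SGC_STRING = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
              "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
SGC = dict(zip(CODONS, SGC_STRING))
AMINO_ACIDS = sorted(set(SGC_STRING) - {"*"})  # the 20 one-letter labels

def random_code(rng=random):
    """Permute the 20 amino acid labels over the 20 codon blocks of the
    SGC, leaving the three stop codons in place (Haig & Hurst, 1991)."""
    shuffled = AMINO_ACIDS[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(AMINO_ACIDS, shuffled))
    return {codon: aa if aa == "*" else relabel[aa]
            for codon, aa in SGC.items()}
```

By construction, each generated code keeps the block structure, and thus the redundancy, of the SGC; only the assignment of amino acids to blocks changes.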
These permutations are drawn from a repertoire of 20! (>2.4 × 10^18) possible alternative code tables. Given this huge number, both test set sizes are rather small: not only the 10,000 codes analyzed by Haig & Hurst (1991) but also the 1,000,000 codes in the work by Freeland & Hurst (1998) are likely to miss some even better codes. However, as both studies yielded very similar code statistics, we decided to generate our own subset of 1,000,000 codes and to reproduce the results of Freeland & Hurst (1998) first.
One principal measure that we focus on in the following is the conservation of the polar requirement (PR) of every triplet after a given mutation. See Table 1 for a complete list of the exact PR values for each of the 20 proteinogenic amino acids. These values were proposed by Woese et al. (1966), who used multiple mixtures of water and dimethylpyridine to estimate a specific linear trend for each amino acid on a log–log diagram between a chromatographic measure and the mole fraction of water at different concentrations. According to these data, a mutation of, e.g., Proline (CCA, 6.6) to Threonine (ACA, 6.6) would be a silent mutation, as the PR remains unchanged. In contrast, a change from Proline to Arginine (CGA, 9.4) would lead to a significant alteration in PR.
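Using only the PR values quoted in this paragraph (Pro 6.6, Thr 6.6, Arg 9.4), the two example substitutions can be checked directly; this minimal sketch uses ad hoc names of our own:

```python
# PR values as quoted in the text (Woese et al., 1966).
PR = {"Pro": 6.6, "Thr": 6.6, "Arg": 9.4}

def pr_shift(aa_from, aa_to):
    """Squared change in polar requirement caused by one substitution."""
    return (PR[aa_to] - PR[aa_from]) ** 2

silent = pr_shift("Pro", "Thr")  # CCA -> ACA: PR unchanged
severe = pr_shift("Pro", "Arg")  # CCA -> CGA: large PR change
```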
Table 1: Polar requirement (PR) values and abbreviations of the 20 proteinogenic amino acids (Woese et al., 1966).
Change in polar requirement after point mutations
Haig & Hurst (1991) introduced a mean square (MS) measure to quantify the relative efficiency of any given code: let P(c_i) be the PR value of the amino acid represented by codon c_i. M_j(c_i) is the jth mutation of codon c_i, and thus P(M_j(c_i)) is the PR value of the mutated codon. Note that for every codon position, M is a different function describing a specific mutation, e.g., a point mutation. Furthermore, m_i is the number of possible mutations of codon c_i; again, this number depends on the mutation M. There are 61 codons that might be mutated, as we leave out the three stop codons. Hence, the squared distance D for a given mutation M is calculated by

D_M(c_i) = \sum_{j=1}^{m_i} \bigl( P(M_j(c_i)) - P(c_i) \bigr)^2.  (1)
They introduced a scaling factor F, defined as the number of possible mutations over all codons, F = \sum_i m_i. Specifically, for point mutations M_1, M_2, and M_3 at the first, second, and third position, these values are F_1 = 174 and F_2 = F_3 = 176, because in the set of all 61 × 3 = 183 possible mutations per position, nine result in stop codons at the first position and seven at each of the other two positions. The mean squared errors MS_1, MS_2, and MS_3 can then be calculated by

MS_k = \frac{1}{F_k} \sum_{i=1}^{61} D_{M_k}(c_i).  (2)
The mean squared error of mutations over all three positions is then computed by

MS_0 = \frac{F_1 \, MS_1 + F_2 \, MS_2 + F_3 \, MS_3}{F_1 + F_2 + F_3}.  (3)
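Equations (1)–(3) can be sketched for the SGC as follows. The PR values below are illustrative placeholders after Woese et al. (1966) (Table 1 in the text is authoritative), and the function names are ours:

```python
BASES = "UCAG"
# Standard genetic code as a 64-letter string; '*' marks the stop codons.
SGC_STRING = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
              "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
SGC = dict(zip(CODONS, SGC_STRING))
NONSTOP = [c for c in CODONS if SGC[c] != "*"]  # the 61 sense codons

# Illustrative PR values (after Woese et al., 1966); placeholders only.
PR = {"A": 7.0, "R": 9.4, "N": 10.0, "D": 13.0, "C": 4.8,
      "Q": 8.6, "E": 12.5, "G": 7.9, "H": 8.4, "I": 4.9,
      "L": 4.9, "K": 10.1, "M": 5.3, "F": 5.0, "P": 6.6,
      "S": 7.5, "T": 6.6, "W": 5.2, "Y": 5.4, "V": 5.6}

def squared_distance(codon, pos, code=SGC):
    """Eq. (1): summed squared PR change over the point mutations of
    `codon` at position `pos`, skipping mutations into stop codons.
    Returns the distance and the number of mutations counted."""
    d, n = 0.0, 0
    for b in BASES:
        if b == codon[pos]:
            continue
        mutant = codon[:pos] + b + codon[pos + 1:]
        if code[mutant] == "*":
            continue
        d += (PR[code[mutant]] - PR[code[codon]]) ** 2
        n += 1
    return d, n

def mean_squared_error(pos, code=SGC):
    """Eq. (2): MS_k as the total squared distance over all sense codons,
    divided by the scaling factor F_k (number of counted mutations)."""
    total, F = 0.0, 0
    for c in NONSTOP:
        d, n = squared_distance(c, pos, code)
        total, F = total + d, F + n
    return total / F, F

MS, F = {}, {}
for pos in range(3):
    MS[pos], F[pos] = mean_squared_error(pos)
# Eq. (3): combine positions, weighting each by its number of mutations.
MS0 = sum(MS[p] * F[p] for p in range(3)) / sum(F.values())
```

Running this reproduces the scaling factors F_1 = 174 and F_2 = F_3 = 176 from the text; the MS values themselves depend on the exact PR table used.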
Since transitions (U ↔ C, A ↔ G) occur more frequently than transversions (U, C ↔ A, G), Freeland & Hurst (1998) proposed a weighting that incorporates this bias. Basically, they split the set of all mutations into two parts, S (transitions) and V (transversions), with

D_M(c_i) = S_M(c_i) + V_M(c_i), where S_M(c_i) = \sum_{j=1}^{m_{i,s}} \bigl( P(M_{j,s}(c_i)) - P(c_i) \bigr)^2 and V_M(c_i) = \sum_{j=1}^{m_{i,v}} \bigl( P(M_{j,v}(c_i)) - P(c_i) \bigr)^2.  (4)

Here, M_{j,s}(c_i) is the jth transition, M_{j,v}(c_i) the jth transversion of codon c_i, and m_{i,s} and m_{i,v} are the numbers of transitions and transversions for c_i. After separating these two cases, S can be weighted by a factor ω, resulting in a weighted squared error W_ω with a weighted scaling factor F_ω:

W_ω(c_i) = ω \, S_M(c_i) + V_M(c_i), \quad F_ω = ω \, F_S + F_V.  (5)

The weighted mean squared errors over all codons, denoted WMS1_ω, WMS2_ω and WMS3_ω, respectively, are computed analogously to Eq. (2), i.e., as \frac{1}{F_ω} \sum_{i=1}^{61} W_ω(c_i) for each position.
These position-specific errors are combined into a position-independent weighted mean squared error WMS0_ω, which is defined analogously to Eq. (3):

WMS0_ω = \frac{F_{1,ω} \, WMS1_ω + F_{2,ω} \, WMS2_ω + F_{3,ω} \, WMS3_ω}{F_{1,ω} + F_{2,ω} + F_{3,ω}}.
One drawback here, of course, is the free parameter ω. For ω = 1, WMS0 equals MS0, but for other regimes, it is hard to tell which ω to choose. Interestingly, Freeland & Hurst (1998) found that the ω regime in which the natural code's stability is optimal corresponds to the transition/transversion ratio found in nature.
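The transition/transversion split of Eq. (4) amounts to classifying the nine point mutations of each codon; a minimal sketch (function names are ours):

```python
PURINES = set("AG")

def is_transition(b1, b2):
    """U <-> C and A <-> G are transitions; all other changes between
    distinct bases are transversions."""
    return (b1 in PURINES) == (b2 in PURINES)

def split_mutations(codon):
    """Partition the nine point mutations of `codon` into transitions (S)
    and transversions (V), as in Eq. (4)."""
    S, V = [], []
    for pos in range(3):
        for b in "UCAG":
            if b == codon[pos]:
                continue
            mutant = codon[:pos] + b + codon[pos + 1:]
            (S if is_transition(codon[pos], b) else V).append(mutant)
    return S, V
```

Every codon has exactly one transition and two transversions per position, so S always holds three mutants and V six; the ω weighting of Eq. (5) is then applied to the squared PR changes summed over S.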
The MS and WMS measures focus on transcriptional errors and point mutations. A third factor is so-called translational errors, i.e., the mistranslation of accurate mRNA. These errors are not equally distributed over the codon positions either; they are much more likely at the third position. Friedman & Weinstein (1964) investigated the polypeptide product resulting from in vitro translation of poly(U) mRNA. Freeland & Hurst (1998) quantified those results, as seen in Table 2, and used this combined weighting to propose a translational error measure (tMS). The squared errors for transitions and transversions are calculated as shown in Eq. (4), but this time with fixed weights ψ_S and ψ_V for S and V. Thus, the translational squared error T_ψ and its scaling factor F_ψ are defined as

T_ψ(c_i) = ψ_S \, S_M(c_i) + ψ_V \, V_M(c_i), \quad F_ψ = ψ_S \, F_S + ψ_V \, F_V.

The translational mean square deviation of mutations at all positions (tMS0) is computed analogously to Eq. (3).
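A sketch of the weighted error per codon position follows. Note that the ψ values below are placeholders of our own, not the weights from Table 2, and must be replaced by the real combined weighting:

```python
# PLACEHOLDER position-dependent mistranslation weights (psi_S, psi_V);
# substitute the actual combined weighting from Table 2.
PSI = {1: (1.0, 0.5), 2: (0.5, 0.1), 3: (1.0, 1.0)}

def translational_error(s_sum, v_sum, n_s, n_v, pos):
    """Translational squared error T_psi and scaling factor F_psi for one
    codon position, given the summed squared PR changes over transitions
    (s_sum, n_s of them) and transversions (v_sum, n_v of them)."""
    psi_s, psi_v = PSI[pos]
    T = psi_s * s_sum + psi_v * v_sum
    F = psi_s * n_s + psi_v * n_v
    return T, F
```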
Table 2: Combined weighting for translational errors (after Friedman & Weinstein, 1964; Freeland & Hurst, 1998).
Change in polar requirement after frameshift mutations
There is another important set of mutations: the so-called frameshift mutations. These describe the impact of deletions or insertions (so-called indels) of nucleotides out of or into coding sequences. Such mutations can be especially severe, as they change the reading frame of all codons downstream of the mutation. As the amino acids are represented by triplets, the reading frame can only be shifted by one or two positions to the left or right to be effectively changed; these shifts are called ±1 and ±2 frameshifts, respectively. A ±3 shift does not change the reading frame, but such mutations can still have a strong impact on the function or the shape of an affected protein.
In Table 3, the impact of frameshift mutations on the reading frame is illustrated on a short sample sequence. It can be seen that a frameshift of +1 leads to the same codons as a −2 shift; the same holds for the −1 and +2 shifts. Thus, without loss of generality, we restrict the following examination to ±1 reading frame shifts.
| Original sequence | AUCGUAGUCAAU | ⟶ | AUC GUA GUC AAU |
| Shift +1 | XAUCGUAGUCAA | ⟶ | XAU CGU AGU CAA |
| Shift −2 | CGUAGUCAAUXY | ⟶ | CGU AGU CAA UXY |
| Shift −1 | UCGUAGUCAAUX | ⟶ | UCG UAG UCA AUX |
| Shift +2 | XYAUCGUAGUCA | ⟶ | XYA UCG UAG UCA |
As with the point mutations, we consider only the non-stop codons (61 triplets). This time, the mutation is defined as follows: first a frameshift occurs, then the triplet is completed by a new character. Since we do not know this character in the general case, we have to average the change in polar requirement (PR) over all four possible nucleotides. The triplet AUU, e.g., can become AAU, CAU, GAU, or UAU after a right shift (+1), or UUA, UUC, UUG, or UUU after a left shift (−1). This gives us 4 × 61 = 244 triplets, and for each of these patterns, we can estimate the pairwise PR change between the amino acid associated with the shifted triplet and that of the original triplet.
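The completion step can be written down directly; the sketch below reproduces the eight shifted triplets of AUU given above (function names are ours):

```python
BASES = "UCAG"

def shift_right(codon):
    """+1 frameshift: a new base enters from the left and the codon's
    last base is pushed out of the triplet."""
    return {b + codon[:2] for b in BASES}

def shift_left(codon):
    """-1 frameshift: the first base is lost and a new base completes
    the triplet from the right."""
    return {codon[1:] + b for b in BASES}
```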
We further exclude from our statistics all stop codons occurring after a shift. In each direction, there are 12 codon-to-stop-codon transformations, both for the right shift mutations (+1) and for the left shift mutations (−1).
Hence, for both right and left shifts there are F_r = F_l = 232 possible codon-to-codon transformations. The squared distances D_r and D_l are calculated as shown in Eq. (1). The right-shift (rMS), left-shift (lMS) and total mean squared errors (fMS) are then given by

rMS = \frac{1}{F_r} \sum_i D_r(c_i), \quad lMS = \frac{1}{F_l} \sum_i D_l(c_i), \quad fMS = \frac{F_r \, rMS + F_l \, lMS}{F_r + F_l}.  (6)
Please note that the total squared errors for left- and right-shift mutations are identical: let X and Y be two non-stop codons such that X becomes Y after a shift mutation to the left. Then it follows directly that there is also a right-shift pair (Y → X) of the exact same two codons. In other words, the left- and right-shift measures lMS and rMS are calculated from the exact same set of codon-to-codon pairs and are therefore identical. But then the total mean squared error fMS also equals lMS and rMS.
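As a numerical cross-check of this symmetry argument, the sketch below computes rMS, lMS and fMS for the SGC; the PR values are illustrative placeholders after Woese et al. (1966) (Table 1 is authoritative), and all names are ours:

```python
import math

BASES = "UCAG"
SGC_STRING = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
              "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
SGC = dict(zip(CODONS, SGC_STRING))
NONSTOP = [c for c in CODONS if SGC[c] != "*"]

# Illustrative PR values (after Woese et al., 1966); placeholders only.
PR = {"A": 7.0, "R": 9.4, "N": 10.0, "D": 13.0, "C": 4.8,
      "Q": 8.6, "E": 12.5, "G": 7.9, "H": 8.4, "I": 4.9,
      "L": 4.9, "K": 10.1, "M": 5.3, "F": 5.0, "P": 6.6,
      "S": 7.5, "T": 6.6, "W": 5.2, "Y": 5.4, "V": 5.6}

def frameshift_ms(direction, code=SGC):
    """Mean squared PR change over all codon-to-codon frameshift pairs.
    direction = +1: a new base enters from the left (right shift);
    direction = -1: a new base completes the triplet from the right."""
    total, F = 0.0, 0
    for c in NONSTOP:
        for b in BASES:
            m = b + c[:2] if direction == +1 else c[1:] + b
            if code[m] == "*":
                continue  # shifts producing stop codons are excluded
            total += (PR[code[m]] - PR[code[c]]) ** 2
            F += 1
    return total / F, F

rMS, Fr = frameshift_ms(+1)
lMS, Fl = frameshift_ms(-1)
fMS = (Fr * rMS + Fl * lMS) / (Fr + Fl)
```

This reproduces F_r = F_l = 232 and confirms that lMS and rMS coincide, so fMS equals both.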
Combining frameshift and point mutations
Finally, we want to see, especially for the newly generated codes, whether they are top scorers in only one of the tested categories and whether there are other codes that might perform better in a combined comparison. Specifically, we combined tMS0 and fMS into a new measure ftMS, the average of the two:

ftMS = \frac{tMS0 + fMS}{2}.  (7)
We were able to reproduce the MS scores that Freeland & Hurst (1998) obtained for their set of 1,000,000 codes. Table 4 shows the similarity of the estimated errors; listed are the mean squared errors and their standard deviations. In parentheses, we give the proportion of random codes that are more conservative than the SGC; e.g., one out of 10,000 codes had a lower MS0 value than the SGC. We could also reproduce the effect of the transition-weighted error WMS0: in Fig. 1, the SGC becomes even more outstanding when ω is varied, compared against the top 15 better codes (using the MS0 score). Already for ω = 3, only three of the 15 codes retain a WMS0 score lower than that of the SGC. Remarkably, some codes even increase their score for higher ω. Table 5 lists the amino acid assignments of our top 15 codes; the three that remain better for ω ≥ 3 are marked in bold.
| Measure | This work (n = 1,000,000), Mean ± SD (P) | Freeland and Hurst (n = 1,000,000), Mean ± SD (P) |
|---|---|---|
| MS0 | 9.42 ± 1.51 (0.0001) | 9.41 ± 1.51 (0.0001) |
| MS1 | 12.05 ± 2.80 (0.0031) | 12.04 ± 2.80 (0.0031) |
| MS2 | 12.63 ± 2.60 (0.2213) | 12.63 ± 2.60 (0.2216) |
| MS3 | 3.59 ± 1.50 (0.0001) | 3.59 ± 1.50 (0.0001) |
Items in bold are the three codes that still outperformed the SGC after applying the transition/transversion weighting (see Fig. 1).
The third series of calculations that Freeland and Hurst carried out addresses translational errors rather than errors resulting from point mutations. Figure 2B shows the distribution of tMS values for this work's set of one million random codes. Descriptive statistics of the distribution of the variant codes of this work, in comparison with the statistics of Freeland and Hurst, are given in Table 6, along with the tMS value obtained for the SGC.
| | This work (n = 1,000,000) | Freeland and Hurst (n = 1,000,000) |
|---|---|---|
| Mean ± SD | 7.63 ± 1.35 | 7.63 ± 1.35 |
| # better codes | 2 | 1 |
Freeland and Hurst found only one out of one million random codes with a lower tMS score than the SGC. With our set of one million variant codes, two better codes were found. To estimate the number of better codes more precisely, a set of ten million unique variant codes was created, which led to the estimate that the probability of a code as efficient as the SGC arising by chance alone is Pt = 0.0000028. Although this number is nearly three times higher than the probability estimated by Freeland & Hurst (1998), and although the precise quantification of mistranslations may be questioned, the SGC shows clear evidence of structure. Its efficiency is indeed two orders of magnitude higher than previously obtained with the MS or WMS measures.
In Fig. 2B, the histogram for the tMS0 values can be seen. Again, the arrow indicates the bin that includes the tMS0-score of the SGC. Remarkably, only two random codes are found with a lower tMS0 value than the natural code. Therefore, the probability of a code as efficient as the natural code arising by chance alone is Pt = 0.000002.
Using the proposed model for frameshift mutations (fMS), we estimated the fMS values for the same 1,000,000 random codes used in the previous paragraph and compared them with that of the SGC. As can be seen in the histogram in Fig. 2C, the SGC again outperforms most codes. We then combined both scores (tMS0 and fMS) into the ftMS score (see Eq. (7)). This score is low only for codes that score low on both underlying measures. As it turned out, most codes are not well suited to both measures; globally, however, the SGC gains ground compared to the other codes (see Fig. 2D).
Freeland & Hurst (1998) published in their work the first 15 codes that outperformed the SGC regarding the MS0 score. We compared these 15 codes with respect to the fMS score (see Table 7). Similarly to our results, there are some codes that improve and some that worsen as the transition weighting increases. Code 13 has the lowest WMS01 score and remains better than the SGC for the WMS02 score (along with two others). With regard to the tMS0 score, however, all three of these codes fail; only code 2 achieves a better result than the SGC. Regarding the fMS score, only code 13 performs better, the same code that surpassed the SGC's WMS0 scores.
Finally, Goldman (1993) used a so-called record-to-record travel algorithm (Dueck, 1993) to estimate the global optimum for the point mutation scenario. Recently, Buhrman et al. (2011) analytically proved that this code really is the optimal solution. Its coding table can be seen in Fig. 3B. We used this code for comparison against the SGC. Most random codes that were better than the SGC typically did not perform well on the fMS values. But even though the Goldman code is not optimal for the fMS value, it still outperformed the SGC in this respect. The individual scores are summarized in Table 8.
| Measure | SGC | Code 672469 | Code 648040 | Code 218957 | Goldman's Code |
|---|---|---|---|---|---|
| | 3.1 × 10^−5 | 0 | 8.9 × 10^−5 | 3.5 × 10^−4 | 0 |
| Pt | 2.0 × 10^−6 | 6.0 × 10^−5 | 0 | 4.6 × 10^−2 | 0 |
| Pf | 2.7 × 10^−4 | 7.8 × 10^−1 | 2.7 × 10^−1 | 0 | 5 × 10^−5 |
In this work, we restricted our analysis to the same subset of triplet codes as Haig & Hurst (1991): random permutations of the amino acids based on the block structure of the Standard Genetic Code (SGC). This means that all newly generated codes still exhibit the same degeneracy of the third position as the SGC. Accordingly, all silent mutations of the SGC remain silent mutations in each of the new codes, i.e., the degeneracy of the third position has zero effect on the stability considerations made in this work.
Looking at the millions of simulated alternative block codes, both the analysis of Freeland & Hurst (1998) and our work show that among all these codes, the SGC stands out in particular for the conservation of polar requirement. Freeland & Hurst (1998) have also shown that this is not the case for a variety of other biochemical features. We showed that there are better codes than the SGC regarding both error types: the global optimum for point mutations also outperforms the SGC regarding frameshift mutations (Goldman, 1993). And even if we just look at the first 15 codes that Freeland & Hurst (1998) found to be more robust to point mutations than the SGC, the proposed code 13 is even more robust than Goldman's code regarding frameshift mutations (see Tables 7 and 8). Obviously, it is not too difficult to find a code on a global scale that is more robust than the SGC. However, the SGC appears to be effectively optimized for these two features by the evolutionary processes that might have played a role here.
Haig & Hurst (1991) concluded that the primary selective force is the minimization of the effects of codon-anticodon mismatch during translation. Freeland & Hurst (1998) then found further evidence for this hypothesis: when transitions are weighted twice as heavily as transversions, the relative efficiency of the natural code increases. In addition, the observed behavior of the SGC is not common to all codes: seven of the 15 codes with the best MS0 scores (all better than the SGC) even decreased in robustness with increasing ω. Finally, when the rather crude model of mistranslations is included, the relative efficiency of the SGC increases by two orders of magnitude (indicating that the SGC is "one in a million").
Despite the strong evidence that the SGC might have evolved to conserve amino acid polar requirement against mistranslations, there is no doubt that the generation of stop codons through mutation might also play an important role. Seligmann & Pollock (2004) examined this phenomenon and argued that it would be optimal to enhance the probability of stop codon generation after frameshifts, thus silencing a damaged gene early and saving energy and resources of the biosynthetic machinery. In this work, we completely omitted the stop codons and focused instead on the conservation ability of the SGC, as proposed by Freeland & Hurst (1998), and on the systematic alteration under frameshift mutations, as proposed by Itzkovitz, Hodis & Segal (2010).
We found clear evidence that the SGC, along its unique evolution to what it is today, has also become significantly more robust against frameshift mutations than most random codes. We have further shown that the proposed measure of frameshift stability is symmetric, i.e., equally robust for left and right reading frame shifts. Please note, however, that this might not hold for real sequences, due to the imbalances of nucleotide and codon frequencies in real sequences; this might be an interesting question for further investigation. In our sample of one million random codes, only 267 codes outperformed the SGC in terms of polar requirement conservation.
One might argue that a code optimized to withstand even a frameshift mutation might be protected against point mutations as a side effect. Therefore, we also examined our codes with a mixed measure, the average of the general point mutation measure (tMS0) and the frameshift mean squared error (fMS). Interestingly, we did not find a single permutation within our million samples that is better than the SGC for this combined measure (ftMS). Hence, the results proposed here give evidence that the SGC is even more than just "one in a million", as Freeland and Hurst put it.
Table 8 summarizes the comparison between the SGC and the best codes for each measure (WMS02, tMS0, and fMS). For each code, we evaluated WMS02, tMS0, and fMS, as well as the proportion of better random codes found within the respective category. Code 672469, the best code under mutational bias (WMS0), is also very efficient regarding translational errors (Pt = 0.00006), but with over 70% better codes found using the frameshift measure, it clearly does not minimize the effect of frameshift mutations at all. The same behavior can be observed for code 648040, the code that performs best under translational bias. It minimizes the effect of point mutations very efficiently, but 27% of the codes are more conservative concerning the effect of frameshift mutations. The best-performing code regarding the effect of frameshift mutations (code 218957) is the only one, aside from the SGC, that minimizes all three effects. It minimizes the effect of point mutations very efficiently and is still among the best 5% of codes regarding translational errors.
The SGC and the three most conservative codes are shown in Fig. 3. The amino acids are colored according to their respective polar requirement. The visual comparison of all four amino acid distributions shows that Goldman’s code and code 648040 superficially bear little similarity to the SGC. However, the pattern of the best fMS code (code 218957) closely resembles SGC’s PR distribution.
Freeland and Hurst had one reason to suspect that mistranslation bias was more important than mutational bias in the course of evolution: their single better code in terms of tMS showed behavior very similar to that of the natural code when tested under different transition weightings, while their best code under general transition bias was, relatively, two orders of magnitude less efficient than the natural code in terms of mistranslation. The results of this work, however, indicate that this hypothesis covers at best one of several aspects for which the SGC appears to be optimized. While code 648040, the best code in terms of tMS, also showed behavior similar to the natural code with increasing transition bias, code 672469, the best code under WMS0, is nearly as efficient as the SGC in minimizing the effect of translational errors (see Table 8).
Finally, some organisms (mostly viruses) have so-called overlapping reading frames (ORFs; Normark et al., 1983), i.e., a protein embedded in another protein's coding sequence but shifted to another reading frame. This might serve gene regulation purposes, but the mechanism is far from understood. Beyond the scope of this work, it would be very interesting to examine whether there are statistical features, e.g., in the codon usage, that might be useful for finding candidate regions that qualify as ORFs. There is also growing evidence of overlapping functions in species other than viruses, including mammals (Kovacs et al., 2010).
The impacts of point mutations, translational errors and frameshift mutations were investigated in this work. For all three deleterious mechanisms, the genetic code shows clear evidence of its capability to minimize their effects by conserving the polarity of the coded amino acids. The results show that the SGC is most efficient in minimizing the effect of translational errors: it outperforms more than 99.99% of one million randomly generated codes. This effect became even stronger for the combination of all three proposed measures, indicating that all three factors might have contributed independently to the evolution of this sophisticated, robust, and universal coding.
The analyses in this work assume that the codon assignments of the SGC reflect an adaptive outcome of natural selection for error minimization. It is therefore necessary to address two key vulnerabilities of this view of genetic code optimization. First, the assumed model by which mutations or mistranslations occur is backed by little empirical data; the quantification of mistranslational errors by Freeland and Hurst was rather crude, and it would be valuable to conduct further analyses incorporating better mistranslation data. Second, the quantification of average amino acid similarity is somewhat inaccurate. There are many reasons to expect similarity to be a multidimensional concept that is still not fully understood, and it may well be a partly relative phenomenon, depending on the precise amino acid sequence of a protein.
However, we do not expect the polar requirement to be the sole evolutionary force that shaped the SGC. There are many theories about how the code might have evolved and what its origin might have looked like (Higgs, 2009). But our results, in line with a large body of other work (Freeland et al., 2000), show that there is a clear non-random structure underlying the SGC, making it remarkably robust not only against point mutations but also against frameshift mutations.
Thus, our principal conclusion is that stability against frameshift mutations should be added to the list of features the SGC acquired in the course of evolution. We presented a theoretical framework to evaluate the SGC's efficiency under the assumption that all incoming nucleotides are equally likely. In fact, triplet usage and nucleotide frequencies are far from uniformly distributed in the exomes of almost all organisms. Hence, the consequent next step will be to take a closer look at biologically relevant data and to test how competitive the SGC is in real-data scenarios.
MATLAB binary file including the 1,000,000 randomly generated alternative codes used in this manuscript.
Two variables are included: AA_names, a cell array containing the names of the associated amino acids together with their three- and one-letter abbreviations, and random1, which contains the SGC in its first row and the 1,000,000 permutations in the following rows.