Analysis of synonymous codon usage of transcriptome database in Rheum palmatum

Background Rheum palmatum is an endangered and important medicinal plant in Asian countries, especially in China. However, there is little knowledge about the codon usage bias for R. palmatum CDSs. In this project, codon usage bias was determined based on the R. palmatum 2,626 predicted CDSs from R. palmatum transcriptome. Methods In this study, all codon usage bias parameters and nucleotide compositions were calculated by Python script, Codon W, DNA Star, CUSP of EMBOSS. Results The average GC and GC3 content are 46.57% and 46.6%, respectively, the results suggested that there exists a little more AT than GC in the R. palmatum genes, and the codon bias of R. palmatum genes preferred to end with A/T. We concluded that the codon bias in R. palmatum was affect by nucleotide composition, mutation pressure, natural selection, gene expression levels, and the mutation pressure is the prominent factor. In addition, we figured out 28 optimal codons and most of them ended with A or U. The project here can offer important information for further studies on enhancing the gene expression using codon optimization in heterogeneous expression system, predicting the genetic and evolutionary mechanisms in R. palmatum.


INTRODUCTION
Codon degeneracy is a common phenomenon among living beings as the 20 different amino acids are encoded by 64 types of codons. Most amino acids encoded by two to six codons except for Met or Trp (Duret, 2002;Ang et al., 2016), this phenomenon is defined as synonymous codon. It has been reported that the using frequency of synonymous codon in encoding amino acids is non-random for different genes or genomes, which is regarded as codon usage bias (Romero, Zavala & Musto, 2000). The formation of codon bias can be influenced by multiple forces, such as natural selection, mutational pressure, and random genetic drift (Bentele et al., 2013). Moreover, the codon usage bias may affect gene transcription, protein expression and some other biological processes of protein expression (Behura & Severson, 2013;Powell & Dion, 2015). Performing the codon usage bias analysis and identifying the characteristics of codon usage bias for certain genes or genomes are very significant for comprehending the molecular mechanisms of gene expression and for understanding the long-term molecular evolution in genomes.
Rheum palmatum is an endangered and very important medicinal plant in China. The dried root and rhizomes of R. palmatum (named rhubarb) have been used to treat various diseases like infectious disease, cancer, constipation, and renal disorders in Asian countries (Zalucki, Power & Jennings, 2007). It has been reported that the main bioactive compounds of R. palmatum are anthraquinones (Jang & Kuk, 2018). R. palmatum is mainly distributed in northwest China, and the wild resources of R. palmatum are decreasing rapidly due to the overexploitation of natural resources (Zhou et al., 2014;Wang et al., 2016b). Therefore, it is beneficial for us to understand the anthraquinones biosynthetic pathways based on exploring the codon usage characteristics of R. palmatum and so to protect the wild resources of R. palmatum.
The genomic information of R. palmatum is not determined now. RNA-seq technology is an effective method to introduce transcriptome database, which can provide large numbers of coding sequences (CDS) with non-coding and repetitive sequences excluded. The transcriptome of R. palmatum was assembled with the RNA-seq method and 140,224 unigenes were screened out in our previous study. In this project, we carried out a series of analysis to determine the codon usage patterns of R. palmatum and pointed out the optimal codons. The results are intelligible for us to unify the molecular evolution of R. palmatum and provide some new perspectives to decipher gene function and carry out genetic engineering of R. palmatum in the future.

Database preparation
The transcriptome database of R. palmatum used in this project was obtained from our previous research work due to the lack of genomic information. In total, 140,224 annotated unigenes were obtained to analysis the coding DNA sequences (CDS) using the ESTScan (Iseli, Jongeneel & Bucher, 1999) and BLASTx online software (https://blast.ncbi.nlm.nih.gov/Blast.cgi). 14917 CDS were left after analysis by using ESTScan and BLASTx online software. Among these CDSs, the sequences contain correct initiation and termination codons were remained and sequences contain internal termination codons were eliminated by using SelectCDS script (Meier et al., 2017). In additional, only the sequences longer than 100 amino acids in length were hold for further analysis. Finally, 2,626 CDS were left for the downstream codon usage analysis.

Indices of codon usage and codon bias
GC3s is an important parameter for evaluating the frequency of guanine + cytosine at the third synonymously coding position, excluding Met, Trp, and termination codons. The value of RSCU (Relative Synonymous Codon Usage) was calculated by dividing the observed frequency of codon usage by that expected under the situation that all codons encoding the same amino acid are used in the same probability. When the value of RSCU is great than 1 suggests the positive codon bias, and RSCU is smaller than 1 indicates the negative codon bias, whereas the value of RSCU equals 1 that means the codons to be used randomly or equally (Zhang et al., 2017). The values of ENc (Effective Number of Codons) range from 20 to 61, that are used to estimate the codon bias for an individual gene. An ENc value equals to 20 indicates that the amino acid in genes encoded by only one codon, while an ENc value equals to 61 indicates that the absence of codon usage preference (Sharp & Li, 1987). The study has shown that when the ENc value smaller than 36, the gene is affirmed to own strong codon usage preference (Fuglsang, 2006). CAI (Codon Adaptation Index) was often used to evaluate the extent of bias toward codons that were known to be preferred in highly expressed genes. The values of CAI range from 0 to 1.0, and the bigger the value is meaning the most frequently used synonymous codon (Fuglsang, 2008). CAI is calculated using the CAI calculate server (http://genomes.urv.es/CAIcal/).

Neutrality plot analysis
The neutrality plot (GC12 vs GC3) can demonstrate the balance between mutation and selection in shaping codon bias. GC12 represents the average ratio of GC content in the first and second position of the codons (GC1 and GC2), while GC3 stands for the GC content in the third position. If we found a statistically strong correlation between GC12 and GC3, we can suggest that the dominant driving force for the R. palmatum is mutation pressure. Conversely, if there is no correlation between GC12 and GC3, we can see that the dominant driving force for the R. palmatum is natural selection (Sharp & Li, 1987;Sueoka, 1988).

ENc-GC3s plot and PR2-Bias plot analysis
The ENC-GC3s (ENC vs GC3s) plot is usually carried out to exam the codon usage of a certain gene is only affect by mutation or also by other terms such as natural selection. The ENc-GC3s plot is constructed by the abscissa GC3s values and the ordinate ENC values. Furthermore, we calculate an expected curve on the ENc-GC3s plot. If we discover that the corresponding points distribute around expectation curve, we can conclude that the mutation pressure is the independent force in the progress of forming codon bias. If the corresponding points is significantly blow or far from the expected curve, there must be some other factors such as natural selection plays a key role in the formation of codon bias (Sueoka, 1988).

Correspondence analysis of codon usage
Correspondence analysis (COA) is a widely accepted approach to determine the multivariate statistical analysis of codon usage patterns. As the genes possess 59 sense codons (61 sense codons in all, but exclude the unique Met and Trp codons), we put all the genes into a 59-dimensional hyperspace in the plot. The method can explore the major

Determination of optimal codons
We get the top and bottom 5% of genes based on the CAI values as the high and low dataset, respectively. Then we get the mean RSCU values of the two gene groups, respectively. Finally, we carried out the Chi-squared contingency test to confirm the optimal codons. The usage frequency of certain codons that was remarkably higher in genes with high expression levels than that genes with low expression levels (p < 0.01) were regarded as optimal codons (Sau et al., 2005).

Nucleotide composition of R. palmatum and codon bias
We performed the nucleotide composition of 2,626 CDS in R. palmatum as the nucleotide content can affect the codon bias seriously (Nie et al., 2014). Among the 2,626 CDS in R. palmatum, the nucleotide content of A varies from 1.58% to 58.90%, with a mean value of 30.42%, the frequency of T is 5.89%-43.89%, with an average value of 23.48%, the proportion of G is 6.52%-52.44%, with a mean value of 21.03%, the ratio of C is 4.49%-53.27%, with a mean value of 24.46%. To further understand the impact of nucleotide contents of R. palmatum genes on codon bias, we also calculated the GC contents and GC3. The results demonstrated that the GC contents of all genes ranged between 40% and 50% (Fig. 1). The mean value of GC proportion was 46.57%, which indicated that the genes in R. palmatum are richer in AT contents than that of GC. The ratio of GC1, GC2 and GC3 was 51.33%, 41.38 and 46.60, respectively, which testified that the proportion of GC1 was highest, and the contents of GC2 was very similar to that of GC3. We further performed the relationship analysis between GC12 and GC3 using the neutrality plots (GC12 against GC3) method. The content value of GC3 varied from 21% to 89%. The results shown that the slopes of the regression lines (regression coefficient) were 0.171 (Fig. 2), verifying that the dominant force of the codon bias was natural selection in the RNA-seq data of R. palmatum, rather than mutation pressure.

The effect of RSCU and ENC on codon bias
The ENC values in R. palmatum range from 30.83 to 61.00, with a mean value of 52.83. From the ENC values we can see that only 9 genes show a high codon bias with an ENC value smaller than 35. The results suggested that there is a common random codon usage in R. palmatum genes, without a strong codon bias preference. In addition, the RSCU values of 59 sense codons also indicates that the genes in R. palmatum are with a weak codon bias preference. As shown in the Table 1, more than half of the codons (30/59), marked in bold, are used frequently.

The role of GC3s in the codon bias formation
We performed the ENC-plot here to distinguish the influence of GC3s in shaping codon bias of R. palmatum. As we can see from the Fig. 3, most of the R. palmatum genes were distributed far from the expected ENC-plot curve, only a few genes are were distributed around this curve. The result suggested that the mutational pressure is not the only factor in shaping the codon bias, other factors such as translational selection may play an important role in the formation of codon bias. Further we calculated (ENCexp-ENCobs)/ENCexp for all the genes in R. palmatum to discriminate the difference between observed and expected ENC values (Sinclair & Choy, 2002). As shown in Fig. 4, (ENCexp-ENCobs)/ENCexp values for most genes were within

Correspondence analysis
We carried out the correspondence analysis based on the RSCU values of all genes from R. palmatum. The result has shown that the Axis 1 and Axis 2 were two main contributors which contribute 11.53% and 6.38% of the total variance, respectively. As shown in Fig. 5, the position of the genes on the plane defined by the first two axes. Moreover, to check the role of GC content in shaping codon bias, the GC content of genes was higher than 60% marked with purple color, the GC content was between 45% and 60% marked with blue color, the GC content was below 45% marked with yellow color, respectively. As shown in Fig. 5, the red dots were separated along the primary axis, the blue and green dots located in the middle of the plot. In addition, we have calculated the correlations among the six important parameters including Axis 1, GCall, GC3, GCall, ENC and CAI ( Table 2). The parameter Axis 1 exhibited significant correlations with other four parameters such as GC12, GC3, GCall and CAI (r = 0.458, r = 0.931, r = 0.687, r = 0.260; p < 0.001), indicating that the codon bias of R. palmatum genes were influenced by two main factors including mutational pressure and translational selection.

C/T to A/G balance analysis in third codon
The previous studies about the effect of mutation pressure on codon bias have been reported that AU and GC come in pairs at third codon positions (Wright, 1990). However, our results suggest that the frequency of A3 and U3 (or G3 and C3) are different in R. palmatum genes. Estimating the proportion of GC and AT pairs in genes can offer more details about the effect of the forces on the codon bias formation. We carried out a Parity Rule 2 (PR2) plot analysis to determine whether there exist biases R. palmatum genes. As shown in Fig. 6, most of the points were distributed between 0.2 and 0.8 in the plot which suggesting that there is a low bias in either G3/C3 or A3/T3 in R. palmatum. In addition, the plot was divided into four quadrants taking 0.5 as the center on both axes. We found that there are more points in the fourth quadrant (the ratio of G3/GC3 and A3/AT3 > 0.5) than other three quadrants. The second quadrant contained the fewest points. All the results obtained above suggested that there is a slight significant preference for A and G at the third codon position of the R. palmatum genes. Therefore, there were some other forces like translational selection playing roles in the formation of codon bias.

Effects of gene expression level
To determine the effect of gene expression level in shaping codon bias, we performed correlation coefficients analysis between the codon adaptation index (CAI) values and index values of the genes including ENC, GC3, GC12 and GCall content (Guan et al., 2018). As shown in Fig. 7 and Table 2, the index CAI values indicated a significantly negative correlation (r = −0.054, p < 0.001) with the ENC values, and showed a remarkable positive correlation with GC3 (r = 0.232, p < 0.001), GC12 (r = 0.059, p < 0.001) and GCall (r = 0.176, p < 0.001) content.

Determination of optimal codons for R. palmatum
We carried out a two-way Chi-squared contingency test to compare the codon usage among different genes. The highly-and lowly-expressed data of the genes based on average RSCU values were list in the Table 3. As shown in Table 3, 28 optimal codons were figured out and most of the optimal codons ended with A or U, except UUG for Leu. All the amino acids were coded by different codons such as Leu possess four codons and Ser possess three codons, Ile, Val, Arg, Ala, Thr and Pro all identified by two codons and others amino acids were all defined by only one codon. The results above indicated that the R. palmatum genes were preferred to A/U-ending synonymous codons which was inconsistent with Triticum aestivum (Zhang et al., 2007), Oryza sativa (Liu et al., 2004) and Zea mays (Liu et al., 2010) that preferred to G/C-ending synonymous. Our result was consistent with the study of codon usage patterns in Lonicera macranthoides which was also biased to A/U-ending synonymous codons (Liu et al., 2019).

DISCUSSION
A lot of theories have supposed to clarify the origin of codon usage bias. The two main theories are neutral theory and the ''selection-mutation-drift'' model (Sharp & Li, 1986;Bulmer, 1988). Based on the neutral theory, the mutations occur in coding positions must be mutate neutral, thus lead to random selection of synonymous codons, while according to the ''selection-mutation-drift'' pattern, codon usage bias only informed the balance between selection favoring optimal codons and mutation-drift allowing persistence of non-optimal codons (Sharp & Li, 1986). In the genes with high expression levels, the selection makes a significant role in shaping the codon usage bias, while in the genes with low expression levels mutation-drift plays important role in determining codon usage (Kliman & Hey, 1994). But with the appearance of genome information of more species, it seems that these two theories are not enough to prove the characteristics of codon usage anymore. For example, in Oryza sativa codon usage is the result of nucleotide composition and expression level of each gene, as well as CDS length (Liu et al., 2004). In our study, the factors involved in shaping R. palmatum codon usage bias include the GC content, expression level of genes and natural selection as well as mutation pressure.
A previous study has found that there was no significant correlation between codon usage bias and gene expression level in mammals, which was inconsistent with our result that the codon bias in R. palmatum genes was also influenced by gene expression level (Karlin & Mrázek, 1996). But in rice, the genes with high expression levels always have strong variation in codon usage that is consistent with our results which indicates that the highly expressed genes are preferred to GC-rich in codon usage (Liu et al., 2004).
Previous studies have found that codon usage bias is not affected by nucleotide composition in Chlamydomonas reinhardtii (Naya et al., 2001) and Echinococcus spp. (Fernandez, Zavala & Musto, 2001) genomes that GC-rich in genomes. In R. palmatum, there is clear heterogeneity of codon usage among genes: R. palmatum favored the A/Uending codons. Some previous studies have reported that the genes from dicot plant prefer the A/U-ending codons such as Malus domestica (Van Hemert & Berkhout, 2016), Myrica rubra (Li et al., 2016) and Lonicera macranthoides (Feng et al., 2013). The plant R. palmatum is a kind of dicot plant in general and our result was consistent with the data reported that shows an A/U-ending codons preference. Moreover, the ratio of A/U-ended codons and G/C-ended codons for the most frequently used codons is 22:8. The result was consistent with that of nucleotide composition above and the phenomenon was also similar to other species with richer AT contents, namely Kluyveromyces lactis, Saccharomyces cerevisiae and Pichia pastoris (Peixoto, Fernandez & Musto, 2004;Liu et al., 2019).

CONCLUSIONS
Codon usage bias is a common and a complicated natural phenomenon for various kinds of living beings (Banerjee & Roy, 2009). We can illuminate the regular patterns of evolution, find out some new genes, optimize heterogeneous expression system through the overall codon usage bias analysis (Galtier et al., 2018). With the rapid development of high throughput sequencing, the analysis of codon usage bias pattern based on genome and transcriptome data is increasing rapidly (Angov, 2011;Goodman, Church & Kosuri, 2013). Such analysis based on big data is very useful to better understand evolutionary mechanisms of species and translation selection force in shaping codon usage bias.
In our project, we carried out the codon usage bias analysis of R. palmatum genes based on the transcriptome data. The GC content in R. palmatum CDS was 46.57%, which indicated that the CDSs of R. palmatum were slightly AT rich. Furthermore, the optimal codons analysis suggested that the R. palmatum CDS were preferred to A/U-ending synonymous codons. All these results clarified that the nucleotide composition of R. palmatum plays a significant role in shaping codon bias. Meanwhile, codon bias in R. palmatum genes was also influenced by gene expression level. In addition, 28 optimal codons were figured out and most of the optimal codons ended with A or U, except UUG for Leu. After a series of analyses, the codon usage bias in R. palmatum is influenced by nucleotide composition, natural selection, mutation pressure, and gene expression level.
In conclusion, our data offers new perspectives for the codon usage pattern in R. palmatum and has made a firm foundation for the gene engineering in R. palmatum.

ADDITIONAL INFORMATION AND DECLARATIONS
Funding