COVID-19 is an ongoing global pandemic characterized by long-term respiratory system damage in patients, and is caused by the SARS-CoV-2 betacoronavirus. It is likely of zoonotic origin, but capable of human-to-human transmission, and since the first observed cases in the Wuhan province of China (Chan et al., 2020; Riou & Althaus, 2020), it has infected over 14 million people, with 612,054 recorded deaths (as of 22 July 2020). In addition to its immediate effects on the respiratory system, its long term effects are still being researched, including symptoms such as neuroinvasion (Li, Bai & Hashikawa, 2020; Wu et al., 2020), cardiovascular complications (Kochi et al., 2020; Zhu et al., 2020), and gastrointestinal and liver damage (Lee, Huo & Huang, 2020; Xu et al., 2020). Due to its high transmissibility, and capacity for asymptomatic transmission (Wong et al., 2020), study of COVID-19 and its underlying pathogen remain a high priority. As a result, the high amount of frequently updated data on viral genomes on databases such as GISAID (Elbe & Buckland-Merrett, 2017) and NextStrain (Hadfield et al., 2018) provides researchers with invaluable resources to track the evolution of the virus as it spreads across the world.
SARS-CoV-2 has a linear, single-stranded RNA genome, and does not depend on host proteins for genomic replication, instead using an RNA synthesis complex formed from nonstructural proteins (nsp) coded by its own genome. Four of the key proteins involved in the complex are nsp7, nsp8, nsp12, and nsp14, all of which are formed from cleavage of the polyprotein Orf1ab into mature peptides. Nsp12, also known as RdRp (RNA-dependent RNA polymerase), is responsible for synthesizing new strands of RNA using the viral genome as a template. Nsp7 and nsp8 act as essential co-factors for the polymerase unit, together creating the core polymerase complex (Kirchdoerfer & Ward, 2019; Peng et al., 2020), while nsp14 is an exonuclease which provides error-correcting capability to the RNA synthesis complex, therefore allowing the SARS-CoV-2 to maintain its large size genome (Subissi et al., 2014; Ma et al., 2015; Ogando et al., 2019; Romano et al., 2020). Owing to their role in maintaining replication fidelity and directly affecting the mutation-selection equilibrium of RNA viruses, these proteins are key targets of study in understanding the mutation accumulation and adaptive evolution of the virus (Eckerle et al., 2010; Peng et al., 2020).
In our previous study, we examined the top 10 most frequent mutations in the SARS-CoV-2 nsp12, and identified that four of them are associated with an increase in mutation density in two genes, the membrane glycoprotein (M) and the envelope glycoprotein (E) (the combination of which is hereafter referred to as MoE, as we previously described), which are under less selective pressure, and mutations in these genes are potential markers of reduced replication fidelity (Eskier et al., 2020a). In this study, we follow up on our previous findings and analyze the mutations in nsps 7, 8 and 14, in addition to nsp12, to identify whether the mutations are associated with a nonselective increase in mutation load or not. We then examine whole genome mutation densities in mutant isolates in comparison to wildtype isolates using linear regression models, in order to understand whether the mutations are associated with potential functional impact. Our findings indicate that mutations in nsp14 are most likely to be predictors of accelerated mutation load increase.
Materials and Methods
Genome sequence filtering, retrieval and preprocessing
As previously described (Eskier et al., 2020a), SARS-CoV-2 isolate genome sequences and the corresponding metadata were obtained from the GISAID EpiCoV database (date of accession: 17 June 2020). We applied further quality filters, including selecting only isolates obtained from human hosts (excluding environmental samples and animal hosts), those sequenced for the full length of the genome (sequence size of 29 kb or greater), and those with high coverage for the reference genome (<1% N content, <0.05% unique mutations, no unverified indel mutations). To ensure alignment accuracy, all nonstandard unverified nucleotide masking was changed to N due to the specifications of the alignment software, using the Linux sed command, and the isolates were aligned against the SARS-CoV-2 reference genome (NC_045512.2) using the MAFFT (v7.450) alignment software (Katoh et al., 2002), using the parameters outlined in the software manual for aligning closely related viral genomes (available at https://mafft.cbrc.jp/alignment/software/closelyrelatedviralgenomes.html). Variant sites in the isolates were annotated using snp-sites (2.5.1), bcftools (1.10.2), and ANNOVAR (release date 24 October 2019) software (Wang, Li & Hakonarson, 2010; Page et al., 2016), to identify whether a given mutation was synonymous or nonsynonymous. In addition, the 5′ untranslated region of the genome (bases 1–265) and the 100 nucleotides at the 3′ end were removed from the alignment and annotation files due to a high number of gaps and unidentified nucleotides. We further removed any sequences with incomplete sequencing location or date data in order to avoid complications in downstream analyses. Following the filters, 29,600 genomes were used for the analyses.
Mutation density calculation
Variants were categorized as synonymous and nonsynonymous following annotation by ANNOVAR, with intergenic or terminal mutations being considered synonymous. Gene mutation densities were calculated separately for synonymous and nonsynonymous mutations, as well as the total of SNVs, for each isolate, using a non-reference nucleotides per kilobase of region metric. Mutation densities were calculated for the combined membrane glycoprotein (M) and envelope glycoprotein (E) genes (MoE), the surface glycoprotein gene (S) and the whole genome.
Descriptive statistics for continuous variable days were calculated with mean, standard deviation, median, and interquartile range. Kolmogorov–Smirnov test was used to check the normality assumption of the continuous variables. In cases of non-normally distributed data, the Wilcoxon rank-sum (Mann–Whitney U) test was performed to determine whether the difference between the two MoE status groups was statistically significant. The Fisher’s exact test and the Pearson chi-square test were used for the analysis of categorical variables. The univariate logistic regression method was utilized to assess the mutations associated with MoE status in single variables, and then multiple logistic regression method was performed. The final multiple logistic regression model was executed with the backward stepwise method. The relationship between mutation density and time in isolates with mutations of interest, as well as in the group comprising all isolates, was examined via non-polynomial linear regression model and Spearman’s rank correlation. A p-value of less than 0.05 was considered statistically significant. All statistical analyses were performed using IBM SPSS version 25.0 (Chicago, IL, USA).
Results and Discussion
Increases in the mutation load of SARS-CoV-2 are unevenly distributed across its genome
To identify the trends in SARS-CoV-2 mutation load over time, we calculated the average mutation density per day for all isolates for whole genome, S gene and MoE regions, capping outliers at the 95th and 5th percentile values to minimize the potential effects of sequencing errors (Fig. 1). Our results show a very strong positive correlation between average mutation density and time, in both the genome level and the S gene. In comparison, MoE has a weak positive correlation, with a wider spread of mean density in early and late periods compared to the genome and the S gene. This is consistent with reduced selective pressure on the M and E genes, as has previously been described (Dilucca et al., 2020b). The top nonsynonymous mutation is 23403A>G (in 22,271 isolates), responsible for the D614G substitution in the spike protein, followed by the 14408C>T mutation (in 22,226 isolates) in the nsp12 region of the Orf1ab gene, causing P323L substitution in the RdRp protein, and the 28144C>T mutation (in 3,081 isolates), responsible for the L84S substitution in the Orf8 protein. The most common synonymous mutation is the 8782C>T mutation (in 3,047 isolates), and is found on the nsp4 coding region of the Orf1ab gene. For the S gene, the most frequent synonymous mutation is the 23731C>T mutation (in 622 isolates), and the second most common nonsynonymous mutation, after the aforementioned D614G mutation, is 25350C>T (in 215 isolates), responsible for the P1263L substitution. For MoE, the most common synonymous and nonsynonymous mutations are 26735C>T (in 341 isolates) and 27046C>T (in 530 isolates), respectively, both of which are found in the M gene, and the latter of which causes T175M amino acid substitution. Other than the D614G mutation, all of the mentioned mutations are C>T substitutions, the prevalence of which in T- or A-rich regions of the SARS-CoV-2 genome have been previously documented (Simmonds, 2020).
Mutations in RNA synthesis complex proteins are associated with higher mutation load
After identifying the increase in mutation load over time, which was more prominent in genes with high functional impact (S, Orf1ab) compared to other structural genes (M, E and N), as seen in Fig. 1 and Figs. S1 and S2, we sought to examine possible associations of variants in proteins involved in SARS-CoV-2 genome replication with the increase. We first identified the five most frequently observed mutations for nsps 7, 8, 12 (also known as RdRp) and 14, four of the proteins cleaved from the Orf1ab polyprotein and are involved in the RNA polymerization, followed by analyzing the association of each mutation with the presence of MoE mutations (hereafter referred to as MoE status) using the chi-square test. A total of 12 out of the 20 mutations were found to have a significant association with MoE status (p-value < 0.05) (Table 1). Compared to our previous findings on the top 10 nsp12 mutations (Eskier et al. 2020a), which was based on an analysis of 11,208 samples as of 5 May 2020, 13536C>T and 13862C>T have increased in rank of appearance, from 6th and 7th to 4th and 5th, respectively, and decreased in p-value to show statistically significant associations. In addition, the 13730C>T mutation have increased in rank of appearance from 4th to 3rd. Out of the other nsps tested, nsp14 was found to have four significant mutations, while nsp7 had two and nsp8 had one.
|NSP||Mutations||Values||MoE absent||MoE present||Total||p|
Effects of geographical location on MoE status
In addition to time and genotype, we also examined the potential association between the location of isolates and MoE status as a possible confounding factor. We first examined whether there is a significant association between location, defined here as continent the isolate was originally obtained, and MoE status. Our results indicate that there is a strong association between location and MoE status, with the highest percentage of MoE present isolates in Asia (14.5%), and the percentage ratio in South America (6.5%) (p-value < 0.001). In comparison to our previous findings, South America had a dramatic decrease in MoE present isolate percentage, likely as a result of the increased sequencing efforts (from 118 isolates to 416) removing potential sampling biases or localized founder effects. Africa, Asia, and North America had an increase in MoE present proportion, while Europe, Oceania, and South America showed lowered percentages (Table 2).
|Locations||MoE absent||MoE present||Total||p|
After observing the potential confounding effect of location on MoE status, we sought to understand whether a location is more or less likely to predict MoE status, using a logistic regression model (Table 3). Comparing each individual region (1) to the other five (0), we found that Asia, Europe and North and South America are all possible predictors of MoE status (p-value < 0.05), with Asia and Europe 1.697 and 1.184 times as likely to be MoE present as the other regions, and North and South America 0.589 and 0.650 times as likely, respectively.
OR, odds-ratio; CI, confidence interval.
Using these findings, we created different logistic regression models to identify which of the 12 mutations are likely to be independent predictors of MoE status (Table 4). In the single variable model, all 12 mutations we previously identified and location were found to be potential predictors (p-value < 0.05). Forming final models including the 12 mutations (Final Model A) and the mutations as well as locations (Final Model B), we observed that the predictor effect of two of the mutations nsp8 12478G>A and nsp14 18998C>T do not appear to be sufficiently independent of the other mutations in Final Model A. After adding the location variable to the Final Model A, location remains a significant predictor, with all five non-reference locations less likely to predict MoE than Asia, the reference location, and nsp12 14805C>T is found to not have a predictor effect independent of location (p-value = 0.073). Following Final Model B, nine mutations appear to have a significant association with MoE status, independent of other variables: 11916C>T, 12073C>T, 13536C>T, 13730C>T, 13862C>T, 14408C>T, 18060C>T, 18736T>C and 18877C>T (p-value < 0.05).
|Mutations||Single variables||Final model A||Final model B|
|p||OR||95% CI||p||OR||95% CI||p||OR||95% CI|
OR, odds-ratio; CI, confidence interval; Multiple logistic regression final model was executed on all these statistically significant variables, included together in the model, and selected with the backward stepwise method.
Nsp14 mutations have significant impact on increased genomic mutation density
We then examined the effects of each mutation on genomic mutation density to see whether the relationship between the mutations and MoE status are indicative of a genome-wide trend. Due to selection potentially effecting nonsynonymous mutations differentially, we separated the mutations in the two categories and calculated mutation density separately for each category. Our results show that nsp14 mutations show the most consistent association with mutations between MoE and the whole genome. All three nsp14 mutations (18060C>T, 18736T>C and 18877C>T) which have a significant association with MoE status also show a similar relationship with genomic mutation density (Fig. 2). 18060C>T (L7L) has the lowest odds ratio for MoE status (Table 4), and while it shows a slower increase in synonymous mutation density compared to wildtype isolates (Fig. 2A), it has a significant impact on faster mutation density increase in nonsynonymous mutations (Fig. 2B). In comparison, 18877C>T (L270L) (Figs. 2C and 2D) and 18736T>C (F233L) (Figs. 2E and 2F) both show a high prediction capacity for MoE and an increased mutation density. In comparison, mutations in nsp7 (Figs. S3 and S4) and nsp12 (Figs. S5–S8) show much lower impact on altered mutation density increase rate. 12073C>T, an nsp7 mutation, displays high divergence from wildtype isolate patterns; however, its low sample size (n = 16) creates a skewed distribution of isolates across time, complicating any potential inference.
Our previous work identified RdRp mutations as contributors to the evolution of the SARS-CoV-2 genome and this study confirmed those findings. Furthermore, we hypothesized that mutations of the other critical components of the viral replication and transcription machinery may have similar effects. Our results implicate nsp14 as a source of increased mutation rate in SARS-CoV-2 genomes. Three of the five most common nsp14 mutations, namely 18060C>T, 18736T>C and 18877C>T are associated with increases in both genome-wide mutational load, as well as MoE status, an alternative indicator of mutational rate and virus evolution. Interestingly all three are located within the ExoN domain, which is responsible for the proofreading activity of nsp14; however, only 18736T>C mutation is non-synonymous (F233L), while 18060C>T and 18877C>T are synonymous mutations and therefore, only after functional studies it will be possible to understand their effects on viral replication processes.
The origins and fates of the three nsp14 mutations are also quite different: Being present in the first case detected in the Washington state of the US in mid-January, 18060C>T mutation has been almost completely confined to the US, as 1,657 of 2,007 isolates (82.6%) originate from the US (https://bigd.big.ac.cn/ncov/variation/annotation/variant/18060, accessed 6 September 2020). On the other hand, 18877C>T mutation arising around at the end of January likely in Saudi Arabia and being detected in much less cases (n = 893), is still present in many isolates, most frequently in Saudi Arabia (54.1%) and Turkey (37.4%). 18736T>C mutation was first detected in the US at the beginning of March and like the 18060C>T mutation, has almost completely been limited to the US (281/362 or 77.6%). Unlike the other two, this mutation has been detected in only two isolates since 27 May, and not after 1 July 2020. However, it should be noted that 18877C>T mutation arose within the dominant 23403A>G/14408C>T lineage, while the other two nsp14 mutations are in different lineages. Therefore, dominance or disappearance of different nsp14 mutations may have less to do with these particular mutations and more with the co-mutations. Yet, we cannot rule out possible effects of these nsp14 mutations on the fitness of SARS-CoV-2.
Previous studies on alphacoronavirus nsp14 protein had shown that nsp14, via its exonuclease activity, can modulate host-virus interactions, degrading double-stranded RNA produced during genome replication to suppress immune response, thus increasing viral viability (Becares et al., 2016). SARS-CoV-2 nsp14, due to similar exonuclease activity, is therefore a potential modulator of host interactions, independent of its link to increased mutation load. However, the exact effects of the mutations we identified, two of which are synonymous and may only indirectly affect protein structure, have to be studied experimentally to show any possible changes in viral property that they might affect. Of note, a recent study where codon usage of SARS-CoV-2 was analyzed in terms of temporal evolution of the virus genome revealed that nsp14 is one of three genes (together with S and N genes) that display the highest Codon Adaptation Index (CAI) values (Dilucca et al., 2020a). CAI is a measure of optimal codon usage and indicates how well codons adapt to the host. Based on higher CAI values in nsp14, one could speculate that such mutations have been accumulating preferentially to reach the optimal mutation rate that allows the most advantageous mutation-selection equilibrium for SARS-CoV-2. Indeed, our previous results (Eskier et al., 2020b) indicated that the mutation densities of SARS-CoV-2 genomes are closely related to the pandemic stage and population dynamics directly affects the average mutational load of the viral genome. During the rapid growth stages, such as those observed in March in the UK and the US, replication fidelity can be traded off to gain higher replication rates and broader mutational diversity. However, mutations in the replication machinery that result in too high mutation rates would likely be detrimental and eliminated. On the other hand, a small percentage of the resulting mutations could possibly be advantageous, including those that could confer resistance to antiviral drugs. So far, we or others have not been able to detect such mutations advantageous for the virus, however, higher mutation rates make appearance of such a mutation more likely.
We believe that the mutations discussed in this study can be of help to future studies, in both fighting the COVID-19 pandemic, and better understanding of how mutations in coronavirus replication proteins can affect viral viability and replication fidelity in hosts. Also, it is yet to be determined whether COVID-19 cases infected with SARS-CoV-2 that has mutation(s) that are associated with higher mutation rate respond better to nucleoside analogs, such as remdesivir or ribavirin.