A 3-mRNA-based prognostic signature of survival in oral squamous cell carcinoma

Background Oral squamous cell carcinoma (OSCC) is the most common type of head and neck squamous cell carcinoma with an unsatisfactory prognosis. The aim of this study was to identify potential prognostic mRNA biomarkers of OSCC based on analysis of The Cancer Genome Atlas (TCGA). Methods Expression profiles and clinical data of OSCC patients were collected from TCGA database. Univariate Cox analysis and the least absolute shrinkage and selection operator Cox (LASSO Cox) regression were used to primarily screen prognostic biomarkers. Then multivariate Cox analysis was performed to build a prognostic model based on the selected prognostic mRNAs. Nomograms were generated to predict the individual’s overall survival at 3 and 5 years. The model performance was assessed by the time-dependent receiver operating characteristic (ROC) curve and calibration plot in both training cohort and validation cohort (GSE41613 from NCBI GEO databases). In addition, machine learning was used to assess the importance of risk factors of OSCC. Finally, in order to explore the potential mechanisms of OSCC, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was completed. Results Three mRNAs (CLEC3B, C6 and CLCN1) were finally identified as a prognostic biomarker pattern. The risk score was imputed as: (−0.38602 × expression level of CLEC3B) + (−0.20632 × expression level of CLCN1) + (0.31541 × expression level of C6). In the TCGA training cohort, the area under the curve (AUC) was 0.705 and 0.711 for 3- and 5-year survival, respectively. In the validation cohort, AUC was 0.718 and 0.717 for 3- and 5-year survival. A satisfactory agreement between predictive values and observation values was demonstrated by the calibration curve in the probabilities of 3- and 5- year survival in both cohorts. Furthermore, machine learning identified the 3-mRNA signature as the most important risk factor to survival of OSCC. Neuroactive ligand-receptor interaction was most enriched mostly in KEGG pathway analysis. Conclusion A 3-mRNA signature (CLEC3B, C6 and CLCN1) successfully predicted the survival of OSCC patients in both training and test cohort. In addition, this signature was an independent and the most important risk factor of OSCC.


INTRODUCTION
Oral squamous cell carcinoma (OSCC) is a common malignant tumor, which results in approximately over 600,000 new cases and 350,000 related deaths every year (Siegel, Miller & Jemal, 2017). Advances in treatments for OSCC have significantly improved the quality of life and further life expectancy of patients. However, the overall clinical outcomes of patients remain poor, especially among those diagnosed at advanced stages (Shimomura et al., 2018;Wang et al., 2017). Hence, exploring effective prognostic models is essential to predict overall survival (OS) of OSCC patients and guide the therapeutic process for clinicians. Present prognostic models of OSCC focus on multiple clinicopathological parameters like age, smoking status, and advanced clinical stage at diagnosis (Boxberg et al., 2018;Lee et al., 2010). However, cancers generally possess complex molecular regulation mechanisms and thus traditional predictive factors are limited by their efficiency, specificity, and consistency among patients (Shen et al., 2017a).
Recently, molecular biomarkers of diagnosis or prognosis, such as protein-coding genes and non-coding RNAs, have gained much attentions in oncology. Messenger RNAs (mRNAs) have been highlighted for their important roles in physiological and pathological processes as well as their potential predictive abilities (Feng et al., 2019;Tschirdewahn et al., 2019). Moreover, an integrated model composed of multiple genes has been shown to be more predictive than single clinic biomarker (Shao et al., 2018). An area under the curve (AUC) of 0.7 and above indicates relatively good prediction. For example, a 7-mRNA signature (AATF, APP, GNPDA1, HPRT1, LASP1, P4HA1 and ILF3) of head and neck squamous cell carcinoma (HNSCC) shows a moderate predictive ability for 5-year OS (AUC for training set, 0.75; testing set, 0.66) (Shen et al., 2017a). Another prognostic model of HNSCC based on 6-mRNA (FOXL2NB, PCOLCE2, SPINK6, ULBP2, KCNJ18 and RFPL1) also has a good performance of 5-year OS (AUC for training set, 0.766; testing set, 0.669) (Tian, Meng & Zhang, 2019). However, existing prognostic models for OSCC lacks excellent accuracy and a comprehensive assessment. A three-mRNA signature (PLAU, CLDN8 and CDKN2A) identified by Zhao et al. (2018b) performs with insufficient accuracy (AUC = 0.609) and this model was not assessed by calibration plot. Another prognostic signature was derived from seven DNA methylation CpG sites (AJAP1, SHANK2, FOXA2, MT1A, ZNF570, HOXC4, and HOXB4) as a potential reliable indicator for predicting survival in training cohort (AUC = 0.76), while in validation cohort the accuracy and stability of this model needs further improvements (AUC = 0.66 or 0.67) (Shen et al., 2017b). The current study therefore attempted to develop a better prognosis model with comprehensive evaluation.
The Cancer Genome Atlas (TCGA) is a public database with genome sequencing data on 33 tumor types, analysis of which has contributed a tremendous amount to the understanding of potential molecular mechanisms, diagnostic prediction and the prognosis of cancers . In this study we developed and validated a prognostic model with 3 mRNAs based on TCGA OSCC RNA-seq data, which showed good performance for 3-and 5-year overall survival (OS). Furthermore, this 3-mRNA signature was demonstrated as the most important and independent factor for risk of OSCC survival compared with other clinical features (age, gender, race, survival time, survival status, tumor grade, pathological stage and TNM stage).

Data source and differential expression analysis
Level 3 RNA-Seq data consisting of 328 OSCC tissues and 32 normal controls were downloaded from the TCGA data portal (up to December 8, 2018). Related clinical data (age, gender, race, survival time, survival status, tumor grade, pathological stage and TNM stage) were also obtained. All the data concerned in this study are publicly available from the United States National Cancer Institute (https://portal.gdc.cancer.gov/). Raw sequence data was originally derived from IlluminaHiSeq_RNASeq. We used the edgeR package to identify differentially expressed mRNAs (DEmRNAs) between OSCC and normal tissues. False discovery rate (FDR) was used for multiple testing correction of all p-values. Absolute log 2 fold change(FC) ≥ 2 and the FDR < 0.01 were used as cut-off criteria.

Construction and evaluation of mRNA prediction model
Only patients with complete information to build prediction model, including age, gender, race, survival time, survival status, pathological stage and TNM stage were included. Metastatic stages were not analyzed, since most data for this item were missing. After filtering available data, a total of 259 OSCC patients and 32 normal tissues were included in our analysis; their baseline information is shown in Table 1.
Univariate Cox analysis was used to estimate the association between the expression level of mRNAs and patient's OS with p < 0.05 set as the cut-off of statistical significance. Then, the least absolute shrinkage and selection operator (LASSO) method (Tibshirani, 1997) was used to further screen out prognostic mRNAs. Finally, stepwise multivariate Cox regression analysis was performed to construct a mRNA-derived prognostic model based on the Akaike information criterion (AIC). After calculating the risk score based on the formula of the chosen model, smooth curve fitting was competed to evaluate the relationship between risk score and survival . If a linear relationship between these measures was found, then the cut-off value was set as the median risk score. Otherwise, segmented regression model and likelihood ratio test were employed to find the threshold. The Kaplan Meier survival curve was completed to assess the survival rates of patients with high/low risk score.
A nomogram was generated to predict the individual's overall survival at 3 years and 5 years. The time-dependent receiver operating characteristic (ROC) curve and calibration plot were used to assess the performance of the nomogram.

Validation of the mRNA prediction model
Dataset GSE41613 with 97 OSCC patients was selected as the validation cohort from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/). The timedependent ROC curve and calibration plots were created to investigate whether the built model could effectively predict survival in OSCC.

3-mRNA signature as an independent and important prognostic factor
To verify that the mRNA signature was independent of other clinical characteristics, univariate and multivariate Cox regression analysis were implemented. First, univariate Cox analysis identified the clinical features that were related to OS of OSCC patients. Then multivariate Cox regression was used to explore whether the mRNA signature could be an independent indicator after adjusting other traits. Then machine learning (extreme gradient boosting (XGBoost)) analysis (Livne et al., 2018) was used to further identify the importance of the mRNA signature on the prognosis of OSCC compared with other clinical features. The parameters were set as following: booster=gbtree, objective=binary:logistic, max_depth=10, min_child_weight=6.23, subsample=0.75, colsample_bytree=0.99. The XGBoost model has been previous demonstrated to provide state-of-the-art results for a variety of medical applications and has received numerous awards in machine learning (Chen & Guestrin, 2016).

Functional enrichment analysis
DEmRNAs were identified by edgeR package based on the high/low risk score with a cutoff criteria of absolute log 2 FC ≥ 1 and the FDR < 0.05. Gene Ontology (GO) enrichment analysis was performed using the Database for Annotation Visualization and Integrated Discovery (DAVID) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways with clusterProfiler package. A p value < 0.05 was considered to represent statistical significance. All statistical analyses were performed using R and EmpowerStats software (http://www.empowerstats.com; X&Y Solutions, Inc, Boston, MA, USA).

DEmRNAs in OSCC
A total of 1996 DEmRNAs (707 up-regulated and 1289 down-regulated) were identified in this OSCC dataset (Table S1). The distribution of DEmRNAs between OSCC and normal controls are shown in Fig. 1. The expression heat map of DEmRNAs is presented in Fig.  S1 where red and green represents significantly upregulated and downregulated genes, respectively.

Construction and evaluation of a mRNA prediction model
A total of 120 DEmRNAs with potential prognostic value were identified by univariate Cox analysis, 50 remained after being filtered by LASSO ( Figs. 2A and 2B). Finally, 3 mRNAs (CLEC3B, CLCN1 and C6 ) were selected to construct prediction model by stepwise multivariate Cox regression analysis. The total risk score was imputed as follows: The result of smooth curve fitting shows a linear relationship between risk score and survival, thus the median of the risk score was viewed as cut-off value (high risk score: risk score >−3.03, low risk score: risk score <−3.03). Patients with a high risk score had a poor prognosis (Fig. 3). Nomograms of 3-and 5-year OS in the cohort are presented in Fig. 4. AUC of the time-dependent ROC curve was 0.705 and 0.711 for 3-and 5-year survival, respectively (Fig. 5). Remarkable, the calibration curve also demonstrated satisfactory agreement between predictive values and observation values in the probabilities of 3-and 5-year survival (Fig. 5).

External validation set and performance
The GSE41613 dataset including 97 OSCC patient records was chosen to validate the prognostic three genes signature. Time-dependent AUCs of risk scores were 0.718 and 0.717 for 3-and 5-year OS, respectively (Fig. 6). The calibration curve also shows satisfactory agreement between predictive values and observation values for the probabilities of 3-and 5-year OS in the cohort (Fig. 6).

Risk score of 3-mRNA signature as an independent indicator to predict OSCC prognosis
Cox analysis was performed to assess the independent prognostic value of the 3-mRNA signature and the results are shown in Table 2. According to the results from univariate analysis, the risk score, tumor grade, tumor size, lymphatic metastasis and pathological stage were significantly associated with OS in the TCGA cohort. After multivariate adjustment using the factors above, the risk score remained a powerful and independent prognostic factor (hazard ratio [HR] = 1.9, p < 0.001) in this cohort.
To further understand the importance of the risk score, machine learning analysis was carried out. The XGBoost model indicated a high performance in predicting OSCC patients' survival status (AUC = 0.927) (Fig. 7). In addition, the most important parameter for final survival status prediction was identified with this model as the risk score (Table 3 and Fig. 7).

Functional enrichment analysis
A total of 404 differentially expressed genes (218 upregulated genes and 186 downregulated genes) correlated with high/low risk score group were displayed (Fig. 8). To better understand the potential mechanism of processing different risk score groups, GO and KEGG enrichment analyses were performed. For ''biological processes (BP)'', the top five enriched terms were cellular protein metabolic process, negative regulation of endopeptidase activity, phospholipid efflux, retinoid metabolic process and cholesterol efflux. For the ''cellular component (CC)'' ontology, the top five were: extracellular space, extracellular region, blood microparticle, chylomicron and very-low-density lipoprotein particle. Finally, the five ''molecular function (MF)'' top terms were structural molecule activity, lipid binding, cholesterol transporter activity, phosphatidylcholine-sterol Oacyltransferase activator activity and lipid transporter activity (Figs. 9A, 9B and 9C and Table S2). Additionally, a total of 13 significantly enriched KEGG pathways are listed in Table S3, and the top 10 KEGG pathways are shown in Fig. 9D. A neuroactive ligand-receptor interaction was found to be the main associated pathway.

DISCUSSION
In the era of precision medicine, it is necessary to use accurate prognostic models to guide the clinical decision-making process and to design a more personalized therapeutic program for patients. However, the performance of existing OSCC prognostic models are limited. Thus, this study aimed to develop a better prognostic model. A 3-mRNA-based model (CLEC3B, CLCN1 and C6 ) of OSCC and it showed moderate predictive ability (training cohort: AUC = 0.705 and 0.711 for 3-and 5-year OS; validation cohort: AUC = 0.718 and 0.717 for 3-and 5-year OS). In addition, the 3-mRNA signature was an independent and the most important risk factor of the prognosis of OSCC. Three articles in the literature build a prognostic model of OSCC based on molecule. A 3-mRNA signature and a 3-lncRNA signature indicate an imperfect predictive ability (AUC < 0.7) (Zhao et al., 2018a;Zhao et al., 2018b). Although seven DNA methylation CpG sites are a potential reliable indicator for predicting OS (AUC = 0.76), its accuracy for validation requires improvement (AUC = 0.66 or 0.67) (Shen et al., 2017b). The 3-mRNA-based model constructed in the current study performed well in both training and validation cohorts (training cohort: 0.711; validation cohort: AUC = 0.717). Calibration is also an important indicator to assess prediction model; however, the three exiting models fail to perform calibration curve (Shen et al., 2017b;Zhao et al., 2018a;Zhao et al., 2018b).
To assess the consistency of our prediction model, calibration curves were implemented and the result of analysis indicated that our model performed well.
In addition, to test the importance of survival status prediction with this 3-mRNA signature, machine learning method (XGBoost) was used to rank clinical parameters and the 3-mRNA signature. XGBoost is a gradient tree boosting algorithm with best performance to solve the classification problems. (Ogunleye & Qing-Guo, 2019). In addition, XGBoost has the advantage of assessing of parameter importance by calculating as the cumulative average of the modality gain over all the constituent decision trees in the ensemble model   (Livne et al., 2018). The results of machine learning verified that 3-mRNA signature was the most important factor to OSCC survival with high accuracy (AUC = 0.927). This finding again demonstrated that the identified 3-mRNA signature could better predict the survival for OSCC patients and they may play important roles in the development of OSCC. CLEC3B, a member of the C-type lectin superfamily, is located on human chromosome 3p21.31. The function of CLEC3B depends on the location of the tumor. The gene acts as an oncogene in colorectal cancer because CLEC3B is secreted by cancer-associated fibroblasts promotes tumor cell migration (Zhu et al., 2019), whereas it could inhibit clear cell renal cell carcinoma proliferation via mitogen-activated protein kinase (MAPK) pathway (Liu et al., 2018). In our study, it may act as a tumor suppressor in that down-regulation of CLEC3B indicates a poor prognosis. CLEC3B expression has negative correlation with proliferation inducers and proliferative markers, while there is a positive correlation between CLEC3B and proliferation inhibitors, indicating CLEC3B lowers cell proliferation, and may explain why it may serve as a tumor suppressor in OSCC (Liu et al., 2018). In addition, tetranectin coded by CLEC3B is significantly down-regulation in metastatic OSCC (Arellano-Garcia et al., 2010), indicating that CLEC3B may suppress OSCC development. However, specific mechanism of CLEC3B needs to be further explored. Similarly, C6, complement 6, encodes protein which formats the membrane attack complex (MAC) with C5b, C7, C8 and C9. C6 is down-regulated in oesophageal carcinoma and its reduction may result in a reduction in local complement components and MAC and thus contribute to cancer resistance to the complement attack (Oka et al., 2001). When tumor cells acquire resistance to complement attack, the deposition of sublytic MAC on the cell membrane is increased. Sublytic doses of the MAC involved in various biological processes of tumor, such as the promotion of angiogenesis, proliferation, differentiation and the inhibition of apoptosis (Pio, Corrales & Lambris, 2014). That is a partial explanation why C6 was down-regulated in OSCC but its high expression indicated a poor prognosis.
CLCN1 is a member of chloride channel genes family, encoding voltage-gated chloride channel (CLC-1). Previous studies have indicated that CLC-1 determines up to 80% of the resting membrane conductance in skeletal muscles and the mutation of CLCN1 reduces chloride conductance and then result in myotonia congenita (Peng et al., 2018;Tsujino et al., 2011). However, the function of CLCN1 in tumors has not been studied. Its down-regulation was observed to indicate a poor prognosis. This shows that CLCN1 may serve as a tumor suppressor gene in OSCC which may be because chloride channels are involved in the regulation of cell cycle (Peretti et al., 2015).
GO enrichment analysis indicated that the 3-mRNA signature may be involved in lipid metabolism. Increasing studies have found that lipid metabolism plays an important role in tumor development, including tumor cell migration (Byon et al., 2009), proliferation andangiogenesis (Lu et al., 2017). In addition, lipid metabolism is associated with oxidative stress which also influences tumor progression (Zablocka-Slowinska et al., 2019). Neuroactive ligand-receptor interaction was found to be the most significant pathway in this study, which participates in multiple biological processes, such as apoptosis and cell proliferation (Zan & Li, 2019).
This study identified three valuable mRNAs that are associated with the prognosis of OSCC. The 3-mRNA signature showed satisfactory performance in both training and validation cohort, however, it is still needed to be further validated in cohort study. In addition, biological function of the three prognostic value genes needs to be explored.

CONCLUSION
A novel 3-mRNA signature (CLEC3B, C6 and CLCN1) successful predicted the survival of OSCC patients in both training and test cohort. In addition, the signature is the independent and the most important risk factor of OSCC.