Identification and validation of a three-gene signature as a candidate prognostic biomarker for lower grade glioma

Background Lower grade glioma (LGG) are a heterogeneous tumor that may develop into high-grade malignant glioma seriously shortens patient survival time. The clinical prognostic biomarker of lower-grade glioma is still lacking. The aim of our study is to explore novel biomarkers for LGG that contribute to distinguish potential malignancy in low-grade glioma, to guide clinical adoption of more rational and effective treatments. Methods The RNA-seq data for LGG was downloaded from UCSC Xena and the Chinese Glioma Genome Atlas (CGGA). By a robust likelihood-based survival model, least absolute shrinkage and selection operator regression and multivariate Cox regression analysis, we developed a three-gene signature and established a risk score to predict the prognosis of patient with LGG. The three-gene signature was an independent survival predictor compared to other clinical parameters. Based on the signature related risk score system, stratified survival analysis was performed in patients with different age group, gender and pathologic grade. The prognostic signature was validated in the CGGA dataset. Finally, weighted gene co-expression network analysis (WGCNA) was carried out to find the co-expression genes related to the member of the signature and enrichment analysis of the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway were conducted for those co-expression network. To prove the efficiency of the model, time-dependent receiver operating characteristic curves of our model and other models are constructed. Results In this study, a three-gene signature (WEE1, CRTAC1, SEMA4G) was constructed. Based on the model, the risk score of each patient was calculated with LGG (low-risk vs. high-risk, hazard ratio (HR) = 0.198 (95% CI [0.120–0.325])) and patients in the high-risk group had significantly poorer survival results than those in the low-risk group. Furthermore, the model was validated in the CGGA dataset. Lastly, by WGCNA, we constructed the co-expression network of the three genes and conducted the enrichment of GO and KEGG. Our study identified a three-gene model that showed satisfactory performance in predicting the 1-, 3- and 5-year survival of LGG patients compared to other models and may be a promising independent biomarker of LGG.


INTRODUCTION
With the development of sequencing and bioinformatics technologies, accumulating studies have revealed that different patients may be similar in glioma grade but differ greatly in molecular characteristics, clinical prognosis and treatment response. So many central nervous system tumors were named according to molecular parameters and histopathologic diagnosis, especially gliomas, ependymomas and medulloblastomas in the 2016 revision of the WHO classification (Zhang et al., 2019b). As we know, some molecular markers, such as MGMT (O6-methylguanine DNA methyltransferase) (Binabaj et al., 2018), isocitrate dehydrogenase (IDH) (Kwon et al., 2019), epidermal growth factor receptor (EGFR) (Chistiakov, Chekhonin & Chekhonin, 2017) and phosphatase and tensin homolog (PTEN) (Koshiyama et al., 2017) that have contributed to personalized therapeutic approaches and targeted anti-glioblastoma therapies have been routinely tested in glioblastoma patients clinically (Yin et al., 2019). However, there are few specific clinical indicators and therapeutic targets for LGGs compared to glioblastoma, so there is an urgent need to elucidate the mechanism of glioma development and progression, which can provide potential treatment targets for LGGs.
In this study, gene expression RNAseq data and corresponding clinical information of LGG patients were downloaded from UCSC Xena (https://xenabrowser.net/hub/) and Chinese Glioma Genome Atlas (CGGA, http://www.cgga.org.cn/). By analyzing data from UCSC Xena using a robust likelihood-based survival model and Cox regression, we developed a three-gene signature that provides effective survival risk stratification of patients with LGG and validated the signature in the CGGA dataset. These results demonstrate the potential of the three-gene signature for survival prediction of patients with LGG and provide new potential molecular treatment targets for LGGs.

Dataset of patients with LGG
The LGGs RNA sequencing (RNAseq) data and corresponding clinical information were downloaded from The Cancer Genome Atlas (TCGA) hub by the University of California, Santa Cruz, Xena browser (https://xenabrowser.net/hub/) and CGGA data portal (http://www.cgga.org.cn/) respectively. The TCGA RNAseq data (level 3) shows the gene-level transcription estimates, as in log2(x + 1) transformed RSEM normalized count. The CGGA data displays the gene expression level as fragments per kilobase transcriptome per million fragments (FPKM), which has been standardized. Expressed gene defined only if its expressed level is larger than zero at half of samples. Only patients with a clear information of survival and detailed history of radiotherapy and chemotherapy/ molecular therapy were included in the study. Finally, 456 cases from TCGA dataset and 159 cases from the CGGA dataset were included in the training set and validation set respectively. Table 1 summarized the clinical characteristics and therapy information of the training set and validation set. The workflow presentation of this study is shown in Fig. 1.

Identification of survival-related genes and construction of the prognostic model
By using the rbsurv package in R, a robust likelihood-based survival model was conducted to identify survival-related genes (Cho et al., 2009). The rbsurv package is a software program, which selects survival-associated genes based on the partial likelihood of the Cox model and adopts a cross-validation approach for robustness. According to the description of the rbsurv package, prior gene selection such as univariate survival modeling can be performed if necessary and the univariate survival modeling can be performed in this software program. Compared to the survival modeling without an adjustment of risk factors, the robust likelihood-based survival model can improve the ability to discover truly survival-associated genes by modeling genes after adjusting for certain risk factor. Thus, we directly conduct a robust likelihood-based survival model to screen for the prognostic genes. The robustness test was performed on 20,530 genes and 456 samples. After 10 iterations, 29 prognostic related genes were selected. With the help of glmnet and survival package in R, least absolute shrinkage and selection operator (LASSO) regression and the multivariable Cox proportional hazard regression method were used to further identify the survival-related prognostic model. The same approach was used to identify gene signatures for endometrial carcinoma (Ouyang et al., 2019). At last, three prognostic survival-related genes that were independent survival predictors and their regression coefficients were obtained at a threshold of p < 0.05. Based on the median expression value of each survival-related gene, we dichotomized 456 LGGs patients into low and high expression groups and compare the survival rate between the two groups by Kaplan-Meier plots and Log-rank test. According to the estimated regression coefficients, a prognostic risk score for each patients was then calculated . The risk score = (0.4470 × expression level of WEE1) + (−0.1530 × expression level of CRTAC1) + (−0.3723 × expression level of SEMA4G). With the three-gene signature, 456 LGGs patients were divided into high-risk and low-risk groups with the median risk score as the cut-off value. Kaplan-Meier curves were performed to estimate and compare the survival for TCGA LGGs patients with a high score or a low score. The receiver operating characteristic (ROC) curve and area under the curve (AUC) were applied to evaluate the prediction accuracy of the risk score model. Furthermore stratified survival analysis was performed in patients with different age group (younger, old), gender (male, female) and pathologic grade (G2, G3). Univariate and multivariate Cox hazard regression analysis were conducted for the potential prognostic factors such as age group (younger vs. old), gender (male vs. female), pathologic grade (G2 vs. G3), radiation therapy (Yes vs. No), molecular therapy (Yes vs. No) and risk score (High vs. Low).

Validation of the prognostic model in the CGGA
The prognostic model was validated in the CGGA mRNAseq_325 cohort. Only patients with a clear information of survival, detailed history of radiotherapy and chemotherapy were included in the study. Finally, 159 cases from the CGGA mRNAseq_325 cohort were included in the validation set.

Exploring co-expression genes by WGCNA
To explore the regulatory network of the three genes, Gene Co-expression Network Analysis (WGCNA) was performed in training set by the R package WGCNA (Langfelder & Horvath, 2008). The top 50% variance of genes were selected for WGCNA. In other words, WGCNA based on 456 samples and 10,256 genes. First, RNAseq data were filtered to reduce outliers. Using the absolute value of the correlation between the expression levels of transcripts, a co-expression similarity matrix was constructed. Then, the co-expression similarity matrix was transformed to the adjacency matrix by choosing nine as a soft threshold. Co-expression gene module was established by the topological overlap measure. In order to identify the significance of each module, gene significance (GS) was calculated to estimate the correlation between genes and sample traits. Module significance (MS) was defined as the average GS within modules and was calculated to measure the correlation between modules and sample traits (vital status). Finally, the "vital status" related modules that contain the three genes as members and genes belong to such modules were identified. Genes interacted with those three genes were screened and the co-expression network was constructed by Cytoscape software (Shannon et al., 2003).

Functional enrichment analysis
Using Enrichment analysis of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway were conducted via the clusterProfiler package in R language (Yu et al., 2012) for those genes that belong to the "vital status" related modules associated with the three genes. Benjamini-Hochberg (BH)-adjusted p-value < 0.05 were considered significant.

RESULTS
Three prognostic genes were identified in TCGA dataset and validated in CGGA dataset A total of 456 patients and 20,530 genes were included in the TCGA-LGG to train the prognostic model. The robust likelihood-based survival model found 29 survival-related genes, 13 genes were obtained through LASSO Cox method (Fig. 2). We further reduced the dimensionality of these high-dimensional data by multivariate Cox proportional hazard regression model. Finally, three genes that were independent survival predictors were identified as survival prediction signature. Those three genes included in the model were WEE1, SEMA4G, CRTAC1. It has been reported that WEE1 is closely related to the growth, invasion and migration of glioma . Currently, there is no study revealing the role of SEMA4G and CRTAC1 in gliomas. After calculating the risk score, patients were divided into a high-and low-risk group based on the median cut-off point of the risk score. The three-gene signature risk score distribution is shown in Fig. 3A. Besides, the relationship between risk score and the status of the LGGs was calculated (Fig. 3B). As shown in the heat map of the Fig. 3C, a remarkable high expression was noted for WEE1 in the high-risk group, while a lower expression was observed for the other genes in the high-risk group (Fig. 3C). Patients in the high-risk group were significantly worse off the overall survival time compared to the low-risk group (p < 0.0001) (Fig. 4A). The area under ROC curve of the signature for 1-, 3-and 5-year overall survival was 0.908, 0.878 and 0.827, respectively, in training set. (Fig. 4B). A similar result can be noted in the validation dataset (Fig. 4C). The area under ROC curve of the signature for 1-, 3-and 5-year overall survival was 0.781, 0.806 and 0.807, respectively, in validation set. (Fig. 4D). Moreover, the predicting power of the risk score model was not decreased in subgroup analysis for age group (younger, p = 0.00012; old, p < 0.0001), gender (male, p < 0.0001; female, p < 0.0001), and pathologic grade (G2, p = 0.00013; G3, p < 0.0001) in the training set (Figs. 5A-5F). The same trend can be observed in the validation dataset (Figs. 6A-6F). For the WEE1, the member of high expression group had significantly shorter survival than those in low expression group (p < 0.0001) (Fig. 7A).
For the SEMA4G and CRTAC1, the member of high expression group had significantly longer survival than those in low expression group (p < 0.0001) (Figs. 7B and 7C). The expression level of WEE1 was significantly higher in grade III compared to grade II (p < 0.0001), while the other are opposite (Fig. 8A). These results can also be verified in the validation dataset (Figs. 7D-7F and 8B).

Calculation of module-trait correlation in LGGs and module visualization of the network connections
Using the R package WGCNA, gene modules were identified based on the top 50% variance of genes. To analyze the relationship between gene modules and sample clinical information, we used the module eigengene (ME) as the overall gene expression level of the corresponding modules and calculated correlations with clinical phenotypes, for example, vital status. We obtained 16 gene modules (Figs. S1A-S1D) with size ranging from 31 to 1,501 genes. We assigned each co-expression module an arbitrary color for reference: black, blue, brown, cyan, green, greenyellow, lightcyan, magenta, midnightblue, pink, purple, red, salmon, tan, turquoise and yellow. These modules contained 449,1,352,850,46,519,91,31,201,43,336,135,462,51,90, 1,501 and 845 genes, respectively. As a single group, the non-co-expressed group designated as "grey" based on the WGCNA developer's rationale. Vital status related modules, such as yellow, green, black modules that contain the three genes as members and genes belong to such modules were screened (Fig. S1D). Finally, 32 genes were discovered to be co-expressed with CRTAC1, 181 genes were co-expressed with WEE1, six genes with SEMA4G. We exported the screened genes and three prognostic survival-related genes into Cytoscape and constructed the co-expression network (Fig. 9). GO and KEGG analysis of screened genes interacted with three-gene signature For the "biological processes" (BP), chromosome segregation, nuclear division, mitotic nuclear division, organelle fission, mitotic sister chromatid segregation were the commonly enriched categories (Fig. 10A). For the "cellular component" (CC), the enriched categories were correlated with condensed chromosome, chromosome/centromeric region, chromosomal region, kinetochore, condensed chromosome/centromeric region    10B). For the "molecular function" (MF), these screened genes mainly enriched in microtubule binding, tubulin binding, histone kinase activity, DNA-dependent ATPase activity, protein serine/threonine kinase activity (Fig. 10D). KEGG pathway enrichment analysis suggested that cell cycle was the most important pathway for these selected genes. The following pathway also involved many screened genes, including, oocyte meiosis, progesterone-mediated oocyte maturation, fanconi anemia pathway, homologous recombination (Fig. 10C). Additionally, For the GO analysis, these three co-expression gene modules (yellow, green, black) enriched results can been seen in Figs. S2 and S3.
A comparison between our and other models 2019), TERT promoter mutations (Chan et al., 2015), CIC loss (Sahm et al., 2012), FUBP1 loss (Sahm et al., 2012) and PTEN loss (Sabha et al., 2014) and the above prognostic marker contribute to clinicians to understand the mechanism of low-grade gliomas. The complex pathogenesis of LGG encourages us to explore more prognostic markers for further understand it and develop an efficient treatment.
In this study, we identified three genes that were closely correlated with LGG prognosis. Considering the differentially expressed genes (DEGs) between tumor and normal tissue might not be associated with survival  and the univariate survival modeling can be performed in rbsurv package, the robust likelihood-based survival model was performed using the rbsurv package in R as the first step instead of screening for DEGs and conducting the univariate Cox regression. LASSO and Cox proportional hazard regression model are widely used to generate prognostic genes in the context of high dimensional data, thus were adopted in subsequent analysis. Compared to a single predictive biomarker, integrating multiple biomarkers into a signature is believed to be more predictive. The risk score calculated by the risk model was considered to have good predictive capabilities and was demonstrated to be an independent prognostic factor after adjusting the effects of age, sex, tumor grade, molecular therapy and radiation therapy. The risk score was confirmed to be effective in different age groups, gender and pathologic grade. Regardless of the training set or the validation set, the AUC value of 1-, 3-and 5-year was greater than 0.75. The pathologic grade and the risk level were confirmed to be independent prognostic factors both in training set and validation set.
In order to construct a co-expression network of the three genes, WGCNA was used in the training set. We found the survival related modules to which these three genes belong, and extracted the genes of these three modules to construct a co-expression network. Finally, 32 genes were discovered to be co-expressed with CRTAC1, 181 genes were co-expressed with WEE1, six genes with SEMA4G. The co-expression network of the three genes is visualized by Cytoscape in Fig. 9.
Based on the result of GO and KEGG enrichment analysis of these co-expression genes, "condensed chromosome" was the most significant enrichment in CC. Coincidentally, Rebecca et al. found that interference with chromatin condensation results in failure to fully activate DNA damage response (Burgess et al., 2014) and the DNA damage response triggers multiple cellular events including activation of DNA repair pathway, arrest of the cell cycle to allow time for repair, and, in certain cases, initiation of senescence or apoptosis programs (Ciccia & Elledge, 2010). For the BP category, chromosome segregation was the most enrichment and research has proven that chromosome instability contributes to the development of genetic heterogeneity in tumors and allows the outgrowth of tumorigenic cells with advantageous karyotypes (Conde et al., 2017). Regarding the MF category, microtubule binding was the most influential and the drug targeted microtubule was proven effective in glioma. For example, the drug EM011 functions by disrupting microtubule dynamics and modules several oncogenic mediators causing a decrease in cell viability, proliferation and migration/invasion in the astrocytoma cell lines (Ajeawung, Joshi & Kamnasaran, 2013). For KEGG pathway enrichment analysis, cell cycle was the most significant pathway. Stephen D has explained that signaling pathway converge on the cell cycle machinery to regulate developmental genes and execute cell fate decisions (Dalton, 2015).
The three-gene signature provided a wealth of potential biological and therapeutic information about LGG.WEE1 (WEE1 G2 checkpoint kinase), located on the short arm of human chromosome 11 (11p15.4), encodes a nuclear protein, which is a tyrosine kinase belonging to the Ser/Thr family of protein kinases. The protein catalyzes the inhibitory tyrosine phosphorylation of CDC2/cyclin B kinase, and appears to coordinate the transition between DNA replication and mitosis by protecting the nucleus from cytoplasmically activated CDC2 kinase. WEE1 has been confirmed that its protein expression increases with malignancy grade (Music et al., 2016). Moreover, patients with high WEE1 expression had poor survival than did patients with low WEE1 expression in grade III gliomas (Music et al., 2016). CRTAC1, cartilage acidic protein 1, a novel human marker which allowed discrimination of human chondrocytes from osteoblasts and mesenchymal stem cells in culture can be divided into CRTAC1-A and CRTAC1-B two subtypes according to the last exon. Previous study found that inhibition of CRTAC1 reduces ultraviolet B irradiation induced-apoptosis through P38 mitogen-activated protein kinase and jun Amino-Terminal kinase pathway (Ji et al., 2016). It means that the relationship between the expression of CRTAC1 and apoptosis is positively correlated after ultraviolet B irradiation. To some extent, this is consistent with our finding that CRTAC1 high expression prolongs survival time in LGG patients. However, its detailed mechanism in LGG remains to be further explored. Semaphorins are a large family of conserved secreted and membrane associated proteins which possess a semaphoring (Sema) domain and a PSI domain in the N-terminal extracellular portion. Based on sequence and structural similarities, semaphorins are put into eight classes: invertebrates contain classes 1 and 2, viruses have class 8, and vertebrates contain class 3-7. Semaphorins serves as axon guidance ligands via multimeric receptor complexs, some containing plexin proteins. Semaphorins and Plexins are cognate ligand-receptor families that regulate important steps during nervous system development (Maier et al., 2011). A low-expression of SEMA4G was detected in colorectal cancer tissues compared with normal tissues. It means that SEMA4G might be a tumor suppressor gene related to colorectal cancer (Wang et al., 2008). However, little work has been done to elucidate the role of SEMA4G in glioma. Our study demonstrated that SEMA4G was significantly down-regulated in grade III patients compared to grade II and the high-expression of SEMA4G was associated with a good prognosis in LGG patients. Further work is needed to explore its functions in LGG. To sum up, the three-gene signature could predict LGG survival based on a risk score model. We firmly believed that these genes are potential prognostic markers or therapeutic targets for LGG patients. Nevertheless, the molecular mechanisms how the three-gene signature affected the prognosis of LGG patients should be further elucidated by a series of experiments.

CONCLUSION
In conclusion, our study identified a 3-gene model that showed satisfactory performance in predicting short-and long-term survival of LGG patients compared to other models. Moreover, our finding provided new insights into the pathogenesis and prognosis of LGG.