An in silico approach to the identification of diagnostic and prognostic markers in low-grade gliomas

View article
Bioinformatics and Genomics
  1. May 31, 2023: Minor Correction: Aslı Suner's affiliation omitted the name of the university: Ege University. The affiliation should read, in full: Faculty of Medicine, Department of Biostatistics and Medical Informatics, Ege University, Izmir, Turkey


Tumors located in the central nervous system (CNS) are defined based on their cells of origin and histopathological characteristics. Gliomas are neuroepithelial and highly heterogeneous tumors originating from neuroglial progenitor cells (Forst et al., 2014; Wirsching & Weller, 2016). Gliomas represent 27% out of all primary brain tumors, and account for 80% of the malignant primary brain cancers (Wirsching & Weller, 2016).

Traditionally, gliomas are classified, according to the World Health Organization (WHO), into low grade gliomas (LGGs) (grade I and II tumors) and higher-grade gliomas (HGGs) (grade III and IV tumors) (Wirsching & Weller, 2016). The WHO also added molecular and genetic parameters that improved the diagnostics and prognostics of this type of cancer. Major changes occurred in the subcategorization of gliomas because genetic mutations and prognostic factors may vary highly (Delgado-Lopez et al., 2017; Ferris et al., 2017). Currently, aside from the traditional system, classification is done based on the type of glioma cells such as astrocytoma or oligodendroglioma, integrated with the mutations status of genes such as the isocitrate dehydrogenase gene (IDH) 1 and IDH2, ATRX and TP53, as well as 1p/19q deletion (Dono et al., 2021).

LGGs generally occur in people between 20 and 40 years old. While the most common symptoms among patients is the presence of seizures, patients may have headaches as well. Primary tumor location greatly affects both the symptoms and severity of LGG. Seizures are thought to be due to the invasion of cancer tissue into the cortex region of the brain (Schiff, 2017). Although LGG constitutes 20% of all primary brain tumors, the survival period of LGG patients ranges between 4.7 and 9.8 years. Therefore, LGG patients are expected to survive longer compared to HGG patients (Kumthekar, Raizer & Singh, 2015). The most malignant form of gliomas is glioblastoma (GBM) and is recognized to be one of the deadliest brain tumors affecting adults. GBMs are classified, according to WHO, as grade III and grade IV tumors (Chen, Cohen & Colman, 2016; Claus et al., 2015).

There are different options of treatment for patients with LGG. Notably, some LGG patients are not even aware of this condition. Therefore, timely and accurate diagnosis is critically important before tumor progresses to a higher grade. Compared to the short survival rate of patients with HGGs or GBMs, patients with LGGs show longer rates of survival, which raises controversies regarding the decision of treatment administration, medication dosage and associated side effects (Cancer Stat Facts: Leukemia—Acute Myeloid Leukemia (AML), The conventional treatment options include surgical resection, radiotherapy and chemotherapy (Rossi et al., 2019). Therefore, accurate diagnosis and prognosis are of critical importance, since inappropriate selection of treatment can result to a decrease in LGG patients’ quality of life (Carabenciov & Buckner, 2019; Tom et al., 2019).

The great advancements in high-throughput technologies allowed the generation of a great amount of LGG-relevant gene expression data (‘big data’) deposited to publicly available databases. This allowed us, in the current study, by employing an integrative and robust in silico methodology, diverse state-of-the-art software and stringent criteria to process, analyze and interpret publicly available LGG-relevant transcriptomic data. To this end, large-scale RNA sequencing data were exploited in order to identify genes differentially expressed between LGG samples and corresponding normal brain specimens. Differentially expressed genes detected by three different methodologies were subjected to weighted co-expression network analysis to identify genes with similar expression patterns that comprise functionally distinct modules. Genes within these modules were further correlated with important LGG clinical traits and hub genes were screened by applying network-based methods. In this way, a set of eighteen genes was revealed, which could be considered as potential diagnostic biomarkers for discriminating LGG patients from healthy individuals. Furthermore, the predictive value of these genes for the overall survival of LGG patients was investigated. Previous efforts have focused on the expression of long non-coding RNAs (Reon et al., 2016) or prognostic biomarkers (Liu et al., 2021b) in LGGs. To the authors’ knowledge, the present study is the first comprehensive and updated study, where a massive amount of data derived from three major resources was utilized towards the identification of candidate diagnostic and prognostic markers in LGG.

Materials and Methods

A graphical illustration of the overall procedure followed in this study is depicted in Fig. 1.

Illustration of the overall pipeline followed in this study.

Figure 1: Illustration of the overall pipeline followed in this study.

All analyses were performed in the R statistical computing environment v.4.2.0 (R Core Team, 2022), unless otherwise stated.

Data acquisition and preprocessing

Acquisition of low-grade glioma data from TCGA

LGG data were downloaded from The Cancer Genome Atlas (TCGA) ( by using the ‘TCGAbiolinks’ R package (Colaprico et al., 2016). Harmonized RNA-Seq data were downloaded with the GDCprepare function. A total of 529 samples were downloaded and processed via the GDC portal ( from the TCGA database. Then the downloaded transcriptomic data were preprocessed by using the TCGAanalyze_Preprocessing function of the ‘TCGAbiolinks’ package, in order to detect low correlated samples, termed ‘outliers’, which resulted to a count data matrix to be used in further steps.

Retrieval of brain transcriptomic data from GTEx

A total of 120 corresponding normal brain samples were downloaded from the Recount2 project (Collado-Torres et al., 2017) of the Genotype-Tissue Expression (GTEx) (GTEx Consortium, 2013) ( database, by using the TCGAquery_recount2 function of the ‘TCGAbiolinks’ package, as ranged summarized experiment (RSE) objects. Count data were scaled, by using the scale_counts function of the ‘Recount’ package.

Acquisition and processing of LGG data from GEO

The public repository of NCBI GEO (Gene Expression Omnibus) DataSets (Edgar, Domrachev & Lash, 2002) was thoroughly searched for LGG gene expression data using the keywords: (“low grade glioma” OR “low grade gliomas”) AND (“homo sapiens” OR “human”), following the PRISMA ( guidelines (Fig. 2). In order to be considered eligible, the studies had to fulfill the following inclusion criteria: (i) human LGG tissue samples, (ii) gene expression data available, (iii) more than three LGG tissue samples, (iv) availability of clinical metadata, (v) inclusion of more than 5,000 genes in the dataset, (vi) wild-type genes. Accordingly, the studies were excluded based on the following criteria: (i) studies on animal models or cell lines, (ii) not tissue samples (e.g., blood), (iii) treated samples (e.g., drugs or radiation). Collectively, 654 relevant datasets were retrieved from GEO DataSets (up to 25 April 2022). The eligible dataset series GSE184941, which contains mRNA profiles of human low- and high-grade glioma samples, was selected for further analysis. A total of 79 LGG out of 180 samples in GSE184941 were analyzed in this study.

PRISMA flowchart of selecting GEO records.
Figure 2: PRISMA flowchart of selecting GEO records.

For this analysis, the GTEx RNA-Seq pipeline was followed (Fig. 1). FASTQ files were downloaded by using the ‘SRA Tool Kit’ v.2.11.1 (Leinonen, Sugawara & Shumway, 2011). The raw RNA-Seq reads in the FASTQ files were aligned to the GENCODE Human Reference Genome Release 39 (Frankish et al., 2021) with the aligner ‘STAR’ v.2.6 (Dobin et al., 2013). Those samples where the percentage of the uniquely mapped reads was lower than 70% were excluded, and 34 samples out of the 79 GEO-derived LGG samples were retained for subsequent analyses.

The rsem-prepare-reference function in ‘RSEM’ v.1.3.3 (Li & Dewey, 2011) was used for gene expression level estimation (i.e., quantification).

All count data of the 34 LGG samples were collected and combined into a matrix file with the corresponding sample ID and gene ID using the rsem-generate-data-matrix function of ‘RSEM’.

Merging of TCGA-GTEx and GEO-GTEx data

The preprocessed TCGA-derived LGG count data and the retrieved GTEx count data were merged based on their matching gene IDs. Merged TCGA-GTEx count data were further normalized for ‘GC content’ using the function TCGAanalyze_Normalization of ‘TCGABiolinks’ in the EDASeq protocol (Risso et al., 2011). Then, quantile filtering was applied with a 25% cutoff.

GEO-LGG and GTEx-extracted count data were merged based on matching gene IDs. Then, batch correction was applied by ‘ComBat-seq’ (Zhang, Parmigiani & Johnson, 2020) (, an adjustment tool that uses negative binomial regression to detect batch effects. The results from ‘ComBat-seq’ indicated that the merged GEO-GTEx count data required no batch correction.

Principle component analysis of transcriptomes

Principal Component Analysis (PCA) (Jolliffe & Cadima, 2016) was applied for detecting variances among glioma and normal groups. Before performing PCA, raw read counts of both merged TCGA-GTEx and GEO-GTEx data were normalized by converting them to FPKM values, with the count2fpkm function of ‘RNAAgeCalc’ (Ren & Kuan, 2020). The FPKM data were also filtered, so as the expression value of the individual genes was greater than ‘1’ in at least half of the samples in each group (TCGA-GTEx and GEO-GTEx). Next, the merged and filtered FPKM data were log2 transformed and the R function prcomp was used to generate PCA plots. For PCA plot visualization, the fviz_pca_ind function of the ‘factoextra package (Kassambara & Mundt, 2020) was used.

Differential gene expression analysis

To detect statistically significant differentially expressed genes (DEGs) between the ‘glioma’ and ‘normal’ samples, filtered and merged read count data with sample IDs and genes were provided as input to three software tools for conducting differential gene expression analysis (DGEA), namely, ‘edgeR’ v3.34.1 (Robinson, McCarthy & Smyth, 2010), ‘limma’ v3.48.3 (Ritchie et al., 2015) and ‘DESeq2’ v1.32.0 (Love, Huber & Anders, 2014). The output DEGs were selected based on an absolute log2 fold-change (|log2FC|) ≥ 2; the threshold for the False Discovery Rate (FDR)-adjusted p-value was set less than 0.05. To visualize the DEGs, heatmap plots of DEGs were generated by using the ‘pheatmap’ package v1.0.12 (Kolde, 2019).

Weighted correlation network analysis

Weighted Gene Co-Expression Network Analysis (WGCNA) (Langfelder & Horvath, 2008) was performed of the common DEGs detected by ‘edgeR’, ‘limma’ and ‘DESeq2’ of the LGG samples derived both from TCGA and GEO. In this study, the ‘WGCNA’ R package v1.70-2 (Langfelder & Horvath, 2008) was utilized to identify highly correlated gene patterns, clusters, modules and module-LGG clinical trait relationships.

The FPKM matrix data of the detected DEGs from TCGA and GEO were log2(data+1) transformed, so as to normalize the FPKM data. The function goodSamplesGenes of the ‘WGCNA’ package was used to detect any possible outliers.

The pickSoftThreshold function of ‘WGCNA’ was used for the selection of a suitable soft-thresholding power. The adjacency matrix of gene expression data was created by using the selected power with the adjacency function of the ‘WGCNA’ package.

To eliminate any noise, the adjacency matrix was transformed to Topological Overlap Matrix (TOM) and then dissimilarity was calculated by using the ‘WGCNA’ function TOMsimilarity(adjacency) based on dissTOM = 1 – TOM. In this way, only highly correlated genes are grouped together. Next, trees of genes (i.e., dendograms) were created by hierarchical clustering with the hclust function.

Modules in the dendrograms were detected using the cutreeDynamic function of ‘WGCNA’. The highly correlated genes detected in the previous step were assigned into color-coded modules, and modules with similar expression profiles were detected and merged. To this end, the ‘eigenene’ (i.e., first principal component) of each module was calculated to estimate co-expression similarity and then were clustered again. For this purpose, the functions ‘moduleEigengenes’, ‘cor’, again ‘hclust’ and ‘mergeCloseModules’ of the ‘WGCNA’ package were used. The plot of these modules was generated by using the plotDendroAndColors function of ‘WGCNA’.

To visualize the weighted gene co-expression networks, heatmap plots were generated with the TOMplot function of ‘WGCNA’, where each row and column correspond to a gene and sample, respectively.

Association of modules with LGG clinical traits

Clinical data of samples derived from TCGA were acquired using the GDCprepare function of ‘TCGABiolinks’. The clinical traits of interest from TCGA were Primary Diagnosis, Age at Diagnosis, Vital Status, Sample Type, Site of Resection Biopsy, Prior Treatment, Gender, Race, and Tissue Organ Origin.

Clinical data of samples obtained from GEO were downloaded from the GEO database ( Selected clinical traits for the GEO samples were Age, Progressed Tumor Pathology, IDH Mutant Status, Vital Status, Gender, Initial Tumor Pathology, and Patient Initial Grade.

The relationships between the gene modules and LGG clinical traits were investigated by using the eigengene information of each module. To this end, traits without numbers were converted to their mathematical abbreviations by using the cor and corPvalueStudent functions. Heatmaps of this analysis were generated with the labeledHeatmap function, and include the Spearman correlation coefficient of the modules with clinical traits and the corresponding p-value.

Gene co-expression and protein-protein interaction network construction

Gene co-expression and protein-protein interaction networks were generated in this study. Cytoscape v3.9.1 (Shannon et al., 2003) (, an open source platform, was used for network analysis and visualization.

To this end, the adjacency matrix (i.e., co-expression data matrix) of genes of the selected modules were filtered by setting the threshold for the gene co-expression value at 0.5, so as to increase the robustness of the study.

Protein-Protein Interaction (PPI) network analysis was performed by providing the genes of the co-expression data matrix as input to the database Search Tool for the Retrieval of Interacting Genes (STRING) v1.7.0 (Szklarczyk et al., 2021). The direct and indirect as well as the physical and functional associations of the corresponding gene products were investigated.

The gene-gene and protein-protein association data were uploaded to the Cytoscape and the Cytohubba (Chin et al., 2014) plugin of Cytoscape, which allows network investigation by eleven different ranking algorithms: Degree method (Deg), Maximum Neighborhood Component (MNC), Density of Maximum Neighborhood Component (DMNC), Maximum Clique Centrality (MCC), Closeness (Clo), EcCentricity (EC), Radiality (Rad), BottleNeck (BN), Stress (Str), Betweenness (BC), Edge Percolated Component (EPC). In this way, the most highly connected/correlated genes/proteins in the gene co-expression and PPI networks were selected by using the node-degree filter.

Overall survival analysis

The prognostic value of the ‘hub’ genes detected in the previous section was explored. To assess whether the expression levels of these genes are associated with the overall survival (i.e., a person is either alive or dead) of LGG patients, the web-based tool GEPIA (Gene Expression Profiling Interactive Analysis) (Li et al., 2021a; Tang et al., 2017) version 2 (, as well as the Chinese Glioma Genome Atlas (CGGA) (Zhao et al., 2021) (, were applied. GEPIA2 retrieves and analyzes survival data from the integrated TCGA pan-cancer clinical data resource (Liu et al., 2018). The LGG patient cohorts were divided into ‘high-risk’ and ‘low-risk’, by setting the cutoff values for high and low gene expression at 50%; samples with gene expression level higher/lower than the 50% cutoff are considered as high/low-expression patient cohorts, respectively. CGGA is a comprehensive repository of multi-omics data derived from Chinese glioma patients.


Comprehensive characterization of LGG transcriptomic profiles compared to normal brain

To eliminate any batch effects and differences, transcriptomic profiling analysis was performed on the RNA-Seq data of the LGG and normal brain samples from GTEx, as well as GEO and GTEx. To this end, the distribution of the transcriptome profiles of the 529 TCGA, 120 GTEx and 34 GEO samples were investigated via principle component analysis (PCA). As anticipated, the TCGA and GEO vs GTEx samples form separate clusters (Fig. 3A), suggestive of distinct transcriptomic profiles. In the GEO vs GTEx PCA plot (Fig. 3A, right), nine GTEx outlier samples were detected (SRR599510, SRR2157460, SRR612563, SRR821602, SRR1488651, SRR2167030, SRR2166648, SRR1077405 and SRR661973).

Transcriptome profiles and differentially expressed genes between TCGA and GTEx, and GEO and GTEx.

Figure 3: Transcriptome profiles and differentially expressed genes between TCGA and GTEx, and GEO and GTEx.

(A) Principal component analysis of the transcriptomic profiles of TCGA and GTEx (left); GEO and GTEx (right). Red and blue dots represent LGG and normal samples, respectively, from TCGA and GTEx (left); GEO and GTEx (right). (B) Heatmap plots of significant DEGs generated by edgeR between TCGA and GTEx (left); GEO and GTEx (right). Each row represents a gene. LGG and normal brain samples in the top legend are denoted by red and blue color, respectively. Red and blue color in the heatmap represents up-regulated and down-regulated genes, respectively.

After the exclusion of outliers, DGEA was conducted using edgeR, DESeq2 and limma for detecting statistically significant DEGs. DGEA between TCGA and GTEx resulted to 804 up-regulated, 3,972 down-regulated genes from edgeR; 580 up-regulated, 4,332 down-regulated genes from DESeq2; and 709 up-regulated, 4,475 down-regulated genes from limma. Likewise, DGEA between GEO and GTEx resulted to 4,117 up-regulated, 4,382 down-regulated genes from edgeR; 4,203 up-regulated, 4,350 down-regulated genes from DESeq2; and 4,861 up-regulated, 4,090 down-regulated genes. Heatmaps of the DEGs generated by edgeR of TCGA vs GTEx, and GEO vs GTEx, are shown in Fig. 3B. Collectively, 4,465 DEGs were detected between TCGA and GTEx, including 528 up-regulated and 3,937 down-regulated genes (Fig. S1), and 7,975 DEGs between GEO and GTEx, including 3,966 up-regulated and 4,009 down-regulated (Fig. S2). In addition, there was a rather significant 60% overlap between the TCGA-GTEx and GEO-GTEx DEGs with respect to TCGA DEGs.

Weighted gene co-expression network analysis of DEGs

The 4,496 common DEGs of the 529 TCGA samples, as well as the 8,049 common DEGs of the 34 GEO samples, were used for the construction of co-expression networks by WGCNA. To this end, first a soft-thresholding power was chosen by taking into consideration the scale-free network topology criterion; a soft-thresholding power of six for the TCGA samples and four for the GEO samples (Figs. S3 and S4). A dendogram of DEGs was generated by using the TOM-based dissimilarity and hierarchical clustering (Fig. S5). Then, by setting the cutoff height at 0.25, outliers within the detected modules were merged (Fig. 4A). Branches within the dendogram represent modules, and each module is denoted by an assigned color. Genes are represented by short vertical lines within leaves. WGCNA resulted to seven modules based on TCGA data (Fig. 4B, left): blue (223), brown (186), grey (973), red (39), turquoise (1,073), and yellow (140). On the other hand, WGCNA resulted to twenty modules based on GEO data (Fig. 4B, right): black (191), blue (1,096), brown (523), cyan (79), green (337), greenyellow (114), grey (75), grey60 (58), lightcyan (74), lightgreen (49), lightyellow (38), magenta (142), midnightblue (79), pink (162), purple (137), red (270), royalblue (35), salmon (101), tan (102), turquoise (1,147), and yellow (457); the number of genes within each module is shown into parentheses.

Module detection and module-clinical trait relationships by WGCNA.

Figure 4: Module detection and module-clinical trait relationships by WGCNA.

(A) Cluster dendogram tree of module eigengenes. Red line indicates the cutoff height (0.25). TCGA data module dendogram has no outlier modules (left). GEO data module dendogram has the blue and brown modules as outliers (right). (B) Dendogram of DEGs by hierarchical clustering with TOM dissimilarity (upper part). Dendogram of the modules with their assigned colors before and after module merging; TCGA (right) and GEO (left). (C) Heatmap plots of the relationships between merged modules and LGG clinical traits derived from TCGA (right) and GEO (left). In each cell, statistical significance is indicated by the Pearson correlation coefficient and p-value. Red and blue color denotes a positive and negative correlation, respectively.

Associations of detected modules with clinical traits of LGG

The investigated clinically significant traits of LGG samples, from both TCGA and GEO, are described below:

  • Age at Diagnosis: Age the patient was diagnosed based on the number of years.

  • Age: Age of patients.

  • Gender: Sex of the patient (male or female).

  • IDH Mutant Status: If patients carry a mutation in the IDH gene.

  • Initial Tumor Pathology: Initial state of the tumor pathology.

  • Patient Initial Grade: Glioma grade that the patient was diagnosed.

  • Primary Diagnosis: Date of diagnosis based on the number of days.

  • Prior Treatment: Information about early treatments after diagnosis.

  • Progressed Tumor Pathology: Current stage of tumor pathology.

  • Race: Ethnic group of the LGG patients.

  • Sample Type: Type of material extracted from regions of the brain.

  • Site of Resection Biopsy: Part of the brain that the resection was performed.

  • Tissue Organ Origin: Origin of the primary infected organ/tissue.

  • Vital Status: Current vital status of the patient (alive or dead).

In our study, those modules significantly correlated with clinical traits based on a Spearman correlation coefficient above 0.3 and p-value less than 0.05 were detected. Thus, two modules (blue and yellow) comprised of 363 genes, in TCGA samples (Fig. 4C, right), as well as seven modules (salmon, blue, green, cyan, pink, lightgreen, red) including 2,617 genes, in GEO samples (Fig. 4C, left), are significantly associated with clinicopathological traits. The genes contained in each of those modules were selected for further analysis. There was also a 17% overlap between the clinically related DEGs of TCGA and GEO with respect to the TCGA-derived DEGs.

LGG molecular networks

To detect the most biologically important LGG-relevant genes/gene products, the associations among them were examined by network-based methods For this purpose, the genes comprising the modules associated with clinically significant characteristics were loaded to the Cytoscape platform in order to construct co-expression and protein-protein interaction networks. The gene co-expression network analysis resulted to 76 nodes/genes and 522 edges/interactions (Fig. 5A). There was a total of 35 hub genes with a node degree greater than 10. The STRING protein search option of Cytoscape resulted to a total of 73 nodes/proteins and 558 edges/interactions (Fig. 5B). Based on PPI network analysis, there are 38 hub proteins, the node degree of which was greater than 10. Finally, by combining the results of both networks, eighteen significant common hub genes were detected, namely, CD74, CD86, CDC25A, CYBB, HLA-DMA, ITGB2, KIF11, KIFC1, LAPTM5, LMNB1, MKI67, NCKAP1L, NUSAP1, SLC7A7, TBXAS1, TOP2A, TYROBP, and WDFY4. All detected hub genes were found to be up-regulated in the TCGA-derived LGG samples.

Gene co-expression and protein-protein interaction networks.

Figure 5: Gene co-expression and protein-protein interaction networks.

(A) Co-expression network of TCGA DEGs found in clinically significant modules. The detected hub genes are represented by bigger nodes. Blue and yellow color indicates DEGs from the blue and yellow module, respectively. (B) Network analysis of the associations of the protein products of the clinically significant TCGA DEGs. Proteins are clustered into two distinct clusters and the product of the TBXAS1 gene acts as an intermodular node.

Prognostic potential of LGG hub genes

Moreover, the impact of these hubs on the overall survival of LGG patients was investigated. The elevated expression of the CD74, CD86, CDC25A, CYBB, HLA-DRA, ITGB2, KIF11, KIFC1, LAPTM5, LMNB1, MKI67, NCKAP1L, NUSAP1, SLC7A7, TBXAS1, TOP2A, TYROBP and WDFY4 hub genes was found to be significantly associated with unfavorable overall survival in LGG patients, as indicated by hazard ratio (HR) values above 1 and p-values < 0.05. For CYBB, a HR value of 1.3 was not statistically significant (p = 0.14) (Fig. 6). Therefore, LGG patients with enhanced expression of these pivotal genes have an increased mortality risk, and may die at a higher rate per unit time, as compared to patients with decreased expression of the corresponding genes. A trend of statistically significant (p < 0.05) lower survival probability correlated with higher expression of all eighteen genes was also observed in the survival curves of the Chinese primary glioma patients (Fig. 7).

GEPIA-based Kaplan–Meier curves depicting the prognostic significance of seventeen signature genes.

Figure 6: GEPIA-based Kaplan–Meier curves depicting the prognostic significance of seventeen signature genes.

(A) CD74, (B) CD86, (C) CDC25A, (D) HLA-DRA, (E) ITGB2, (F) KIF11, (G) KIFC1, (H) LAPTM5, (I) LMNB1, (J) MKI67, (K) NCKAP1L, (L) NUSAP1, (M) SLC7A7, (N) TBXAS1, (O) TOP2A, (P) TYROBP and (Q) WDFY4 for overall survival in LGG patients. The HR “HR(high)” and the corresponding p-values “p(HR)” are indicated. The 95% confidence intervals (CI) are denoted by dotted lines. The number of high-risk and low-risk LGG patient groups are represented by “n(high)” and “n(low),” respectively.
CGGA-based Kaplan–Meier curves illustrating the prognostic value of eighteen signature genes.

Figure 7: CGGA-based Kaplan–Meier curves illustrating the prognostic value of eighteen signature genes.

(A) CD74, (B) CD86, (C) CDC25A, (D) CYBB, (E) HLA-DRA, (F) ITGB2, (G) KIF11, (H) KIFC1, (I) LAPTM5, (J) LMNB1, (K) MKI67, (L) NCKAP1L, (M) NUSAP1, (N) SLC7A7, (O) TBXAS1, (P) TOP2A, (Q) TYROBP and (R) WDFY4 for overall survival in primary glioma patients. The number of glioma patients with “High” and “Low” expression of the target gene is shown in parentheses.


Gliomas are tumors that originate from glial cells which support the central nervous system (Forst et al., 2014; Wirsching & Weller, 2016). Due to the heterogeneity of gliomas, diagnosis is difficult, even for the experienced medical doctors. Scanned images of brain can be even confusing because some types of gliomas are quite similar. Therefore, glioma treatment options become highly debatable. In the case of misdiagnosis, treatments may result to decreased quality of life, due to after-treatment effects, or even reduced life expectancy (Carabenciov & Buckner, 2019; Tom et al., 2019). The aim of this study was to detect potential diagnostic and prognostic biomarkers between LGG-derived gene expression data and matching normal tissue.

In this way, by exploiting high-throughput transcriptomic data, we identified eighteen ‘hub’ genes with a node degree greater than ten, which represents the maximum number of those connections involving the minimum number of interconnected nodes (i.e., genes or proteins). This is because our goal was to select an optimal number of nodes that could represent a panel of genes, which can potentially discriminate LGG from normal brain tissue. Of those, CD74 codes for a protein that regulates antigen presentation to immune cells. In general, CD74 was demonstrated to play a role in the development of many types of cancers. Xu et al. (2021) revealed that CD74 expression was higher in glioma cells compared to normal brain cells. Moreover, the expression of CD74 was higher in HGG compared to LGG. It was suggested that CD74 is a biomarker of LGG for poor prognosis and also could be a therapeutic target for glioma (Xu et al., 2021). In another study, expression of CD74 was shown to contribute to resistance to treatments (e.g., themozolomide) in GBM (Presti et al., 2018). In our study, CD74 was found to be up-regulated in the TCGA-LGG samples.

CD86 is a protein-coding gene, expressed by antigen presenting cells. In a recent study, Ahmed et al. (2022) suggested that CD86 may actually serve as a potential biomarker for the prognosis of GBM and found that CD86 expression swas high in GBM patients. Also, Qiu et al. (2021) found that CD86 overexpression is an unfavorable marker for LGG prognosis, as in our study.

CYBB, found to be overexpressed both in TCGA- and GEO-LGG samples, encodes the beta chain of cytochrome which has a crucial role in ion channels. In a study, where screening for biomarkers of LGG was performed, CYBB was shown to be one of the key DEGs (Guo et al., 2022).

Although HLA-DMA was previously associated with breast cancer, it is also known to have a role in cancer progression and resistance to drugs. Yin et al. (2022) suggested HLA-DMA to have a prognostic value in GBM. We also found HLA-DMA to be up-regulated in the TCGA-LGG samples and represent a potential prognostic marker of LGG.

ITGB2 codes for the beta chain of an integrin heterodimer. Integrins are basically involved in cell surface-related functions such as cell adhesion or signaling. In a recent study, Xu et al. (2022) assessed the prognostic potential of ITGB2 in glioma patients. They found that the glioma grade increases relatively to the expression level of ITGB2, and suggested that ITGB2 could represent a novel predictor for glioma (Xu et al., 2022). Li et al. (2021b), in an effort to detect hub genes in LGG by using data derived from the TCGA and CGGA, found ITGB2 to be a key gene. Including ITGB2, several key genes of this study are also similar to the ones identified in our work (Li et al., 2021b).

KIF11, shown to be up-regulated in LGG, encodes a motor protein which plays a role in the spindle dynamics of cells. A study about the inhibitory effects of KIF11 on mice revealed that KIF11 is up-regulated in GBM, specifically in proliferating and migrating cells (Venere et al., 2015). One important reason that KIF11 represents a suitable target in GBM is that its mostly explored inhibitors are non neurotoxic (Gampa et al., 2020). Liu et al. (2022) found that KIF11 was up-regulated in gliomas and, also, was negatively associated with the survival of patients, based on TCGA-derived data. Moreover, it was suggested that KIF11 is necessary for the stemness of glioma cells and thereby cell proliferation.

KIFC1, which encodes a kinesin motor known to have a role in the clustering of the centrosomes of cancer cells, is involved in several types of cancers. Wu et al. (2021) found that there is elevated expression of KIFC1 in GBM and its inhibition suppressed proliferation and resistance to temozolomide. Dai et al. (2017) also showed that KIFC1 is up-regulated in grade III gliomas, like in our study.

LAPTM5 encodes protein E3, which is a transmembrane receptor. There are controversial findings regarding the role of LAPTM5 in gliomas. In one study, LAPTM5 was suggested to be a biomarker of poor prognosis in GBMs (Hajj et al., 2020). On the other hand, in another study, LAPTM5 was found to act as a tumor suppressor and its inhibition actually promoted invasiveness of GBMs based on in vitro and in vivo experiments (Berberich et al., 2020). Nevertheless, in our study we found LAPTM5 to be up-regulated in LGG and also to be a poor prognostic marker for the overall survival of LGG patients.

The product of LMNB1 is one of the three component proteins of nuclear lamina. In a recent study based on microarray data, it was shown that LMNB1 expression was higher in gliomas, and was also related with poor survival rates. Moreover, inhibition of LMNB1 suppressed glioma proliferation (Zhou et al., 2021). Pei et al. (2022) showed that LMNB1 is up-regulated in glioma cells, and overexpression of lamin genes causes abnormalities in human astrocytes which are actually glial cells that support the central nervous system.

NCKAP1L, found to be up-regulated in our study, was also shown to constitute a hub gene in LGG based on TCGA- and CGGA-derived data. In the same study, the genes LAMC1, CD74, HLA-DMA, ITGB2, TYROBP, LAPTM5, and CYBB (Su et al., 2019), which were also identified in our work, were shown to constitute hub genes.

SLC7A7 belongs to the family of solute carrier seven genes, which play a role in nutrition intake as transporters. In a study, where the genes of this family were examined in LGG by using TCGA-derived data, it was shown that SLCA7A was significantly up-regulated in patients with LGG and was also a poor prognostic marker (Liu et al., 2021a), similarly to the findings of our study. Moreover, Fan et al. (2013) examined a Chinese population of 736 glioma patients and 793 normal subjects and found that polymorphisms in SLC7A7 actually contribute to the heterogeneity of gliomas.

TBXAS1 codes for a member of the cytochrome P450 superfamily. These proteins are implicated in drug metabolism and also the synthesis of fat molecules. Elevated expression of TBXAS1 was found in diffuse LGG patients, which was also associated with poor overall survival (Wang et al., 2017), like in our study.

The product of TOP2A is a DNA topoisomerase with a critical role in transcription. A novel variant of TOP2A was identified in GBM patients, which was associated with altered transcriptional regulation and decreased survival (Gielniewski et al., 2020). Zhou et al. (2018) revealed a relationship between TOP2A up-regulation and poor prognosis in gliomas, and also suggested that MKI67, another hub gene found to be up-regulated in our study, is correlated with TOP2A overexpression.

TYROBP encodes a transmembrane signaling protein. In a study by Lu et al. (2021), where glioma datasets derived from Oncomine, GEPIA2 and CGGA were examined, TYROBP was found to be one of the hub genes up-regulated in LGG. Their results also revealed that TYROBP was implicated in pivotal signaling pathways, such as JAK-STAT, suggesting that TYROBP could actually contribute largely to LGG through its involvement in signaling networks (Lu et al., 2021).

WDFY4, found to be up-regulated both in TCGA- and GEO-LGG samples, is implicated in autophagy and also the processing and cross-presentation of viral and cancer antigens by dendritic cells (Theisen et al., 2018).

Symptoms of brain tumors, with headache being the most common one, are generally considered to be either non-significant or possibly benign tumors, thereby leading to poor diagnosis and prognosis. It has been demonstrated that blood tests can be used for the timely and accurate detection of circulating tumor cells (Macarthur et al., 2014; Jelski & Mroczko, 2021). In particular, Butler et al. (2019) tested a methodology, called ATR-FTIR, on blood tests, and successfully differentiated brain cancer and normal cells with a 93.2% sensitivity and 92.8% specificity. In another study, Podnar et al. (2019) used machine learning on the blood tests of patients with brain tumors in order to investigate any differences with 96% sensitivity and 74% specificity. In a similar manner, the eighteen signature genes identified in this study could be tested for their diagnostic potential in patients with brain neoplasms through blood tests.


In this study, an integrative in silico approach was applied, including differential gene co-expression analysis and network-based methods, in order to identify gene expression signatures in LGG. Eighteen hub genes were detected, all up-regulated in LGG, and significantly associated with unfavorable prognosis in LGG patients. Their importance in LGG was also assessed on the basis of their relationship with LGG clinical traits. Hence, these genes could be taken into consideration in the clinical setting as candidate diagnostic or prognostic biomarkers for the accurate, timely and cost-effective diagnosis of LGG, and for monitoring LGG patients’ progression. The findings of this study could lay the foundation for further in vitro and in vivo experimental studies towards the elucidation of the underlying mechanisms of low-grade glioma genesis.

Supplemental Information

List of genes differentially expressed between the TCGA-LGG and GEO-LGG and the corresponding normal GTEx-derived samples.

DOI: 10.7717/peerj.15096/supp-1

Genes differentially expressed between TCGA-LGG and normal samples.

Volcano plots of detected DEGs generated by edgeR (left), DESeq2 (middle) and limma (right). Vertical dash lines indicate log2(FC) and horizontal dash line indicates –log10(0.05). Red and blue dots correspond to up-regulated and down-regulated genes, respectively, and gray dots represent insignificant genes.

DOI: 10.7717/peerj.15096/supp-2

Genes differentially expressed between GEO-LGG and normal samples. Volcano plots of detected DEGs generated by edgeR (left), DESeq2 (middle) and limma (right).

Vertical dash lines indicate log2(FC) and horizontal dash line indicates –log10(0.05). Red and blue dots correspond to up-regulated and down-regulated genes, respectively, and gray dots represent insignificant genes.

DOI: 10.7717/peerj.15096/supp-3

Scale independence plot and Mean Connectivity plot of TCGA common DEGs used for selecting soft-thresholding power.

DOI: 10.7717/peerj.15096/supp-4

Scale independence plot and Mean Connectivity plot of GEO common DEGs used for selecting soft-thresholding power.

DOI: 10.7717/peerj.15096/supp-5

Heatmap of the TOM matrix of TCGA (left) and GEO (right).

The outer layer is the dendogram tree of each gene. The middle part is the merged modules with their assigned colors. In the inner heatmap, lighter and darker color indicates higher and lower topological overlaps, respectively.

DOI: 10.7717/peerj.15096/supp-6
2 Citations   Views   Downloads