Breast invasive ductal carcinoma (IDC) is the most common malignant tumor in females worldwide (Harbeck et al., 2019; Hanker, Sudhan & Arteaga, 2020). In 2018, there were more than 266,000 cases of breast IDC among females in the United States, and this cancer accounted for 30% of malignant tumors in females, far more than lung cancer (13%) (Bray et al., 2018; Ahmad, 2019). The prognosis of women with breast IDC is related to the activation or silencing of various biological functions in tumor tissues and signaling pathways. There are prognostic models based on tumor immunity and autophagy, but no models have exclusively focused on IDC (Li et al., 2020a; Li et al., 2020b; Zhang, Zhang & Yu, 2019; Hu et al., 2020) and few models examined genes that function in basic metabolism.
Glycolysis is a series of reactions that catabolize most carbohydrates. “Metabolic reprogramming” is the hallmark of tumors, and glycolysis is the main source of energy for tumor cells, even when there is insufficient oxygen (Wu et al., 2020). Moreover, activation of glycolysis-related genes occurs in almost all tumor cells. For example, Dai et al. (2020) found that glycolysis promoted the progression of pancreatic cancer and induced gemcitabine chemotherapy resistance. Long noncoding RNAs (lncRNAs) that interact with Long Intergenic Noncoding RNA for IGF2BP2 Stability (LINRIS) activate aerobic glycolysis in tumor cells, and can affect the development and prognosis of rectal cancer (Wang et al., 2019a; Wang et al., 2019b).
Research on the function of glycolysis-related genes in breast tumors showed that hexokinase (HK2) had high expression in breast IDC cells, and that HK2 silencing inhibited IDC (Patra et al., 2013; Cao et al., 2020). Other research showed that overexpression of 6-Phosphofructo 2-kinase/fructose 2, 6-bisphosphatase 3 (PFKFB3) promoted the progression of breast IDC, and had negative associations with progression-free survival (PFS), distant metastasis-free survival (DMFS), and overall survival (OS) in patients with breast IDC (O’Neal et al., 2016; Peng et al., 2018a; Peng et al., 2018b). Thus, glycolysis-related genes have a potentially significant impact on the progression of breast tumors and on the survival and prognosis of patients with breast tumors.
Our aim was to develop a prognostic model of breast IDC based on glycolysis-related genes and determine the potential functions of glycolysis-related genes in the progression of breast IDC.
Materials and Methods
Data resources and preprocessing
The transcriptome data and corresponding clinical data of breast invasive ductal carcinoma were downloaded from the TCGA database (https://www.cancer.gov/) (Tomczak, Czerwinska & Wiznerowicz, 2015). The data set of glycolysis-related genes was obtained from the MSigDB database (http://www.hmdb.ca). Using | log2FC | > 0.5 and false discovery rate (FDR) <0.05 as the cut-off value, the data was normalized with the “edgeR” package from R, and then the differential analysis was performed to obtain the differential expression of glycolysis-related genes (DEGRGs) between the tumor tissue and normal tissue.
Construction of risk-scoring model
Based on the above DEGRGs, univariate Cox regression and LASSO regression were used to screen out prognostic-related glycolysis genes. The risk score was evaluated by the coefficient of each prognostic-related glycolysis gene. The risk scoring formula was constructed as , where i is the number of genes used to build the model, coefi is the coefficients of the genes in the model, and expri the expression of genes in the model. Taking the median risk score as the cut-off point, patients were divided into high-risk and low-risk groups. The survival outcome of the two groups was observed by Kaplan–Meier survival analysis. Receiver operating characteristic (ROC) curve was applied to calculate the area under the curve (AUC) to evaluate the predictive ability of the risk-scoring model. Independent GEO (https://www.ncbi.nlm.nih.gov/geo/) data sets are used to verify the above results (Barrett et al., 2013). Univariate Cox regression and multivariate Cox regression were used to identified the independent prognostic factors among risk scores, age, tumor TNM stage, and whether triple negative breast cancer (TNBC) or not. Through clinical survival analysis, the predictive ability of the risk-scoring model in different clinical subgroups was clarified.
External validation of the risk scoring model
The TCGA results were validated using the GEO (https://www.ncbi.nlm.nih.gov/geo/) dataset (GSE131769). For this validation, the outcomes of the two groups were compared using Kaplan–Meier survival analysis. ROC curves were used to calculate AUCs and evaluate the predictive performance of the risk-scoring model.
Construction of the nomogram
A nomogram was constructed based on the results of the multivariate Cox regression, with clinical information such as age, TMN stage, TNBC status, and DEGRG risk scores, to predict 3-year and 5-year OS. The predictive ability of the nomogram was evaluated by calculating the C-index and the calibration chart, clinical decision curve analysis, and an ROC curve.
Function enrichment analysis
We analyzed the genes that were differentially expressed in the high-risk and low-risk groups using the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) to identify pathway enrichment (Kanehisa et al., 2019). This analysis allowed identification of the functions of the differentially expressed genes. We then examined whether the differentially expressed genes were involved in the development of breast cancer.
Construction of interactive network diagram
To determine the relationship of the DEGRGs model with prognosis, a network between genes was developed. The prognosis-related genes were imported into Search Tool for the Retrieval of Interacting Genes Proteins (STRING) to construct an interaction network (Szklarczyk et al., 2019).
DEGRGs in breast IDC
We obtained gene expression data and clinical data of females with IDC of the breast (772 tumor tissues, 88 adjacent normal tissues) from the TCGA database and the glycolysis gene set (286 genes) from the GSEA website. Based on standard cut-off values for fold-change in gene expression (|log2(FC) |> 0.5) and false discovery rate (FDR < 0.05), the IDC tissues had 185 DEGRGs, with 67 down-regulated genes and 118 up-regulated genes (Figs. 1A, 1B, Table S1).
Relationship of DEGRGs with prognosis and risk-scoring model
We then identified patients with follow-up times greater than 30 days, and performed univariate Cox regression and LASSO regression to screen for DEGRGs that were related to prognosis (Figs. 2A, 2B, 2C). This analysis indicated that 13 DEGRGs were closely related to prognosis.
We constructed a risk-scoring model based on multivariate Cox regression and divided patients into high-risk and low-risk groups based on median risk score. Kaplan–Meier survival analysis showed that patients with high-risk had significantly reduced duration of OS (P = 9.795 ×10−8, Fig. 3A). ROC analysis indicated the AUC was 0.755 for 3-year OS and 0.726 for 5-year OS (Fig. 3B). The risk curve and scatterplot (Figs. 3C, 3D) show the risk scores and survival status of all patients, and indicate that the risk coefficient and mortality rate were greater in the high-risk group. We plotted a “heat map” to visualize the expression of the 13 DEGRGs in the high-risk and low-risk groups (Fig. 3E). Taken together, these results confirm that 13 DEGRGs were significant prognostic indicators for patients with IDC of the breast.
We performed univariate and multivariate Cox regression to evaluate the effect of risk score, age, triple-negative breast cancer (TNBC), and TNM stage on patient prognosis (Figs. 4A, 4B). The results confirmed that the risk score was an independent prognostic factor for patients with IDC of the breast (adjusted hazard ratio: 2.71, 95% CI [1.87–3.94]).
We also performed survival analysis of different subgroups based on TNM status (Fig. 5). This analysis indicated that the risk-scoring model had good predictive value in the T1-2 subgroup, T3-4 subgroup, N0 subgroup, N1-3 subgroup, M0 subgroup, TNBC subgroup, and non-TNBC subgroup (all P < 0.001), but not in the M1 subgroup (P = 0.857).
External validation of the risk scoring model
We verified the model using the GEO dataset (GSE131769). Kaplan–Meier survival curves showed that patients with high-risk had a significantly shorter duration of OS (P = 3.245 ×10−3, Fig. 6A). ROC analysis indicated that the AUC was 0.731 for 3-year OS and was 0.728 for 5-year OS (Fig. 6B). These results confirm the validity of our risk scoring model.
Construction of the prediction model
Based on the results of the multivariate Cox regression, we developed a nomogram based on age, TMN stage, risk score, and TNBC status to predict 3-year and 5-year OS (Fig. 7A). We then used the C-index, clinical decision curve, calibration chart, and ROC curve to evaluate the predictive performance of the nomogram (Figs. 7B–7F). The results indicated that the prognostic model had good prediction accuracy, with a C-index of 0.824, an AUC for 3-year OS of 0.842, and an AUC for 5-year OS of 0.808. These results verified the predictive ability of the nomogram.
Gene function enrichment analysis
To explore the potential mechanisms of prognosis-related DEGRGs in breast invasive ductal carcinoma, we performed KEGG enrichment analysis and GO enrichment analysis. KEGG pathway enrichment analysis showed that the gene set was enriched in the cell cycle, DNA replication, glycolysis and gluconeogenesis, RNA degradation, arachidonic acid metabolism, cytokine-cytokine receptor interactions, cytosolic DNA sensing, and ribosome function (Figs. 8A–8H). GO enrichment analysis showed that the gene set was enriched in the cell cycle G1-S phase transition, DNA geometric changes, the meiotic cell cycle process, regulation of cellular response to heat, cytokine-mediated signaling pathways, regulation of homotypic cell–cell adhesion, regulation of production of molecular mediators of immune responses, and T cell differentiation (Figs. 8I–8P).
Construction of an interactive network diagram
We constructed a network diagram to visualize the interactions between hub genes to better understand the potential functions of the different DEGRGs on prognosis. The results from STRING showed there were 6 interacting hub genes: P4HA2, P4HA1, PGK1, G6PD, HK3, and PMM2 (Fig. 9).
Breast IDC is the most common pathological type of breast tumor, and morbidity and mortality from this cancer continue to increase (Harbeck et al., 2019; Badve & Gokmen-Polar, 2019). There is evidence that changes in glycolysis-related genes have multiple effects on the prognosis of these patients (Li et al., 2020a; Li et al., 2020b; Chen et al., 2019). In particular, tumor cells reprogram the glycolysis pathway to accommodate the increased energy required for malignant transformations, including invasion and metastasis (Shen et al., 2020; Abbaszadeh, Cesmeli & Biray, 2020). Given the importance of glycolysis on tumor prognosis, numerous research groups have developed models based on glycolysis-related genes in their studies of different cancers (Zhang, Zhang & Yu, 2019; Wang et al., 2019a; Wang et al., 2019b; Karasinska et al., 2020). However, no previous study developed a prognostic model for patients with breast IDC based on glycolysis-related genes.
In this research, we identified 13 DEGRGs that were related to prognosis in patients with breast IDC, and then established a risk-scoring model. The results showed that the OS of patients in the high-risk group was significantly shorter than that of patients in the low-risk group. We also examined the impact of patient age, TMN stage, TNBC status, and risk score to construct a nomogram that predicts the 3-year and 5-year OS of these patients. Our application of various evaluation methods indicated that the nomogram had good performance in the prediction of OS. Our ROC analysis showed that the AUC was 0.842 for 3-year OS and 0.808 for 5-year OS, higher than the AUC values reported in previous models (Lin et al., 2020; Xie et al., 2019), thus indicating that our model had better predictive ability.
Among the 13 DEGRGs we used to construct the risk model, increased levels of P4H2A, NUP155, ALDH3B1, SDC1, G6PD, COPB2, B3GNT3, PMM2, and PGK1 were associated with poor prognosis and increased levels of HK3, AGRN, P4HA1, and ISG20 were associated with favorable prognosis (Table S2). Previous research reported increased levels of P4HA2 and P4HA1 (the two isomers of collagen prolyl 4-hydroxylase) in several types of human cancers and that both enzymes promoted glycolysis in tumor cells (Li et al., 2019). PGK1 is the first key enzyme to produce ATP in the glycolytic pathway, PGK1 is not only a metabolic enzyme but also a protein kinase, which mediates the tumor growth, migration and invasion through phosphorylation some important substrates (Fu & Yu, 2020). There is also evidence that SDC1 can promote the migration of breast cancer cells across the blood–brain barrier by regulating the expression of cytokines, thus promoting brain metastasis (Sayyad et al., 2019). Mele et al. found that the overexpression of G6PD can induce lapatinib resistance in breast cancer and also found a significant correlation between high expression of G6PD and tumor recurrence (Mele et al., 2019). Sauter et al. found that the level of HK3 in the nipple aspirate of patients with breast cancer was significantly lower than that of healthy women, and considered an elevated HK3 level as a possible sign of early breast cancer (Mannello & Gazzanelli, 2001). Thus, these previous studies are consistent with our finding that glycolysis-related genes were closely related to the occurrence and development of breast cancer and the prognosis of patients.
To further characterize the potential roles of the 13 DEGRGs in our risk model, we performed GO and KEGG enrichment analysis. The results showed that the gene set was enriched in cell cycle, DNA replication, glycolysis gluconeogenesis, RNA degradation, arachidonic acid metabolism, cytokine cytokine receptor interaction, cytosolic DNA sensing pathway and ribosome. Previous studies showed that under hypoxic conditions, metabolic reprogramming of breast tumor stem cells helped to maintain their growth and proliferation (Peng et al., 2018a; Peng et al., 2018b). Breast cancer occurs in patients with impaired immune function, in which cytokines function as growth signals for tumor cells. Many studies showed that interactions between the immune system and cancer cells, which are mediated by cytokines and chemokines, affect the initiation and progression of breast cancer and the response to treatment (Lim et al., 2018; King, Mir & Singh, 2017; Fabre et al., 2018). DNA replication is a fundamental biological process, and replication disorders can lead to genomic instability, an important feature of breast cancer. Many experimental and clinical studies have identified disorders of DNA replication during the development and progression of breast cancer (Kitao et al., 2018). Aerobic glycolysis pathway includes hexokinase, phosphofructokinase (PFK), and other genes (Wu et al., 2020). These previous findings therefore support our conclusion that the 13 glycolysis-related genes identified here play an important role in the progression of breast tumors.
Our study has some limitations. Firstly, the predictive model lacks information about patient treatments, thus limiting its predictive performance. Secondly, this study was based on bioinformatics analysis, and further studies are needed to determine the potential functional mechanisms.
In conclusion, our prediction model, which is based on 13 DEGRGs and the clinical characteristics of patients, can reliably predict the OS of patients with breast IDC. These 13 DEGRGs and several related miRNAs thus appear to play an important role in the development and progression of breast IDC.
Differentially expressed DEGRGs between IDC and normal tissue
Based on standard cut-off values for fold-change in gene expression (|log2(FC)| > 0.5) and false discovery rate (FDR < 0.05), the IDC tissues had 185 DEGRGs, with 67 down-regulated genes and 118 up-regulated genes.