Identification of hub genes and small molecule therapeutic drugs related to breast cancer with comprehensive bioinformatics analysis

Breast cancer is one of the most common malignant tumors among women worldwide and has a high morbidity and mortality. This research aimed to identify hub genes and small molecule drugs for breast cancer by integrated bioinformatics analysis. After downloading multiple gene expression datasets from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) database, 283 overlapping differentially expressed genes (DEGs) significantly enriched in different cancer-related functions and pathways were obtained using LIMMA, VennDiagram and ClusterProfiler packages of R. We then analyzed the topology of protein–protein interaction (PPI) network with overlapping DEGs and further obtained six hub genes (RRM2, CDC20, CCNB2, BUB1B, CDK1, and CCNA2) from the network via STRING and Cytoscape. Subsequently, we conducted genes expression verification, genetic alterations evaluation, immune infiltration prediction, clinicopathological parameters analysis, identification of transcriptional and post-transcriptional regulatory molecules, and survival analysis for these hub genes. Meanwhile, 29 possible drug candidates (e.g., Cladribine, Gallium nitrate, Alvocidib, 1β-hydroxyalantolactone, Berberine hydrochloride, Nitidine chloride) were identified from the DGIdb database and the GSE85871 dataset. In addition, some transcription factors and miRNAs (e.g., E2F1, PTTG1, TP53, ZBTB16, hsa-miR-130a-3p, hsa-miR-204-5p) targeting hub genes were identified as key regulators in the progression of breast cancer. In conclusion, our study identified six hub genes and 29 potential drug candidates for breast cancer. These findings may advance understanding regarding the diagnosis, prognosis and treatment of breast cancer.


INTRODUCTION
Among cancers affecting females, breast cancer has a particularly high incidence, recurrence and mortality rate. Although encouraging progress has been made in the early diagnosis and systemic treatment over several decades, the overall 5-year survival rate for patients with breast cancer is still low, and the incidence rate continues to increase annually How to cite this article Hao M, Liu W, Ding C, Peng X, Zhang Y, Chen H, Dong L, Liu X, Zhao Y, Chen X, Khatoon S, Zheng Y. 2020. Identification of hub genes and small molecule therapeutic drugs related to breast cancer with comprehensive bioinformatics analysis.

Download of datasets and identification of DEGs
Five independent gene expression profiles (GSE3744, GSE21422, GSE42568, GSE61304, and GSE65194) based on GPL570 Platform (Affymetrix Human Genome U133 Plus 2.0 Array) were downloaded from GEO database to identify DEGs. After normalization between arrays, we investigated DEGs among each dataset with the threshold of |log2FoldChange (log2FC)|> 1 and adj.P.Val < 0.05 using LIMMA package of R (Ritchie et al., 2015). For validation, GEPIA2 online tool (http://gepia2.cancer-pku.cn/#index) was used to analyze the differential expression of TCGA Breast invasive carcinoma (BRCA) RNA-seq dataset composed of 1,085 tumor samples and 112 normal samples according to the cut-off standard (|log2FC|>1 and q-value <0.05). Lastly, the common results of TCGA BRCA dataset and GEO datasets were selected as the overlapping DEGs of breast cancer, which could reduce the influences resulted from the heterogeneity of the different datasets. Venn diagram and volcano plot were drawn by VennDiagram and other packages of R. Table 1 listed the details of datasets.

Protein-protein interaction (PPI) network integration and hub genes screening
STRING (version 11.0; https://string-db.org/) is a biological database designed to analyze functional interactions between proteins (Szklarczyk et al., 2019). In this study, we used STRING to construct a PPI network with overlapping DEGs under the premise of an Interaction score of 0.7. Then, we utilized the cytoHubba plug-in of Cytoscape (version 3.7.2) which provided the calculated results by maximal clique centrality (MCC), maximum neighborhood component (MNC) and Degree methods to identify hub genes from the PPI network (Chin et al., 2014).

Verification of hub genes
The Oncomine database (https://www.oncomine.org) was used to verify the mRNA expression of hub genes with the threshold of P <0.05 and fold change >2. Next, the Human Protein Atlas database (HPA, https://www.proteinatlas.org) was used to validate the protein expression of genes by immunohistochemistry data. BC-GenExMiner (http://bcgenex.centregauducheau.fr/BC-GEM/GEM-Accueil.php) is a statistical mining tool that contains published breast cancer transcription data (10,716 DNA microarray samples and 4,712 RNA-seq samples) (Jézéquel et al., 2012). We performed correlation analysis between hub genes in breast cancer using BC-GenExMiner and GEPIA2 online tool.

GO functional and KEGG pathway enrichment analysis
ClusterProfiler package of R can automatically classify biological terms and gene clusters (Yu et al., 2012). To elucidate the biological characteristics of breast cancer-related genes, we performed GO functional and KEGG pathway enrichment analysis by ClusterProfiler with p-value <0.05 and q-value <0.05.

Analysis of genetic alterations of hub genes
The cBioPortal for Cancer Genomics (cBioPortal; http://cbioportal.org) provides online resources for the exploration, visualization and analysis of multidimensional cancer genomics data (Gao et al., 2013). In this study, 6618 breast cancer samples from 13 related reports in cBioPortal were used as research materials to explore genetic alterations connected with the selected hub genes. Afterward, we utilized COSMIC (https://cancer.sanger.ac.uk/cosmic), the most comprehensive resource for studying somatic mutation information in human cancer, to analyze hub genes alterations in breast cancer (Forbes et al., 2010).

Evaluation of clinicopathological characteristics and immune infiltration
We used BC-GenExMiner to analyze the correlations between hub genes expression and clinicopathological variables such as Oestrogen receptor status (ER), Progesterone receptor status (PR), HER2 receptor status (HER2), Nodal status (N), Scarff Bloom & Richardson grade status (SBR), Nottingham Prognostic Index status (NPI), Age status, P53 status, Basal-like and Triple negative breast cancer (TNBC) subtypes. P < 0.05 was considered to be statistically significant.
TIMER (https://cistrome.shinyapps.io/timer/) is a comprehensive resource for systematic analysis of tumor-infiltrating immune cells across 32 different cancers from TCGA database (Li et al., 2017). In this experiment, we estimated the associations between hub genes expression and immune cell populations (B Cell, CD8+ T Cell, CD4+ T Cell, Macrophage, Neutrophil, and Dendritic Cell) in breast cancer using TIMER.

Survival analysis
In this section, we applied the Kaplan-Meier Plotter (http://kmplot.com/analysis/) to evaluate prognostic information of previously identified hub genes and important reporter regulatory molecules (Nagy et al., 2018). The expression values of these genes were split into either high (expression value ≥ median) or low (expression value <median). Hazard ratio (HR) was calculated to evaluate the association between genes expression and survival, and p < 0.05 was considered statistically significant.

Small molecule drugs analysis
The DGIdb online tool (http://www.dgidb.org/)-an available resource containing drug-gene interaction information from more than 30 databases-was used to screen antineoplastic drugs targeting hub genes (Cotto et al., 2018). We also downloaded the GSE85871 dataset, which is a gene expression data of MCF7 cells treated with 102 traditional Chinese medicine (TCM) ingredients recorded in the GEO database. LIMMA package of R was used to analyze the differential expression genes in each TCM ingredient treatment group compared with the untreated group (adj .p. val < 0.05). Then, we evaluated the reversal effects of each TCM ingredient on overlapping DEGs induced by breast cancer.

Dataset processing and DEGs acquisition
After data normalization, the black lines of gene expression box plots of all samples in each single dataset were almost on the same level, which was an important marker to predict the reliability and accuracy of the experimental results (Fig. S1). Next, DEGs (1,738 in GSE3744, 2,430 in GSE21422, 3,116 in GSE42568, 990 in GSE61304 and 4,181 in GSE65194) were identified (Figs. 2A-2E; Tables S1-S5), and 302 integrated DEGs (110 upand 192 down-regulated) from 5 GEO datasets were found (Figs. 2F and 2G). Similarly, we identified 3,559 DEGs from TCGA BRCA dataset with the cut-off criteria of |Log2FC| > 1 and q-value < 0.05 (Table S6). Then, 283 overlapping DEGs which might play promoting or inhibitory roles in breast cancer progression were confirmed from the analysis results of GEO datasets and TCGA BRCA dataset (Figs. 2H and 2I).

PPI network construction and hub genes filtering
The PPI network around proteins encoded by 283 overlapping DEGs was constructed using STRING (Fig. S2A, Table S7). We found that 165 of the 283 overlapping DEGs were related to each other and were visualized using Cytoscape -165 nodes and 1,861 edges were included in PPI network and six hub genes (RRM2, CDC20, CCNB2, BUB1B, CDK1, and CCNA2) based on MCC, MNC and Degree methods were identified (Fig. S2B). Notably, these hub genes were all up-regulated in overlapping DEGs (Tables S1-S6) and might play important roles in the pathogenesis of breast cancer.

Verification of hub genes
Based on the large-scale breast cancer-related data in the Oncomine database, we confirmed that hub genes were significantly up-regulated in multiple cancer types, including Breast Cancer, Brain and CNS Cancer, Lymphoma, Lung Cancer, and so on (Fig. 3A). Also, immunohistochemistry staining data obtained from the HPA database demonstrated the up-regulated expression of proteins encoded by RRM2, CDC20, CCNB2, CDK1 and CCNA2 (Figs. 3B-3K). However, we did not find the association between BUB1B and breast cancer in HPA database. According to the current analysis, we predicted that BUB1B might also be associated with breast cancer, but experimental data were needed to confirm this specific connection. Meanwhile, CDK1, CCNA2, CCNB2, BUB1B and CDC20 were classified as cancer-related genes, and RRM2 was an FDA approved drug target. The data of BC-GenExMiner and GEPIA2 both confirmed a powerful correlation among hub genes (Fig. S3), suggesting that these genes might be the functional partners in breast cancer.

GO function enrichment and KEGG pathway analysis
The GO function annotations of overlapping DEGs were mainly classified into biological processes (BP), cell component (CC) and molecular function (MF). As for BP, up-regulated overlapping DEGs were significantly related to mitotic nuclear division, organelle fission,    and regulation of chromosome segregation, which was consistent with the biological characteristics of the abnormally rapid proliferation of breast cancer cells. And the down-regulated genes were closely related to regulation of cellular response to growth factor stimulus, temperature homeostasis, retinoid metabolic process and regulation of vasculature development. Within CC, the up-regulated genes were remarkably correlated to spindle, chromosome, centromeric region, condensed chromosome kinetochore, and midbody, whereas the down-regulated genes were related to collagen-containing   extracellular matrix and sarcolemma. MF analysis displayed the up-regulated genes were involved in microtubule binding, microtubule motor activity, and cyclin-dependent protein serine/threonine kinase regulator activity, whereas the down-regulated genes were mainly enriched in glycosaminoglycan binding, heparin binding, extracellular matrix structural constituent and growth factor binding (Figs. 4A and 4B; Tables S8 and S9). KEGG pathway analysis showed the up-and down-regulated overlapping DEGs were all significantly attached to Cell cycle, Oocyte meiosis, Tyrosine metabolism, ECM-receptor interaction, Progesterone-mediated oocyte maturation, p53 signaling pathway, PPAR signaling pathway and Phenylalanine metabolism (Fig. 4C). Furthermore, the related pathways of hub genes included Cell cycle, Oocyte meiosis, Progesterone-mediated oocyte maturation, and p53 signaling pathway (Fig. 4C). Table S10 presented the detailed results of KEGG enrichment analysis.

Clinicopathological characteristics and immune infiltration
We investigated the relevance of six hub genes and clinicopathological features using BC-GenExMiner (Figs. S5-S7). Data analysis showed that higher expression of hub genes was found in higher NPI and SBR grade (p < 0.001). And the expression of these hub genes was significantly higher in ER-, PR-, HER2+, Nodal+, P53-mutated, Basal-like and TNBC clinical subtypes of breast cancer. Surprisingly, significantly increased expression of hub genes was found in patients not more than 51 years old (p < 0.001). Moreover, our current results demonstrated that hub genes were correlated with 6 types of immune cell infiltrates (B Cell, CD8+ T Cell, CD4+ T Cell, Macrophage, Neutrophil and Dendritic Cell) with various degrees based on the TIMER database (Fig. S8).

Prediction of TF-hub genes and miRNAs-hub genes interaction
We screened the potential regulatory relationships between TFs and hub genes via TRRUST database to further study the functional roles of hub genes. A total of 19 associations between 17 TFs and six hub genes were shown (Fig. 5A). E2F1 could activate the transcriptional process of RRM2 and CDK1. In contrast, TP53 inhibited the expression of CDK1 and CCNB2. Noticeably, PTTG1 gene, the transcriptional promoter of CDK1, was up-regulated in breast cancer; ZBTB16 gene was the transcriptional suppressor of CCNA2, and its expression was down-regulated in overlapping DEGs. Additionally, we obtained 273 targeted miRNAs with regulatory effects on hub genes using ENCORI online tool, and identified 103 DEmiRNAs (58 up-and 45 down-regulated; Table S11) from GSE97811 dataset with |log2FC|>1 and adj.P.Val < 0.05. As a result, 41 associations between 30 miRNAs and 4 hub genes (RRM2, CDK1, CCNA2, and BUB1B) were found from targeted miRNAs analysis and DEmiRNAs data (Fig. 5B). The top 2 hub genes with the most miRNAs targets were RRM2 and CCNA2. In addition, hsa-miR-340-5p, hsa-miR-130a-3p, hsa-miR-200b-3p, hsa-miR-200c-3p, hsa-miR-204-5p, hsa-miR-219a-5p, hsa-miR-27a-3p, hsa-miR-27b-3p, hsa-miR-301a-3p and hsa-miR-429 were the top 10 miRNAs with the most target genes. Unfortunately, we only used TRRUST and ENCORI databases, as well as limited samples of GSE97811 dataset to analyze TFs and miRNAs targeting hub genes, which may potentially limit the completeness of this study.

Small molecule drugs analysis
Regarding six hub genes as potential therapeutic targets for breast cancer, we identified several antineoplastic drugs based on the DGIdb database. At present, only RRM2, CDK1 and CCNA2 were identified as tumor therapeutic targets. Therefore, we speculated the other three genes (CCNB2, BUB1B, CDC20) might be the novel targets in the future. Statistical analysis revealed that 21 candidate drugs such as Cladribine, Gallium nitrate, Dinaciclib, Alvocidib and Suramin targeted RRM2, CDK1 and CCNA2 (  experimental data are needed to further confirm the potential of these drug candidates in the treatment of breast cancer. In addition, we identified 8 TCM ingredients (1β-hydroxyalantolactone, Andrographolide, Berberine hydrochloride, Britanin, Hyodeoxycholic acid, Japonicone A, Nitidine chloride and Tanshinone IIA) that reversed breast cancer-induced gene expression from GSE85871 dataset. Japonicone A reversed the expression of 87 overlapping DEGs, including six hub genes, and its potential therapeutic effects on breast cancer were related to Cell cycle, Oocyte meiosis, p53 signaling pathway, Progesterone-mediated oocyte maturation, Viral carcinogenesis, HTLV-I infection, Pathways in cancer and TGF-beta signaling pathway (Fig. 7A). Also, Nitidine chloride, Berberine hydrochloride, 1β-hydroxyalantolactone, Britanin and Tanshinone IIA reversed the expression of 37, 35, 31, 30 and 19 overlapping DEGs, respectively, and the potential therapeutic effects of these ingredients on breast cancer were related to biological pathways such as Cell cycle, Oocyte meiosis, p53 signaling pathway and Progesterone-mediated oocyte maturation (Figs. 7B-7F). Details of TCM ingredients were shown in Table S12.

DISCUSSION
Although the detection and treatment of breast cancer have improved, it is still one of the most prevalent malignant tumors with the highest increase in prevalence among women (Ghoncheh, Pournamdar & Salehiniya, 2016). The diagnosis, treatment and prognosis of breast cancer have always been concerned with the world. Gene expression profiles are widely used to explore the molecular mechanisms related to tumorigenesis, which have provided valuable reference and information for clinical applications (Mohr et al., 2002).
In this study, we identified 283 overlapping DEGs (105 up-and 178 down-regulated) and six hub genes (RRM2, CDC20, CCNB2, BUB1B, CDK1, CCNA2) associated with breast cancer tumorigenesis and progression based on multiple datasets. By integrating the Oncomine, HPA, GEPIA2 and BC-GenExMiner databases, we confirmed that hub genes were over-expressed at mRNA and protein levels in breast cancer tissues compared with normal and non-cancerous tissues, and there was a powerful correlation between these genes, suggesting that hub genes were potential functional partners closely related to breast cancer. Furthermore, higher expression of hub genes was found in ER-, PR-, HER2+, Nodal+, Basal-like, P53-mutated and TNBC clinical subtypes of breast cancer, and there was a higher hub genes expression in patients not more than 51 years old. We also discovered that the over-expression of each hub gene was associated with poor OS, RFS and DMFS among patients with breast cancer, suggesting that these genes might be potential prognostic biomarkers and promote the progression of breast cancer. Further, we found the expression of hub genes was significantly correlated with immune cell infiltrates and purity, implicating that these hub genes played important roles in manipulating breast cancer immune microenvironment.
KEGG enrichment analysis showed that overlapping DEGs including 6 hub genes were significantly associated with Cell cycle and Oocyte meiosis biological pathways, and overlapping DEGs were also involved in Tyrosine metabolism, ECM-receptor interaction, Progesterone-mediated oocyte maturation and PPAR signaling pathways. To our knowledge, abnormal regulations of cell cycle and cell growth were the major causes of tumorigenesis (Zhuang et al., 2015). In the current study, we found that two important pathways related to cell growth and apoptosis-Cell cycle and Oocyte meiosiswere dysregulated in breast cancer. In addition, ECM-receptor interaction pathway played important roles in tumor shedding, adhesion, degradation, movement and hyperplasia. So far, some studies have shown that ECM receptor interaction pathway was closely related to breast cancer, colorectal cancer, head and neck cancer and other human tumors (Bao et al., 2019;Islam et al., 2018;Rahman et al., 2019). Gasco, Shami & Crook (2002) concluded that molecular pathological analysis of specific components of p53 signaling pathway may be helpful for the diagnosis and prognosis of breast cancer. In addition, it has been found that signal transduction pathways such as Tyrosine metabolism, Progesteronemediated oocyte maturation and PPAR signaling pathway may also be associated with the occurrence of human cancers (Chen et al., 2012;Liu & Ye, 2017;Pietras et al., 1995). These pathways provided insights into the molecular mechanisms of breast cancer initiation and development.
Accumulating studies have demonstrated that CDK1, CCNB2, CCNA2, CDC20 and BUB1B, as genes related to cell cycle, are involved in the occurrence and development of tumors. CDK1, also known as CDC2, plays an important role in the precise cell division (Kang et al., 2014). Inhibiting the expression of CDK1 can suppress tumor cells growth and induce apoptosis in TNBC clinical subtype of breast cancer (Liu et al., 2014). In addition, high expression of CDK1 led to worse 5-year RFS in breast cancer patients (Kim et al., 2008), similar to the experimental results (Fig. 6K). As an important component in cell cycle regulation, CCNB2 seems to functions as the oncogene and independent prognostic factor for survival in patients with breast cancer (Shubbar et al. 2013). Tang et al. (2018) proved that there was a significant correlation between CCNB2 and molecular subtypes of breast cancer (Fig. S6A). CCNA2, a key regulator of cell cycle, could promote the transformation and progression of cancer (He et al., 2017). Gao et al. (2014) found the over-expression of CCNA2 in breast cancer was related to the unfavorable prognosis , similar to our current findings (Fig. 6). We understand that there have been previous reports that have associated CDC20 over-expression with tumor progression and poor prognosis of breast cancer, indicating that CDC20 may be a useful marker for monitoring breast cancer progression (Sewart & Hauf, 2017). Similarly, BUB1B played a pivotal role in the proliferation and progression of many tumors (Takagi et al., 2013). RRM2, a breast cancer hub gene participated in phenylalanine metabolic pathway in our study, was closely linked to tumor growth, invasion, angiogenesis, tumor metastasis and other cellular functions, as well as the prognosis of breast cancer patients (Bell, Barraclough & Vasieva, 2017;Chen et al., 2019). It has been previously reported that RRM2 expression was associated with the resistance of tumorigenic breast cancer cells to chemotherapy (Shah et al., 2015). Taken together, our findings were consistent with other previous studies that six hub genes may serve as predictive biomarkers for diagnosis and prognosis of patients with breast cancer.
Breast cancer also shows changes in the expressions of noncoding RNAs (e.g., miRNAs, additional elements). Several studies have previously proven that miRNAs played vital roles in many biological processes, including cell growth, differentiation, metabolism, apoptosis and signal transduction. For instance, (Ma (2019) demonstrated that miR-219-5p inhibited the cell proliferation and cell cycle distribution of ESCC cells by inhibiting the expression of CCNA2, highlighting the role of miR-219-5p and CCNA2 in cell cycle and tumor growth. Liang et al. (2019) confirmed the over-expression of miR-204-5p not only inhibited the high expression of RRM2 in breast cancer cells but also inhibited cell migration and invasion of breast cancer. The above conclusions further demonstrated the miRNAs identified in this study were promising biomarkers for breast cancer. Recent publications described that additional elements including REP522, D20S16, HERVKC4-INT and HERV1_LTRc were also abnormally expressed in ER+/HER2-breast cancer, showing that it is necessary to further strengthen the study of these additional elements related to cancer (Karakülah et al., 2019;Yandım & Karakülah, 2019).
Next, we further identified 21 anti-tumor drugs (e.g., Cladribine, Gallium nitrate, Dinaciclib, Alvocidib, Suramin) targeting RRM2, CDK1 and CCNA2 (Table 2). Nevertheless, whether these drugs could exert therapeutic effects on breast cancer by inhibiting the over-expression of RRM2, CDK1 and CCNA2, or whether CCNB2, BUB1B and CDC20 are promising therapeutic targets still need to be supported by further research. In addition, we identified 8 TCM ingredients (1β-hydroxyalantolactone, Andrographolide, Berberine hydrochloride, Britanin, Hyodeoxycholic acid, Japonicone A, Nitidine chloride and Tanshinone IIA) that reversed breast cancer-induced overlapping DEGs expression. Previous limited studies have reported that some of TCM ingredients mentioned above affected the cell cycle of tumor cells (Du et al., 2015;Li et al., 2020a;Lu, 2009;Pan et al., 2011;Wang et al., 2016;Zou et al., 2017), similar to our current study that these ingredients inhibited proliferation as well as promoted apoptosis of breast cancer cells through more than one biological pathway (Fig. 7).

CONCLUSIONS
In the current study, we obtained 283 overlapping DEGs and six hub genes (RRM2, CDC20, CCNB2, BUB1B, CDK1, and CCNA2) related to the occurrence and development of breast cancer via comprehensive bioinformatics analysis. Furthermore, we considered the associations between hub genes expression and clinicopathological factors (e.g., age, subtypes) of patients with breast cancer. The study also established TFs-hub genes and