Identifying multiple collagen gene family members as potential gastric cancer biomarkers using integrated bioinformatics analysis

Background Gastric cancer is one of the most common malignant cancers worldwide. Despite substantial developments in therapeutic strategies, the five-year survival rate remains low. Therefore, novel biomarkers and therapeutic targets involved in the progression of gastric tumors need to be identified. Methods We obtained the mRNA microarray datasets GSE65801, GSE54129 and GSE79973 from the Gene Expression Omnibus database to acquire differentially expressed genes (DEGs). We used the Database for Annotation, Visualization, and Integrated Discovery (DAVID) to analyze DEG pathways and functions, and the Search Tool for the Retrieval of Interacting Genes (STRING) and Cytoscape to obtain the protein–protein interaction (PPI) network. Next, we validated the hub gene expression levels using the Oncomine database and Gene Expression Profiling Interactive Analysis (GEPIA), and conducted stage expression and survival analysis. Results From the three microarray datasets, we identified nine major hub genes: COL1A1, COL1A2, COL3A1, COL5A2, COL4A1, FN1, COL5A1, COL4A2, and COL6A3. Conclusion Our study identified COL1A1 and COL1A2 as potential gastric cancer prognostic biomarkers.


INTRODUCTION
Gastric cancer (GC) is the fifth most common malignant cancer and the third leading cause of cancer-related mortality worldwide (Bray et al., 2018). In 2018, there were more than 1,000,000 new cases of GC and approximately 783,000 deaths (Bray et al., 2018;Siegel, Miller & Jemal, 2015). GC poses a great threat to public health, particularly in East Asia where the incidence has increased remarkably. Over the last decade, considerable progress has been made with finding and applying GC biomarkers in clinical diagnosis and treatment. For example, HER2, a member of the human EGFR family, was recognized as the most significant GC biomarker. GC's HER2 overexpression rate reported across the literature fluctuates between 9% and 38% (Gravalos & Jimeno, 2008;Okines et al., 2013). Trastuzumab, a HER2-targeting drug beneficial for HER2-positive GC patients, is the only targeted drug currently approved for advanced GC treatment (Gomez-Martín et al., 2014). However, we still do not fully understand HER2's role in gastric carcinogenesis. Programmed death ligand 1 (PD-L1) is overexpressed in approximately 40% of GC cases, designating it as a GC biomarker (Raufi & Klempner, 2015). PD-L1 and programmed cell death protein 1 (PD-1) affect immune tolerance. Tumors evade immune surveillance through the PD-1 pathway. The anti-PD-1 monoclonal antibody Pembrolizumab has shown clinical efficacy in GC patients with high PD-1 expression (Fife & Pauken, 2011). PD-1 pathway-blocking GC treatments and the potential biomarkers MET and E-cadherin (Durães et al., 2014;Ferreira et al., 2005) deserve further study. It is important to explore more clinically valuable GC biomarkers and therapeutic targets.
Microarray technology and bioinformatics analysis have recently become popular tools in cancer research and are used to identify differentially expressed genes (DEGs). These tools can also identify underlying biomarkers and therapeutic targets and their roles in biological processes, molecular functions, and different pathways.
In order to avoid potential false positives from using one single microarray, we screened three mRNA public datasets in our study to obtain DEGs between GC tissues and adjacent noncancerous tissue samples. Additionally, we carried out Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and protein-protein interaction (PPI) network analyses to show the molecular pathogenesis underlying carcinogenesis. Overall, we identified 159 DEGs and nine hub genes as potential GC biomarkers.

Obtaining microarray data
We downloaded three gene expression profiles (GSE65801, GSE54129, and GSE79973) from the Gene Expression Omnibus (GEO) dataset, an open data storage platform. GSE65801's microarray dataset consisted of 32 GC tissue samples and 32 paired noncancerous tissue samples (Li et al., 2015). GSE54129's dataset consisted of 111 GC tissues and 21 normal tissue samples. GSE79973's gene expression profile consisted of 10 GC samples and 10 normal adjacent samples (He et al., 2016).

Identifying DEGs
We utilized an online tool called GEO2R (https://www.ncbi.nlm.nih.gov/geo/geo2r/) to calculate the DEGs between GC tissues and normal samples (Barrett et al., 2013). If one gene had more than one probe set or if one probe set did not have the corresponding gene symbols, we averaged or removed them, respectively. We set the cut-off criteria as: |log 2 FC|>1.5 and adj. p-value <0.05 (fold change (FC) = GC tissue sample expression/adjacent noncancerous sample expression).

Functional DEG annotation using KEGG and GO analyses
GO enrichment analysis and KEGG pathway enrichment analyses were conducted using the Database for Annotation, Visualization, and Integrated Discovery (DAVID, version 6.8), which provides functional annotations for DEGs (Huang et al., 2007;Kanehisa, 2002). We identified promising signaling pathways and functional annotations related to the DEGs. P < 0.05 was considered statistically significant.

PPI network construction and module analysis
We used the Search Tool for the Retrieval of Interacting Genes (STRING) database to construct the PPI network, and applied Cytoscape to visualize the network (Szklarczyk et al., 2015). We set the cut-off criterion as confidence score >0.4. Next, we utilized the Molecular Complex Detection (MCODE) tool to identify the significant PPI network module with a node score cutoff = 0.2, a degree cutoff = 10, a maximum depth = 100, and a k-core = 2 (Bader & Hogue, 2003). We then used DAVID to perform the functional and pathway enrichment analyses for the significant module. We chose hub genes and constructed a co-expression network of significant genes using cBioPortal (http://www.cbioportal.org) (Cerami et al., 2012).

Hub gene validation and analysis
We used the online database Oncomine (http://www.oncomine.org), which integrates numerous published microarray data, to validate the expression levels of the top nine GC DEGs. We then reverified the expression of the nine selected genes using Gene Expression Profiling Interactive Analysis (GEPIA, http://gepia.cancer-pku.cn), a new interactive online tool. Additionally, we continued to explore the differences in gene expression across pathological stages. Overall survival analysis of the nine hub genes was performed using the Kaplan-Meier plotter (http://kmplot.com/analysis/) (Cerami et al., 2012).

Prediction and enrichment analysis of microRNAs related to hub genes
We used Targetscan (http://www.targetscan.org), an online database that reveals potential relationships between genes and microRNAs, to predict the microRNAs associated with hub genes. Then, we performed enrichment analysis of the predicted microRNAs using DNA Intelligent Analysis (DIANA-miRPath v3.0).

Screening differentially expressed genes in GC
Our analyses of GSE65801, GSE54129, and GSE79973 identified 1248, 1665, and 791 DEGs, respectively. By intersecting the three GEO datasets, we also obtained 159 overlapping genes: 105 up-regulated genes and 54 down-regulated genes ( Table 1). The DEGs are shown in volcano plots and a Venn diagram in Fig. 1.

Functionally annotating DEGs with GO and KEGG analyses
We utilized the DAVID database to identify the 159 genes' potential biological functions through GO and KEGG pathway enrichment analyses. In regards to biological processes (BP), our results showed significantly enriched variations in cell adhesion, extracellular matrix organization, oxidation-reduction processes, skeletal system development, collagen catabolic processes, proteolysis, collagen fibril organization, xenobiotic metabolic processes, digestion, and ion transmembrane transport. In terms of molecular functions (MF), our results showed close correlations with calcium ion binding, extracellular matrix structural constituents, oxidoreductase activity, heparin binding, integrin binding, protease binding, collagen binding, platelet-derived growth factor binding, serine-type endopeptidase inhibitor activity, and aldo-keto reductase (NADP) activity. Regarding cellular components (CC), the genes mainly interacted with extracellular components, such as the extracellular exosome, extracellular region, extracellular space, extracellular matrix, and proteinaceous extracellular matrix. KEGG pathway enrichment analysis showed the pathways and functions closely associated with metabolism-associated signaling, such as the PI3K-Akt signaling pathway, protein digestion and absorption, gastric acid secretion, and focal adhesion (Fig. 2).

PPI analysis and significant module identification
We conducted PPI analysis of the DEGs using the STRING database to identify the hub genes and to show their interactions in GC development. The PPI network included 89 nodes and 252 edges. We further identified the candidate hub genes by calculating the PPI network degree and set the cut-off criteria at degree ≥13. The top 10 candidate hub nodes were COL1A1, COL1A2, COL3A1, COL5A2, COL4A1, FN1, MMP9, COL5A1, COL4A2, and COL6A3. Additionally, we performed module analysis to identify the most significant module with the highest score. The identified module contained nine candidate hub genes except MMP9, indicating that it may perform critical PPI network biological functions. We suggest that the nine candidate genes are hub genes of the PPI network (Fig. 3).
We performed GO analysis on the significant module in the PPI network, and the results showed that the top module was highly involved in the collagen catabolic process, extracellular matrix organization, extracellular matrix structure, collagen fibril organization, platelet-derived growth factor binding, SMAD binding, endoplasmic reticulum lumen, extracellular matrix, and collagen trimer. KEGG pathway analysis showed that the genes in the hub module were closely connected with protein digestion and absorption, ECM-receptor interaction, and amoebiasis (Table 2). Subsequently, we utilized the cBioPortal online platform to construct the co-expression network of the nine hub genes (Fig. 4) and analyze their biological characteristics (Fig. 5).

Hub gene validation and analysis
The Oncomine database collects great quantities of tumor gene expression data, while GEPIA utilizes cancer sample sequencing expression data from the Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) projects. We selected datasets related to GC to verify their hub gene expression. Based on the Oncomine and GEPIA analysis results between cancer and normal tissue, we further verified that the hub gene expression rose significantly across the different GC datasets (Figs. 6 and 7). We also analyzed the expression levels of selected genes in different GC stages. COL1A1, COL1A2, COL3A1, COL5A1, COL5A2, and COL6A3 showed notable differences across different GC stages.  COL4A1, FN1, and COL4A2 showed no clear differences across various stages (Fig. 8). We performed overall survival analysis of the nine hub genes using the Kaplan-Meier plotter and the results showed a close correlation with survival time. Figure 9 shows the remarkable difference in overall survival between the low-and high-expression groups. GC patients with high COL1A1, COL1A2, COL3A1, COL5A2, COL4A1, FN1, COL5A1, COL4A2, and COL6A3 expression levels showed worse overall survival (Fig. 9).

Prediction and enrichment analysis of hub gene-related miRNAs
We predicted the miRNAs associated with the hub genes' mechanisms and regulatory network (Table 3) and conducted enrichment analysis (Fig. 10). GO analysis showed that the miRNAs were significantly enriched in the toll-like receptor TLR1:TLR2 signaling pathway, neurotrophin TRK receptor signaling pathway, and Fc-epsilon receptor signaling pathway. KEGG pathway enrichment analysis showed that they were mostly enriched in the prolactin signaling pathway, Ras signaling pathway, Hippo signaling pathway, and MAPK signaling pathway.

Figure 6 Heat map of differential expression between clinical GC samples and normal samples in the Oncomine dataset.
The overexpression (red) or underexpression (blue) of target genes in eight validation datasets. In each dataset, all genes were sequenced from high to low according to their expression differences between tumor and normal tissues, and then the target gene sequencing percentiles were analyzed. Cell color was determined by the gene rank percentile for the dataset analyses (the more overexpressed the gene, the redder the dataset color, and the more underexpressed genes were blue). 1. Diffuse gastric adenocarcinoma vs. normal (Chen et al.,, 2003). 2. Gastric intestinal type adenocarcinoma vs. normal (Chen et al.,, 2003). 3. Gastric mixed adenocarcinoma vs normal (Chen et al.,, 2003). 4. Diffuse gastric adenocarcinoma vs. normal (Cho et al.,, 2011). 5. Gastric intestinal type adenocarcinoma vs. normal (Cho et al.,, 2011). 6. GC vs. normal (Cui et al.,, 2011). 7. Gastric intestinal type adenocarcinoma vs. normal (DÉrrico et al.,, 2009). 8. GC vs. normal (Wang et al.,, 2012).

DISCUSSION
The study of a cancer's molecular mechanism guides its classification, treatment, and the progress of its targeted immunotherapy. Large-scale research and clinical trials have provided individualized GC treatment possibilities. Despite substantial progress in understanding the underlying molecular mechanism and implementing new therapeutic strategies, the five-year survival rate remains low. The tumorigenesis mechanism remains poorly understood. Therefore, it is crucial to identify novel biomarkers and therapeutic targets involved in GC tumor progression. Recent studies achieved preliminary results by screening biomarkers of different pathological GC types (Durães et al., 2014;Gravalos & Jimeno, 2008;Rong et al., 2018). In our study, we screened multiple datasets to find more GC biomarker candidates and prove their prognostic value. From three microarray datasets, we identified nine major hub genes: COL1A1, COL1A2, COL3A1, COL5A2, COL4A1, FN1, COL5A1, COL4A2, and COL6A3. We found that eight of the nine hub genes came from the collagen gene family, which participates in the formation of collagen in extracellular matrix proteins. The collagen family consists of 28 members numbered with Roman numerals (Ricard-Blum, 2011). Previous studies have found that abnormal collagen gene expression is usually related to connective tissue disease or osteoporosis (Wu, Yu & Zhou, 2017;Yamaji, 2017). Type I collagen is the most abundant protein in connective tissue and its increased expression is closely related to fibrotic lesions (Yamaji, 2017). Recent evidence has shown that COL1A1 and COL1A2's mRNA and protein levels are commonly overexpressed in GC patients (Li, Ding & Li, 2016). Moreover, high COL1A1 and COL1A2 expression may predict poor clinical outcomes for GC patients (Rong et al., 2018). COL3A1 overexpression has been confirmed in multiple cancers, such as bladder cancer, while the impact of COL3A1 expression level in GC is not completely understood (Gao et al., 2016;Liu et al., 2018;Yuan et al., 2017). Lower COL5A2 expression indicates better overall survival in bladder cancer patients, suggesting that it is a prognostic biomarker (Li et al., 2017;Zeng et al., 2018).
Although bioinformatics analysis has suggested that COL5A2 is a candidate GC biomarker, its precise regulatory mechanism is still unclear (Wang, 2017). Previous studies suggested that COL4A1 was upregulated in multiple malignancies including GC, and that elevated COL4A1 expression might confer trastuzumab resistance in GC patients (Huang et al., 2018;Miyake et al., 2017). COL4A1's expression level and mechanism requires further study. There have also been few reports on FN1, COL5A1, COL4A2, and COL6A3 genes. Previous studies on COL1A1, COL1A2, COL4A1, and COL5A2 have confirmed the expression dysregulation of these biomarkers (Huang et al., 2018;Li, Ding & Li, 2016;Zeng et al., 2018). In our study, we identified potential new biomarkers COL3A1, COL5A1, COL4A2, and COL6A3. Prior research on collagen genes mainly focused on connective tissue disease, muscle and ligament-related diseases, or the association between hub genes and tumors (Huang et al., 2018;Li, Ding & Li, 2016;Yuan et al., 2017;Zeng et al., 2018). More attention should be given to the collagen gene family's prognostic and therapeutic value in tumorigenesis, particularly in GC. The collagen gene family may play a role in the proliferation, invasion, and metastasis of GC cells. Enrichment analysis and miRNA prediction results provide the potential mechanisms for the collagen gene family's involvement in GC development (Figs. 2 and 10, Table 3). Previous studies have confirmed these mechanisms (Ao et al., 2018). Silencing COL1A2 and COL6A3 can inhibit the proliferation, migration, and invasion of GC cells, and can promote cell apoptosis through the PI3k-Akt signaling pathway (Ao et al., 2018). MiR-129-5p and let-7i miRNA are reported to participate in GC development via COL1A1 expression (Shi et al., 2019;Wang & Yu, 2018). The key genes' involvement in GC is currently poorly understood, but our enrichment analysis results provided potential pathways that can be validated by further experiments.
It is worth noting that intratumor stroma proportions have been proposed as significant indicators of GC prognosis (Lee et al., 2017). Previous studies have suggested that a high matrix proportion in GC patients means a poor prognosis (Ham, Lee & Hur, 2019;Kemi, Eskuri & Kauppila, 2019;Lee et al., 2017). Collagen genes play a crucial role in cell matrix formation, and their abnormal expression may lead to changes in matrix proportions. Our study found that a high key collagen gene expression indicated a poor GC prognosis. This partly confirms the above view from a bioinformatics perspective. However, the regulatory mechanism of key collagen genes in GC remains unclear. The identified enrichment pathway and microRNAs may help clarify the mechanism, and can be further applied in clinical prognosis evaluation and treatment.
The current study is only a preliminary report. We cannot ignore the possibility of heterogeneous results in integrated bioinformatics analysis due to sample source and quantity limitations. Although the nine hub genes showed remarkable clinical potential in our survival analysis, further basic and clinical trial studies are needed.

CONCLUSION
In summary, we identified DEGs that may be involved in GC initiation or progression. We identified a total of 159 DEGs and nine hub genes as potential prognostic GC biomarkers, and validated them using preliminary survival analysis. Our study provides new therapeutic targets for GC treatment and suggests data mining and integration as promising tools in malignant tumor biomarker detection. Since tumor biomarkers need to be validated by clinical data, further experiments should be conducted to confirm our conclusions.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
The authors received no funding for this work.