The clinical significance of collagen family gene expression in esophageal squamous cell carcinoma

Background Esophageal squamous cell carcinoma (ESCC) is a subtype of esophageal cancer with high incidence and mortality. Due to the poor 5-year survival rates of patients with ESCC, exploring novel diagnostic markers for early ESCC is emergent. Collagen, the abundant constituent of extracellular matrix, plays a critical role in tumor growth and epithelial-mesenchymal transition. However, the clinical significance of collagen genes in ESCC has been rarely studied. In this work, we systematically analyzed the gene expression of whole collagen family in ESCC, aiming to search for ideal biomarkers. Methods Clinical data and gene expression profiles of ESCC patients were collected from The Cancer Genome Atlas and the gene expression omnibus databases. Bioinformatics methods, including differential expression analysis, survival analysis, gene sets enrichment analysis (GSEA) and co-expression network analysis, were performed to investigate the correlation between the expression patterns of 44 collagen family genes and the development of ESCC. Results A total of 22 genes of collagen family were identified as differentially expressed genes in both the two datasets. Among them, COL1A1, COL10A1 and COL11A1 were particularly up-regulated in ESCC tissues compared to normal controls, while COL4A4, COL6A5 and COL14A1 were notably down-regulated. Besides, patients with low COL6A5 expression or high COL18A1 expression showed poor survival. In addition, a 7-gene prediction model was established based on collagen gene expression to predict patient survival, which had better predictive accuracy than the tumor-node-metastasis staging based model. Finally, GSEA results suggested that collagen genes might be tightly associated with PI3K/Akt/mTOR pathway, p53 pathway, apoptosis, cell cycle, etc. Conclusion Several collagen genes could be potential diagnostic and prognostic biomarkers for ESCC. Moreover, a novel 7-gene prediction model is probably useful for predicting survival outcomes of ESCC patients. These findings may facilitate early detection of ESCC and help improves prognosis of the patients.


INTRODUCTION
Esophageal cancer is the seventh most commonly diagnosed cancer and the sixth leading cause of cancer death (Bray et al., 2018). It is classified into two histological subtypes, esophageal adenocarcinoma and esophageal squamous cell carcinoma (ESCC), the latter of which is the predominant type worldwide (Pennathur et al., 2013). Despite the effective treatments (e.g., surgery, chemotherapy and radiotherapy) for ESCC, the 5-year survival rates of patients with advanced ESCC are still less than 20% (Codipilly et al., 2018). However, the survival rates could be improved to over 80% if patients were diagnosed with an early stage (Lao-Sirieix & Fitzgerald, 2012;Wang et al., 2004). Although a few tumor markers, carcinoembryonic antigen, carbohydrate antigen 19-9, and squamous cell carcinoma antigen, have been used in the diagnosis of ESCC, they are not suitable for early detection due to the lack of sensitivity (Kosugi et al., 2004). Thus, it is urgent to search for novel biomarkers to help early detection of ESCC and improve survival rates of the patients.
Collagen is the most abundant extracellular matrix protein that promotes cell growth and provides mechanical resilience of connective tissues (Sorushanova et al., 2018). The collagen family comprises 28 types with different a chains encoded by more than 40 genes (Ricard-Blum, 2011). It has been reported that the expression of collagen-encoding genes was significantly related to the prognosis of certain types of cancers (Giussani et al., 2018;Liu et al., 2018;Rong et al., 2018;Shen et al., 2016;Zhang et al., 2018c). In addition, a couple of collagen genes, such as COL11A1 and COL6A1, were expressed aberrantly in ESCC tissues and possibly affected the progression of ESCC (Fan et al., 2012;He et al., 2017;Zhang et al., 2018a). However, most of these works focused on specific collagen genes, and the potential roles of other members remain to be clarified.
Here, we provided a systematic analysis of gene expression of the whole collagen family and its corresponding clinical significance in ESCC. Clinical data and gene expression profiles of ESCC patients were extracted from The Cancer Genome Atlas (TCGA) and the gene expression omnibus (GEO), two public databases with substantial information about cancers. Different bioinformatics methods, including differential expression analysis, survival analysis, pathway analysis and co-expression network analysis were used to analyze the data to sift important hits possibly involved in the initiation and development of ESCC. According to collagen family genes, we also established a prediction model with high performance to predict the prognosis of ESCC patients. Collectively, our works mainly explored the relation of collagen gene expression to ESCC and illuminated the potential mechanism.

Patient data
Basic data of ESCC patients were downloaded from the TCGA database (https://portal.gdc.cancer.gov/) and the GSE53625 dataset of the GEO database; 95 cases from TCGA and 179 cases from GSE53625. Univariate and multivariate Cox regression analyses were carried out to investigate the correlation between overall survival and clinicopathological characteristics of the patients by SPSS (v23.0). The relations between collagen family gene expression and clinicopathological characteristics of the patients were examined using Pearson correlation analysis via SPSS.

Differential expression analysis
Gene expression profiles of tumor and adjacent normal tissues in ESCC patients were also obtained from the two datasets. 81 of 95 patient cases in TCGA and all patient cases in GEO had RNA-sequence data. In total, 81 tumor samples with 11 normal controls from TCGA and 179 tumor samples with 179 normal controls from GEO (Li et al., 2014) were included in analysis (each sample was taken from a different patient). Differential expression analysis was conducted using the edgeR (Robinson, McCarthy & Smyth, 2010) and the limma (Ritchie et al., 2015) packages, respectively, for TCGA and GEO data by R software (R Core Team, 2018). Gene expression levels were normalized by the calcNormFactors function in edgeR (Law et al., 2016) and by the normalizeBetweenArrays function in limma (Smyth & Speed, 2003), to make sure the expression distributions of each sample are similar across the entire matrix. Then based on the exact test in edgeR which is analogous to Fisher's exact test (Robinson, McCarthy & Smyth, 2010) and the Empirical Bayes statistical test in limma (Phipson et al., 2016), fold change, P-value and false discovery rate (FDR) (or adjusted P-value) were figured out to show the expression difference between tumor and normal samples. Genes with P < 0.05 and FDR < 0.05 were considered as differentially expressed genes (DEGs). Accordingly, DEGs of collagen family were identified. Then heatmaps, boxplots and Venn diagram were drawn by R software.

Survival analysis
First, hazard ratio (HR) and P-value of each DEG of collagen family were figured out based on gene expression and overall survival of patients by the univariate Cox regression model with the survival package through R software. The HR is an estimate of the ratio of the hazard rate in the treated versus the control group (Spruance et al., 2004), while in this study it is defined as the hazard in the high expression group divided by the hazard in the low expression group. HR > 1 and HR < 1 mean higher expression of the gene is associated with worse and better overall survival, respectively. Survival curves were plotted according to the Kaplan-Meier method and compared by the log-rank test using the survival and the q-value packages in R. P < 0.05 was considered statistically significant.

Prediction models
Prediction models were established to predict patient survival based on gene expression of 22 DEGs of collagen family and overall survival of patients by the multivariate Cox regress analysis with the survival package via R software. Several candidate genes were eventually selected out by the analysis to form the model, with a formula calculating the risk score of each patient. The general formula is given below: where n, Coef and Exp indicate the number of included genes, the coefficient of each gene, and gene expression level, respectively. The coefficients were estimated based on the relative contributions of each collagen gene. A patient's risk score was calculated as the sum of the expression levels of each gene multiplied by its corresponding coefficient. Similar methods have been adopted by earlier studies (Beer et al., 2002;Lossos et al., 2004;Wang et al., 2018). Then, receiver operating characteristic (ROC) curves were plotted based on the risk scores and overall survival of patients by the survivalROC package in R, with area under curve (AUC) values which represented the accuracy of predicting 3-year survival. Also, survival curves were obtained by dividing the patients into high-and low-risk groups according to the median risk score using the survival package.

Pathway analysis
Potential mechanism of collagen family genes was explored by the gene sets enrichment analysis (GSEA), a method to determine whether members of a previously defined gene set are correlated with the phenotypic class distinction (Subramanian et al., 2005). GSEA was conducted using the gene expression profiles of patients' tumor samples via javaGSEA software (http://software.broadinstitute.org/gsea/downloads.jsp), and the patient samples were divided into high-and low-risk groups in half according to the risk scores obtained by the collagen-DEGs-based prediction models (Chai et al., 2018;Zhang et al., 2017;Zhao et al., 2017). Oncogenic Signatures Gene Sets (v6.2), Hallmark Gene Sets (v6.2) and KEGG Gene Sets (v6.2) (http://software.broadinstitute.org/gsea/msigdb/collections.jsp) were, respectively, used as references. Based on these gene sets databases, the expression profiles were analyzed to find out if a set of genes were mostly up-regulated (or down-regulated) in the high-risk group (or low-risk group). Normalized enrichment score reflected the degree to which a gene set was overrepresented in the groups, and gene sets in the results with P < 0.05 and FDR < 0.25 were considered as significant ones (Subramanian et al., 2005).

Co-expression network analysis
Patients' tumor samples from TCGA were separated into high-and low-risk groups by the risk scores calculated by the 7-gene prediction model. Risk-score-based DEGs that were differentially expressed between the two groups were determined using the gene expression profiles of tumor samples by the same method as differential expression analysis. Then the relationships between collagen family genes and the risk-score-based DEGs as well as the representative enriched gene sets from GSEA were assessed by the Weighted Gene Co-Expression Network Analysis (WCGNA) with the WGCNA package through R software, which is a method to describe the correlation patterns among genes across different samples (Langfelder & Horvath, 2008). Genes of each gene set were extracted from http://software.broadinstitute.org/gsea/msigdb/genesets.jsp. Finally, the genes co-expressed with collagen family genes were obtained, and the networks of them were drawn via Cytoscape (http://www.cytoscape.org/, v3.7.1).

Clinicopathological information of the ESCC patients
A total of 95 patient cases in TCGA and 179 cases in GEO were collected and analyzed by univariate and multivariate Cox regression analyses. As a result, poor overall survival was significantly correlated with sex, tumor-node-metastasis (TNM) stage and N stage in TCGA (P = 0.020, P = 0.015, and P = 0.012, respectively) ( Table 1), and was notably associated with age, TNM stage and N stage in GEO (P = 0.021, P < 0.001, and P = 0.030, respectively) ( Table 2). Besides, investigation into the correlation between collagen family gene expression and the clinicopathological characteristics revealed that the expression of several collagen genes was significantly related to advanced TNM stages or tumor grades. (Tables 3 and 4).

Identification of DEGs of collagen family in ESCC tissues
Differential expression analysis showed that more than 2/3 of the 44 collagen family genes were up-regulated in tumor tissues in both TCGA and GEO (Tables S1 and S2)

Survival analysis of collagen family genes in ESCC patients
HRs and P-values of the 22 DEGs were calculated and shown by heatmaps (Figs. 2A and 2B). Among them, HRs of COL6A5 and COL18A1 were the lowest and highest, respectively. Survival curves of the DEGs were plotted according to the Kaplan-Meier method. Consistently, COL6A5 and COL18A1 were the two genes most relevant to the overall survival of ESCC patients. Patients with lower COL6A5 expression exhibited poorer overall survival (P = 0.008 in TCGA, Fig. 2C; P = 0.060 in GEO, Fig. 2D). By contrast, patients with higher COL18A1 expression had worse overall survival (P = 0.393 in TCGA, Fig. 2E; P = 0.009 in GEO, Fig. 2F). These results suggested that COL6A5 and COL18A1 are tightly associated with the prognosis of ESCC.

DEGs-based prediction models to predict the prognosis of ESCC patients
Receiver operating characteristic curves have been extensively used to evaluate the predictive effect of one or more genes. The AUC value represents predictive accuracy and usually makes sense when it exceeds 0.60 (Lüdemann et al., 2006;Metz, 1978;Obuchowski,  Notes: Characteristics with P < 0.3 in the univariate analysis were further screened in the multivariate analysis. HR, hazard ratio; CI, confidence interval; TNM stage, tumor-node-metastasis stage; T stage, stage of tumor invasion; N stage, stage of regional lymph node invasion.

2003)
. ROC curves of COL6A5 and COL18A1 indicated that good predictive performance could only be attained by COL6A5 in TCGA (AUC = 0.679, Fig. S1A), while COL18A1 had no predictive ability (Figs. S1C and S1D), suggesting that a single gene is not suitable for survival prediction of ESCC patients. Therefore, we established multi-gene prediction models based on expression levels of the DEGs to assess the joint effect of selected collagen genes on patient survival. There were seven genes in TCGA and nine genes in    GEO finally included to form the models, respectively, and risk scores of the patients were calculated according to the below formulas: For instance, the positive coefficient for COL1A1 suggests that higher expression of COL1A1 was associated with worse survival. The negative value allocated to COL6A5 means that higher expression of COL6A5 was related to prolonged survival, in agreement with the survival analysis (Fig. 2). Notably, AUCs on the ROC curves of the DEGs-based models in TCGA and GEO reached 0.86 and 0.68, respectively (Figs. 3A and 3C), which were higher than those of the prediction models based on TNM staging in the two datasets with AUCs of 0.625 and 0.646, respectively (Figs. 3E and 3G). The TNM staging system is a generally recognized standard for classifying the spreading extent of cancer (D'Journo, 2018) and is commonly used to predict prognosis of cancer in clinical application. The prediction models, respectively, based on T-stage and N-stage were also examined but the AUCs were all less than 0.6 ( Fig. S2). Furthermore, survival curves showed that patients with high risk were significantly correlated with poor survival (Figs. 3B, 3D, 3F and 3H).
The 7-gene model in TCGA with true positive rate of 86% was more accurate than that of the TNM staging-based model, whereas predictive accuracy of the 9-gene model in GEO exhibited no difference. Therefore, the model in TCGA was used for our further studies. Finally, a heatmap was plotted to show the expression patterns of the seven genes in TCGA between high-risk and low-risk groups (Fig. 3I). The risk score distribution was exhibited in ascending order, and patients were divided into high-and low-risk groups by the median point (Fig. 3J). Overall, it can be seen that patients with high risk score had higher mortality rates and shorter survival time than those with low risk score (Fig. 3K). Taken together, above results indicated that the 7-gene model could be more accurate to predict patient survival.

Pathway analysis of collagen family genes
Gene sets enrichment analysis results showed that most of the gene sets were up-regulated in the high-risk group, and the top 20 enriched gene sets were given in Tables S3-S8. The gene sets that were closely associated with tumorigenesis were shown in Fig. 4

Co-expression network analysis
WGCNA was performed to find out the genes that were co-expressed with collagen family genes in ESCC tissues. Risk-score-based DEGs that were differentially expressed between high-and low-risk groups were determined and presented by the volcano plot (Fig. S3). The co-expression network of collagen genes and the risk-score-based DEGs were given in several modules (Fig. 5). Collagen family genes were displayed as red nodes, and the genes included in the 7-gene prediction model in TCGA were marked as bigger red nodes. The blue nodes represented the co-expressed genes. Another network was drawn to show the association between collagen family genes and seven representative enriched gene sets (PDGF, RB/p107, PI3K/Akt/mTOR pathway, p53 pathway, oxidative phosphorylation, apoptosis and cell cycle) from the GSEA results (Fig. S4). The red nodes were the collagen family genes with close connections to those gene sets. A big blue circle represented a gene set and the blue nodes were genes included in each set. Genes closer to the center were more tightly associated with the collagen genes.

DISCUSSION
Although extensive research efforts have been focused on this field in past decades, efficient detection methods for early ESCC and accurate prediction against complicated Recently, studies have found that the expression of certain genes, such as MCT4, ZNF750, Gli1, etc. was highly related to the occurrence and development of ESCC, and they might be applied as ideal biomarkers for ESCC (Cheng et al., 2018;Nambara et al., 2017;Yang et al., 2017;Zhang et al., 2018b). In addition, the aberrant expression of a few collagen family genes has also been reported to be significantly associated with the prognosis of ESCC patients. However, most works only focused on single or limited genes, and the predictive ability was barely satisfactory. Herein, we provided a more systematic analysis of the whole collagen family gene expression to evaluate the potential roles and clinical significance of collagen genes in ESCC.
We found that most of the collagen genes were up-regulated in ESCC tissues when compared to normal controls, half of which were identified as DEGs (Figs. 1A and 1B). Among them, the expression of COL1A1, COL10A1 and COL11A1 was particularly higher, and that of COL4A4, COL6A5 and COL14A1 was especially lower in tumor tissues, indicating their possible roles as diagnostic markers for ESCC. Consistently, several studies have shown that COL1A1, COL10A1 and COL11A1 were notably overexpressed in ESCC compared to normal tissues (Fang et al., 2019;He et al., 2017;Karagoz et al., 2016;Senthebane et al., 2018;Zhang et al., 2018a). Also, COL4A4 was also found to be down-regulated in esophageal tumor tissues (Chattopadhyay et al., 2009). Additionally, among the DEGs, COL7A1 was observed to be up-regulated in ESCC tissues (Kita et al., 2009  In our works, COL6A5, COL14A1 and some other collagen genes were reported to be significantly up-or down-regulated in ESCC tissues for the first time. In the survival analysis, COL6A5 and COL18A1 were validated to be significantly related to overall survival of ESCC patients. Previous studies demonstrated that the COL6A5 expression was significantly associated with depressed behavior and atopic dermatitis (Söderhäll et al., 2007;Zhan et al., 2017), but no articles manifested its correlation with cancer. In addition, COL18A1 has been proved to be a promising biomarker for ovarian cancer and was possibly involved in the progression of bladder cancer (Fang et al., 2013;Peters et al., 2005). In this study, ESCC patients with low COL6A5 expression or high COL18A1 expression showed poor overall survival (Figs. 2C-2F), implying the expression of COL6A5 or COL18A1 as a potential indicator for the prognosis of ESCC patients. Moreover, the variations that affect the expression of COL6A5 and COL18A1 possibly have effects on the progression of ESCC. Activating COL6A5 or inhibiting COL18A1 might improve the therapeutic efficiency and the life-span of ESCC patients.
Because the expression of one gene is usually influenced by various factors, ideal effect may not be attained by using a single gene as a predictor. Indeed, COL6A5 achieved an AUC value over 0.60 only in TCGA (Fig. S1), making the requirement of another more powerful prediction method. Based on the selected collagen DEGs (7 genes in TCGA and 9 genes in GEO, both including COL6A5), we established two new prediction models. Importantly, such DEGs-based models exhibited better predictive ability than conventional prognostic models according to TNM staging. The 7-gene model in TCGA had especially higher predictive accuracy of 86%. One possible reason was that the RNA sequencing technology applied to TCGA was more accurate than the gene chip technology used in GEO. In summary, this 7-gene prediction model is greatly promising to predict the prognosis of ESCC patients and help determine next therapeutic regimens.
Furthermore, GSEA was used to identify significantly enriched gene sets and potentially relevant pathways (Fig. 4). The results showed that based on Oncogenic Signatures Gene Sets, gene sets of PDGF, RB/P107 and AKT/MTOR were significantly enriched in the high-risk group. It has been reported that PDGF receptor-beta increased the expression of COL1A2 through Akt/mTORC1 signaling pathway (Das et al., 2017). According to Oncogenic Signatures Gene Sets and Hallmark Gene Sets, the high-risk group was significantly related to p53 and p53 pathway, which suggested that collagen genes might be highly associated with the p53 or its related pathway in ESCC. Earlier studies proved that enhanced expression of ectopic p53 in dermal fibroblasts inhibited basal and TGF-beta-stimulated collagen gene expression, and the absence of cellular p53 was correlated with increased transcriptional activity of the Type I collagen gene (COL1A2) and collagen synthesis (Ghosh, Bhattacharyya & Varga, 2004). Moreover, the type IV collagen expression was inversely related to p53 in malignant tumors (Bar et al., 2004). Oxidative phosphorylation related genes were found to be up-regulated in the high-risk group by both Hallmark Gene Sets and KEGG Gene Sets. Indeed, some reports demonstrated that oxidative phosphorylation signature occurred when collagen density was decreased, and the change of collagen density microenvironment regulated the metabolism of cancer cells (Mah et al., 2018;Morris et al., 2016). As for apoptosis, an earlier study has shown that Type IV collagen could stimulate cancer cell proliferation, migration and inhibit apoptosis (Öhlund et al., 2013). Additionally, the gene sets of mitotic spindle, G2/M checkpoint and cell cycle were enriched in the high-risk group as well, implying that collagen might regulate the cell cycle of ESCC cells. Furthermore, it was indicated that the high-risk group was markedly associated with renal cell carcinoma, bladder cancer and small cell lung cancer. These results were consistent with previous studies that collagen gene expression was correlated with the poor prognosis of those cancers (Koskimaki et al., 2010;Wan et al., 2015;Xu et al., 2017;Zeng et al., 2018).
As shown by the co-expression network (Fig. 5), a few collagen family genes such as COL1A1, COL11A1, COL6A6 and COL19A1, were co-expressed with NETO1, NEUROD2 and NRG3, which are the genes involved in neural functions. These findings could be verified by earlier articles to some extent (McCarthy & Hay, 1991;Perris et al., 1993aPerris et al., , 1993b. COL11A1 was also observed to be co-expressed with tumor suppressor candidate 7 (TUSC7), further validating the possible role of COL11A1 in the occurrence of ESCC. Beyond that, some potassium channel related genes (KCNA2, KCNE1B, KCNH1, KCNJ4 and KCNK4) were co-expressed with collagen genes in a way, revealing that collagen genes might be correlated with the regulation of potassium channels in ESCC. As for the two potential prognostic biomarkers, COL18A1 only showed close relations with collagen family members, while COL6A5 was associated with two other genes in this network, ROBO2 and MIR548A3. ROBO2 has been identified as a candidate tumor suppressor (Trifonov et al., 2013), and the alteration of its expression might play a role in malignant tumors of digestive tract including gastric and colorectal cancers (Je et al., 2013).
Apart from what is aforementioned, there are still some limitations of this research. For instance, the prediction model was comprised of several genes, making it difficult to conduct cellular experiments by targeting a single gene to confirm its predictive effect. Aside from it, the characteristics of patient samples, as well as the methodology utilized in TCGA, were somewhat different from that in GEO, which may explain the different results coming from the two datasets. For example, TCGA uses the RNA sequence technology while GEO applies the gene chip technology to detect gene expression of patient tissues. Besides, TCGA mainly collected data from white people, whereas the majority of patients in GEO (GSE53625) were Asian. Therefore, there was no a single gene that exhibited significant P-values in both datasets in the survival analysis, and the selected genes driving the prediction model in one dataset were not completely identical to those in another dataset. Further validation of these outcomes requires more clinical information and biological experiments in the future.

CONCLUSIONS
In summary, this study identified 22 collagen family genes that were significantly expressed higher or lower in ESCC compared to normal tissues. Among them, COL1A1, COL10A1, COL11A1, COL4A4, COL6A5 and COL14A1 were the most distinct ones and possessed the potential in ESCC diagnosis. Besides, COL6A5 and COL18A1 showed strong correlations with overall survival of ESCC patients and might be robust prognostic biomarkers for ESCC. Furthermore, we established a 7-gene prediction model with high performance to predict the prognosis of ESCC patients. In terms of the underlying mechanism, collagen genes might be associated with PI3K/Akt/mTOR pathway, p53 pathway, oxidative phosphorylation, apoptosis and cell cycle during the progression of ESCC. Our works may further benefit the diagnosis, prognosis and treatments for ESCC patients.
Shouxia Xie conceived and designed the experiments, authored or reviewed drafts of the paper, approved the final draft. Shaoxiang Wang conceived and designed the experiments, contributed reagents/ materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

Data Availability
The following information was supplied regarding data availability: Raw data was downloaded from public databases including TCGA and GEO (GSE53625). The raw data from TCGA is available as Supplemental Files.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj.7705#supplemental-information.