Bioinformatics analysis of microarray data to identify the candidate biomarkers of lung adenocarcinoma

Background Lung adenocarcinoma (LUAD) is the major subtype of lung cancer and the most lethal malignant disease worldwide. However, the molecular mechanisms underlying LUAD are not fully understood. Methods Four datasets (GSE118370, GSE85841, GSE43458 and GSE32863) were obtained from the gene expression omnibus (GEO). Identification of differentially expressed genes (DEGs) and functional enrichment analysis were performed using the limma and clusterProfiler packages, respectively. A protein–protein interaction (PPI) network was constructed via Search Tool for the Retrieval of Interacting Genes (STRING) database, and the module analysis was performed by Cytoscape. Then, overall survival analysis was performed using the Kaplan–Meier curve, and prognostic candidate biomarkers were further analyzed using the Oncomine database. Results Totally, 349 DEGs were identified, including 275 downregulated and 74 upregulated genes which were significantly enriched in the biological process of extracellular structure organization, leukocyte migration and response to peptide. The mainly enriched pathways were complement and coagulation cascades, malaria and prion diseases. By extracting key modules from the PPI network, 11 hub genes were screened out. Survival analysis showed that except VSIG4, other hub genes may be involved in the development of LUAD, in which MYH10, METTL7A, FCER1G and TMOD1 have not been reported previously to correlated with LUAD. Briefly, novel hub genes identified in this study will help to deepen our understanding of the molecular mechanisms of LUAD carcinogenesis and progression, and to discover candidate targets for early detection and treatment of LUAD.


INTRODUCTION
Lung cancer remains the one of leading healthy issues worldwide, with as estimated 2.1 million new cases and 1.8 million deaths in 2018 (Bray et al., 2018). It has been ranked the first and second cancer morbidity of male and female in China, respectively, and has the highest mortality rate (Sun et al., 2018). Lung adenocarcinoma (LUAD) is the most common subtype of lung cancer (Maemura et al., 2018;Walters et al., 2013). More than 60% of LUAD patients were observed harboring targetable gene alterations, which leading to remarkable responses in treating with tyrosine kinase inhibitors, and associated with improved survival rate (Kris et al., 2014). Despite the substantial advance in combined therapies, the prognosis of LUAD is still dismal, the 5-year survival rate is not over 20% (Chen et al., 2014b;Ettinger et al., 2013). Lacking sensitive and specific early biomarkers, a high possibility of drug resistance and metastasis is considered to contribute the high mortality of this disease. Therefore, there has a pressing need for identifying the more sensitive and specific biomarkers or drug targets of LUAD for developing effective diagnosis and treatment strategies.
Microarray technology provides an all-in-one system biology solution from hardware to software systems. It can simultaneously scan the hybridization signals of tens of thousands of gene probes in the chip and carry out quantitative analysis on the transcriptome profile of samples. Recent advances especially in the algorithms of probe signal detection and analysis, such as the introduction of artificial intelligence technologies, will make the results of microarray more accurate and reliable (Gan et al., 2019a(Gan et al., , 2019bPeng, 2006). The microarray technique also provides a powerful tool for exploring the gene regulation pattern and molecular mechanisms involved in oncogenesis and progression of LUAD. Recently, different types of biomarkers including coding genes, miRNAs, long non-coding RNAs and circRNAs have been identified in lung cancer. Dysregulation of these molecules is involved in the tumor progression or is associated with the prognosis of patients (Di et al., 2019;Vargas & Harris, 2016;Vencken, Greene & McKiernan, 2015;Wei & Zhou, 2016). In view of the complexity of the molecular regulatory network of LUAD, current studies on tumor biomarkers are not sufficient. Therefore, it is still necessary to identify novel prognostic biomarkers, which will help us develop more sensitive and effective diagnostic and therapeutic strategies. However, limited sample size and significant variability among different projects make it hard to obtain credible results. In this study, four microarray datasets containing mRNA expression data between LUAD and non-cancerous tissues were downloaded from Gene Expression Omnibus (GEO) and the differentially expressed genes (DEGs) were screened out. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and protein-protein interaction (PPI) network analyses were performed to explore the key modules and hub genes involved in LUAD progression. In sum, 349 DEGs and 10 hub genes were screened out, which may be candidate biomarkers for LUAD.

Data download and pre-processing
Four datasets (GSE118370, GSE32863, GSE85841 and GSE43458) which contain the gene expression data of LUAD and normal tissues, were downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/) by getGEO function in package GEOquery (Davis & Meltzer, 2007). The detail information of GEO datasets was listed in Table 1. The raw expression files of four microarray datasets were pre-processed according to the method described previously with minor modifications (Giulietti et al., 2016). Briefly, the CEL format files were input and background correction and normalization were conducted using the Robust multichip average function implemented in affy package in R environment (Bolstad et al., 2003;Irizarry et al., 2003). Next, the array probes were converted into matched gene symbols according to annotation information. In case of multiple probes corresponding to a single gene, the value of gene expression was designated as the mean of the probes. Then, the batch effects among different platforms were removed by ComBat function of the sva package (Leek et al., 2012). Finally, the normalized microarray-based data of four datasets were merged into a single global dataset which contained a total of 12,926 common genes in all four GEO datasets.

Identification of DEGs
Identifying DEGs in different disease states and investigating their functions and interactions may help to unravel potential regulatory mechanisms for disease occurrence and progression. In the present study, the DEGs between LUAD and normal tissues were identified by limma package (Ritchie et al., 2015). The Benjamini-Hochberg procedure was introduced to reduce the false positive rate (FDR) in multiple comparisons (Benjamini & Hochberg, 1995). Genes with |log2 Fold Change| 1 and FDR < 0.05 were considered as DEGs.

GO and KEGG enrichment analysis
Gene Ontology and KEGG analyses were conducted using enrichGO and enrichKEGG functions of clusterProfiler package, respectively (Yu et al., 2012). p.adjust (FDR) < 0.05 was considered to be statistically significant.

Construction of PPI network and module analysis
The network of proteins interaction provides valuable clues for understanding the molecular mechanisms underlying the progress of carcinoma. The PPI network was constructed by Search Tool for the Retrieval of Interacting Genes (STRING) (https://string-db.org/) with interaction score of 0.9 as the threshold. Subsequently, the candidate modules were detected by Cytoscape plugin molecular complex detection (MCODE) with default parameters: degree cut-off ¼ 2, node score cut-off ¼ 0.2, k-core ¼ 2, and max depth ¼ 100.

Hub gene analysis
The seed genes in modules referred to hub genes. The overall survival analyses were performed using online tool Kaplan-Meier Plotter (http://kmplot.com/) (Gyorffy et al., 2013). The logrank P < 0.05 was considered statistically significant. The association of expression level of hub genes with clinical traits were analysis using the Oncomine database (Rhodes et al., 2007).

Data preprocessing and DEG screening
Four GEO datasets were downloaded, pre-processed and merged into a global dataset which contained 112 LUAD and 102 normal samples. Totally, 349 DEGs were identified by limma package (Ritchie et al., 2015), including 74 up-regulated genes and 275 down-regulated genes. The most statistically significant up-regulated and down-regulated genes are listed in Table 2. The distribution of DEGs was presented by volcano plot (Fig. 1).

Construction of PPI network and module analysis
Protein-protein interaction network reflect the spatiotemporal relationship of macromolecules within the cell which will provide valuable information about molecular mechanisms in physiological and pathological process. To explore the molecular mechanisms underlying LUAD progression, the online STRING database was applied to construct the PPI network. The interaction score of 0.9 (highest confidence) was set as threshold, and nodes without connections were removed from network. Finally, the PPI network consisted of 349 nodes with 277 edges, and average local clustering coefficient was 0.337 (PPI enrichment P-value < 1.0E-16) (Fig. 3A). Then, the key modules were identified via MCODE plugin. A total of 11 functional clusters of modules and related hub genes were detected. The top three significant modules were presented in Figs. 3C-3E. The KEGG analysis of module genes revealed that the top three modules were mainly associated with the chemokine signaling pathway (hsa04062), complement and coagulation cascades (hsa04610), human cytomegalovirus infection (hsa05163) and vascular smooth muscle contraction (hsa04270) (Fig. 3B).

Hub genes analysis
A total of 11 genes were identified as hub genes. The overall survival analysis of the hub genes was performed using Kaplan-Meier curve. Except VSIG4, LUAD patients with downregulated ADAMTS8, AOX1, EFEMP1, METTL7A, MYH10, PTGER4, TMOD1, CDH13 and upregulated PRC1 showed worse overall survival (Figs. 4A-4I). It is worth noting that FCER1G is downregulated in LUAD patients, but the low expression level is associated with better overall survival (HR ¼ 1.87) (Fig. 4J). Subsequently, the expression status of hub genes with HR < 0.5 or HR > 2 were further validated using the Oncomine database. The result showed that ADAMTS8, METTL7A and MYH10 were significantly downregulated and PRC1 was markedly overexpressed in LUAD in the different datasets (Fig. 5). In the Okayama Lung dataset, the alternation of ADAMTS8, METTL7A, MYH10 and PRC1 were associated with tumor grade (Fig. 6), implicating vital roles of these genes in the carcinogenesis or progression of LUAD.

DISCUSSION
In this study, four GEO datasets were analyzed and 349 DEGs were identified, including 74 up-regulated and 275 down-regulated genes. The KEGG analysis revealed the top three The confusing experimental results suggest that the complement system may be involved in complex tumor regulatory processes by reshaping tumor microenvironment, which is worthy of further study. In recent years, malaria and prion disease, which had not previously attracted much attention, have also been found to be associated with tumors. Epidemiological study has shown that the incidence of malaria is negatively correlated with the mortality of colorectal cancer, breast cancer and lung cancer (Qin et al., 2017). The proposed anti-tumor mechanisms included systemly stimulating the immune responses (Chen et al., 2011) and inhibiting key pathways in tumor progress (Deng et al., 2018). Previously, prion protein (PrPc) was thought to play a role only in the central nervous system. However, accumulating evidence shows that PrPc has wider biological functions that were not previously expected (Mehrpour & Codogno, 2010). PrPc may be associated with the biology of many cancers, and overexpression of PrPc promotes the proliferation, invasion and metastasis of the gastric cancer cell (Liang et al., 2009;Mehrpour & Codogno, 2010;Pan et al., 2006). These data may provide new ideas and directions for the mechanism research and therapeutic strategy of LUAD. The GO analyses indicated that DEGs were significantly related to BP of ECM organization and process. Previous studies have reported that ECM remodeling promotes cancer progression and is associated with a poor prognosis in lung cancer patients (Kopparam et al., 2017;Xia et al., 2012).
The up-and down-regulated DEGs were simultaneously enriched into ECM process, which is consistent with the propensity for metastasis and highly invasive characteristics of LUAD.
In line with the GO analysis, among 11 hub genes identified by PPI network and modules analysis, EFEMP1, PTGER4, ADAMTS8, CDH13, MYH10 and METTL7A are associated with ECM process, and survival analysis indicated that down-regulation of these hub genes was associated with worse overall survival. EFEMP1 plays distinct biological functions in different tumors. In osteosarcoma and gliomas, EFEMP1 is overexpressed and promotes the invasion and metastasis of tumor cells in vitro and in vivo by activating the expression of MMP-2 and notch signaling, respectively (Hu et al., 2009(Hu et al., , 2012Wang et al., 2015), while in gastric cancer, endometrial carcinoma, hepatocellular carcinoma and lung cancer, EFEMP1 is downregulated, and is proposed as a prognosis biomarker (Chen et al., 2014a;Kim et al., 2014;Nomoto et al., 2010;Yang et al., 2013; The above report is inconsistent with our results, which may be due to the heterogeneity of the tumor and the limited number of samples. Therefore, subsequent large sample functional verification is required. ADAMTS8 encodes an inactive proenzyme and forms mature active enzyme by proteolysis (Apte, 2009). ADAMTS8 is identified as a secretory angiogenesis inhibitor that inhibits VEGF-mediated angiogenesis by blocking the EGFR signaling pathway (Choi et al., 2014;Dunn et al., 2006;Vazquez et al., 1999). Downregulated ADAMTS8 in some cancers including NSCLC (Heighway et al., 2002;Huang et al., 2019;Masui et al., 2001;Porter et al., 2004;Rodriguez-Rodero et al., 2013;Zhao et al., 2018) is associated with poorer prognosis (Drilon et al., 2014;Li et al., 2015;Porter et al., 2006). CDH13 encodes a member of cadherin superfamily. The hypermethylation in promoter region of CDH13 was frequently observed in lung cancer, and proposed to correlate to drug sensitivity and poorer prognosis (Kontic et al., 2012;Toyooka et al., 2006;Zhai & Li, 2014;Zhong et al., 2015). As the only upregulated hub gene, PRC1 is involved in the process of cytokinesis and is upregulated in breast cancer, hepatocellular carcinoma, and lung cancer (Jiang et al., 1998;Liu et al., 2018;Shimo et al., 2007;Zhan et al., 2017a). Recent research has found that it promoted proliferation and metastasis of LUAD cells via potentiating the Wnt/β-catenin pathway (Zhan et al., 2017b), and inhibiting the expression of PRC1 in LUAD cells by miR-1-3p was reported to suppress tumorigenesis (Li et al., 2019). Literature retrieval showed that the relation of LUAD and hub genes MYH10, METTL7A, FCER1G and TMOD1 has not been reported. MYH10 belongs to the myosin superfamily, which can regulate ECM remodeling (Kim et al., 2018;Kim, Kurie & Ahn, 2015). Methyltransferase METTL7A was found to facilitate Hepatitis C Virus propagation via recruitment of NS4B (Park et al., 2015). FCER1G encodes a high affinity IgE receptor which is associated with prognosis of renal clear cell carcinoma (Chen et al., 2017). As an aldehyde oxidase, TMOD1 is an actin-capping protein involved in regulation of the length and depolymerization of actin filaments. Overexpression of TMOD1 promotes cell proliferation of breast cancers and metastasis of oral squamous cell carcinoma (Ito-Kureha et al., 2015;Suzuki et al., 2016). However, the molecular mechanism of these newly identified hub genes in LUAD remains largely unknown, and further functional studies were warranted.
In present study, we have identified a set of candidate biomarkers that may play an important role in the progression of LUAD. These newly identified hub genes could be used as research subjects for exploring their roles in the disease process, so as to further deepen our understanding of the molecular mechanism of LUAD, and also as potential prognostic biomarkers for clinical validation studies to clarify their prognostic effects. However, there are still some limitations in this study. First, this study is based on bioinformatics analysis of published data and lacks experimental verification, and the study cannot determine whether there is a causal relationship between the differential expression of hub genes and disease progression. Finally, although we combined four GEO datasets, the number of samples is still relatively small, which may lead to potential unreliable results. Therefore, subsequent bioinformatics analysis and experimental verification with larger samples are necessary.

CONCLUSIONS
In summary, this study identified several DEGs by integrating four GEO datasets and extracted 11 hub genes from PPI network, among which 10 hub genes were shown to be related to the occurrence and development of LUAD and four hub genes have not been previously reported but may play an important role in LUAD. The molecular mechanism of these novel hub genes in LUAD is worthy of further study, and relevant prognostic model can also be constructed based on these genes for risk assessment, classification and prognostic judgment of patients with LUAD.

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
Tingting Guo performed the experiments, analyzed the data, prepared figures and/or tables. Hongtao Ma performed the experiments, analyzed the data, prepared figures and/or tables. Yubai Zhou conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

Data Availability
The following information was supplied regarding data availability: Data is available at NCBI GEO under accession numbers GSE118370, GSE32863, GSE85841 and GSE43458.