CogNet: classification of gene expression data based on ranked active-subnetwork-oriented KEGG pathway enrichment analysis

Most of the traditional gene selection approaches are borrowed from other fields such as statistics and computer science, However, they do not prioritize biologically relevant genes since the ultimate goal is to determine features that optimize model performance metrics not to build a biologically meaningful model. Therefore, there is an imminent need for new computational tools that integrate the biological knowledge about the data in the process of gene selection and machine learning. Integrative gene selection enables incorporation of biological domain knowledge from external biological resources. In this study, we propose a new computational approach named CogNet that is an integrative gene selection tool that exploits biological knowledge for grouping the genes for the computational modeling tasks of ranking and classification. In CogNet, the pathfindR serves as the biological grouping tool to allow the main algorithm to rank active-subnetwork-oriented KEGG pathway enrichment analysis results to build a biologically relevant model. CogNet provides a list of significant KEGG pathways that can classify the data with a very high accuracy. The list also provides the genes belonging to these pathways that are differentially expressed that are used as features in the classification problem. The list facilitates deep analysis and better interpretability of the role of KEGG pathways in classification of the data thus better establishing the biological relevance of these differentially expressed genes. Even though the main aim of our study is not to improve the accuracy of any existing tool, the performance of the CogNet outperforms a similar approach called maTE while obtaining similar performance compared to other similar tools including SVM-RCE. CogNet was tested on 13 gene expression datasets concerning a variety of diseases.


INTRODUCTION
Due to recent advances in DNA gene expression technology, it is now feasible to obtain gene expression profiles of tissue samples at relatively low costs. Data from genome-wide gene expression analyses are helping scientists and physicians understand the disease mechanisms and use that information to design platforms to assist in diagnosis, to assess prognosis, and to inform treatment plans. For instance, a study by Van 't Veer et al. (2002) collected gene expression data on primary breast tumors of 117 young patients. Machine learning with feature selection was used to identify a gene expression signature strongly predictive of a short interval to distant metastases ("poor prognosis" signature), even in patients that were lymph node negative.
Gene expression technologies are now producing large datasets associated with a variety of diseases. Due to the high dimensionality of the data and relatively small sample sizes, reliable interpretation of the data is a complicated and often overwhelming, and this is an important problem in bioinformatics research. Although sample sizes have continued to grow in recent years, new and efficient feature selection algorithms are still needed to overcome challenges in the existing methods (Vanjimalar, Ramyachitra & Manikandan, 2018), in order to achieve the full potential of this data in the development of gene-based diagnostic tests, drug discovery and therapeutic strategies for improving public health.
Most of the traditional gene selection approaches are borrowed from other fields such as statistics and computer science. There is a need for new computational tools that integrate the biological knowledge about the data in the process of gene selection and classification. Integrative gene selection incorporates biological domain knowledge to selection processes from external biological resources. The main aim of integrative gene selection is to generate a ranked list of features that provides high model performance and takes into consideration both statistical metrics applied on the gene expression data and the biological background information provided as external datasets. For example, biological background information may be Gene Ontology (GO) where it provides for each gene its product as Cellular Components (CC), Molecular Functions (MF), and Biological Processes (BP). GO is a way to capture biological knowledge in a computable form that consists from a set of concepts and their relationships with each other.
The various methods that have been applied to the process of selecting disease-specific features from large gene expression datasets were reviewed recently (Pan, 2002;Lazar et al., 2012) and fall into three major categories: "filters", "wrappers", and "embedded approaches". Briefly, the filter approach, not based on any machine learning algorithm, uses a statistic (ANOVA, t-test, etc.), wrappers use learning techniques to evaluate which features are useful, and embedded techniques combine the feature selection step and classifier construction. Pan (2002) recently compared different filtering methods, highlighting similarities and differences between three main methods: the t-test, a regression modeling approach, and a mixture model approach. Additional comparisons of filtering techniques are available in Lazar et al. (2012). Inza et al. (2004) also carried out a comparison between a filter metrics and a wrapper sequential search procedure applied on gene expression datasets.
Integrative approaches become important topics (Bellazzi & Zupan, 2007;Fang, Mustapha & Sulaiman, 2014) in the emerging field of gene expression. GO (Ashburner et al., 2000) was used by Qi & Tang (2007) for genes ranking based on not only their individual discriminative powers but also the powers of biological information contained in GO annotations. The algorithm is an iterative algorithm that starts by applying Information Gain (IG) to compute discriminative scores for each gene. Genes with a score of zero are removed from the analysis. The second step is to integrate the biological knowledge by annotating those surviving genes with GO term. The third step is to score the GO terms as the mean of its associated gene IG score. Move the gene with the highest IG from the GO term with the highest score, to the final list. This procedure is repeated until the final goal is reached.
SoFoCles (Papachristoudis, Diplaris & Mitkas, 2010) is an interactive tool that enables semantic feature filtering in microarray classification problems with the use of external biological knowledge retrieved from the Gene Ontology. SoFoCles involves the calculation of semantic similarities between two feature sets in order to derive an enriched, semantically-aware final feature set. The GO terms are used in order to give a similarity score for each annotated gene. Fang, Mustapha & Sulaiman (2014) proposed an integrative gene selection based on filter method and association analysis for selecting genes that are not only differentially expressed but also informative for classification. Association analysis was employed to integrate microarray data with Gene Ontology (GO) and KEGG Pathways (KEGG) simultaneously. The performance of the integrative models verified the efficiency and scalability of association analysis in mining microarray data.
An additional study that integrated KEGG with genetic meta information (DisGeNET (Piñero et al., 2019)) was proposed by Raghu et al. (2017). Their approach was a two-step analytical workflow that incorporates a new feature selection paradigm as the first step and that utilizes graphical causal modeling as the second step to handle the automatic extraction of causal relationships. Quanz, Park & Huan (2008) apply the method of pathways-as-features using the KEGG pathway database for the pathway extraction component and global test method for the pathway selection component. The genes in each pathway are then transformed into one single feature by mean normalization or logistic regression. The number of features of the transformed data is the number of pathways. For instance, for the diabetes data for which 17 pathways are selected, the dimensionality is reduced from 22, 283 to 17 for the classification task.
Unsupervised gene selection using biological knowledge-based GO terms was suggested by Acharya, Saha & Nikhil (2017). They have utilized gene annotation data, where each gene is represented as a structural information content (IC) based gene-GO term annotation vector which intuitively forms a gene-GO term annotation matrix for a selected data set. IC is the information content of a GO term is related to how often the term is applied to genes in the database.
A very interesting study that emphasizes the need for an integrative approach was conducted by Perscheid, Grasnick & Uflacker (2019). Their work compared the performance of traditional and integrative gene selection approaches. Moreover, they propose a straightforward approach to integrate external knowledge with traditional gene selection approaches. The framework enables automatic external knowledge integration, gene selection, and evaluation. The study shows that the integration of external knowledge improves overall analysis results.
Feature selection and discovering the molecular explanation of disease describe the same process, where the first one is a computer science term and the second one is used in the biomedical sciences.
Several tools are now available that allow users to break the fixed set paradigm in assessing the statistical enrichment of sets of genes. In this regard, the gene set enrichment analysis is a very important method. Recently, different approaches were developed and become useful tools in gene expression analysis (Cohn-Alperovich et al., 2016;Ulgen, Ozisik & Sezerman, 2019). PathfindR (Ulgen, Ozisik & Sezerman, 2019) is a tool for pathway enrichment analysis utilizing active subnetworks (An active subnetwork can be defined as a group of interconnected genes in a protein-protein interaction network (PIN) that predominantly consists of significantly altered genes). It identifies gene sets that form active subnetworks in a protein-protein interaction network using a list of genes provided by the user. It then performs pathway enrichment analyses on the identified gene sets. In most enrichment approaches, relational information captured in the graph structure of a PIN is overlooked as genes in the network neighborhood of significant genes are not taken into account. The approach pathfindR uses for exploiting interaction information to enhance pathway enrichment analysis is active subnetwork search. Briefly, active subnetwork search enables inclusion of genes that are not significant genes themselves but connect significant genes. This results in the identification of phenotypeassociated connected significant subnetworks. Initially identifying active subnetworks in a list of significant genes and then performing pathway enrichment analysis of these active subnetworks efficiently, pathfindR exploits interaction information between the genes. This, in turn, helps pathfindR uncover relevant mechanisms underlying the disease. Support Vector Machines-Recursive Cluster Elimination (SVM-RCE) is a machine learning algorithm based on grouping/clustering gene expressions for scoring each cluster of genes (Yousef et al., 2007). Interest in this approach has grown over time and a number of publications based on SVM-RCE that have successfully applied this approach to identifying those features directly associated with a disease/condition are being published. This increasing interest is based on a reconsideration of how feature selection in biological datasets can benefit from considering the biomedical relationships of the features in the selection process. The usefulness of SVM-RCE then led to the development of additional computational tools. Similar studies for SVM-RCE were carried out (Harris & Niekerk, 2018;Lazzarini & Bacardit, 2017) indicating the importance of the merit of SVM-RCE approach. The study of Deshpande et al. (2010) is a derivative of SVM-RCE algorithm with small modification for disease state prediction. Additionally, they have used our invented term "recursive cluster elimination".
Most interestingly, the study of Zhao, Wang & Chen (2017) has used the SVM-RCE tool for comparison tasks applied for detection on expression profiles for identifying microRNAs related to venous metastasis in hepatocellular carcinoma. SVM-RNE (Yousef et al., 2009) is a similar approach to SVM-RCE, and uses the GXNA (Nacu et al., 2007) tool to extract the genes networks from the gene expression data.
Those networks serve as groups/clusters of genes that are subject to the rank procedure. A similar study to SVM-RNE was carried out by Johannes et al. (2010) for the integration of pathway knowledge into a reweighted recursive feature elimination approach for risk stratification of cancer patients.
The term knowledge-driven variable selection (KDVS) is a similar term of integration of biological knowledge in the process of feature selection. In Zycinski et al. (2013), the authors proposed a KDVS framework, which uses a priori biological knowledge in highthroughput data analysis, and applied this framework to SVM-RNE.
The most recent tool that integrates biological knowledge for grouping the genes was maTE (Yousef, Abdallah & Allmer, 2019), which uses the same approach based on the interactions of microRNAs (miRNA) and their gene targets. The maTE approach is different from SVM-RCE and SVM-RNE in that it integrates additional input to the algorithm which is the information about miRNA and its target set.
The benefit of integration of biological knowledge led us to suggest a new tool called CogNet, that integrates biological knowledge derived from integrating the pathfindR tool into an integrative approach. In CogNet, the pathfindR tool serves as the biological grouping function allowing the main algorithm to rank active-subnetwork-oriented KEGG pathway enrichment analysis results. The details of the tool will be described in the following sections.

MATERIALS AND METHODS
The computational tool CogNet that we developed is based on the concept of integration of biological knowledge with machine learning in order to perform two tasks: the first task is ranking the groups of genes (in this case, pathway genes) and then use the top groups (significant groups) to build a machine learning model. Figure 1 displays the general Figure 1 General workflow for integrating biological information for grouping the genes by bioF() function. BioF() could be microRNA targets association, KEGG pathway association or other association.
Full-size  DOI: 10.7717/peerj-cs.336/ fig-1 workflow of integrating biological information for grouping the genes by a biological grouping functions. The CogNet components meet the general approach of the integration of biological knowledge with machine learning as described in Fig. 1. In fact, the tools SVM-RCE, SVM-RNE, and maTE also fit the general approach described in Fig. 1. Let us assume that we are given a two-class data D, which consists of k samples and n genes. Let us assume that the biological grouping function groups the n genes into m groups as following: bioF(g 1 ,g 2 ,…,g n ) = {grp 1 ,grp 2 ,…,grp m }. bioF() can also assign some genes to a specific group that contains genes about which there is no biological knowledge.
The bioF() function is used in order to group/cluster the genes using biological information that could be associated with a specific biological concept. For example, bioF() could group genes according to their miRNA targets (as the tool maTE did), or might be according to disease, meaning that groups of genes are associated with specific diseases.
The ranking step is based on the machine learning algorithm used. In order to estimate the significance of each group grp i of genes, the following algorithm is applied: 1. Create a new data D Ã that contains only genes from the grp i . 2. Apply cross validation using the ML algorithm.
3. Assign a score to grp i . The score is the average of the performance metric (could be accuracy, the area under the curve, f-measure, etc.)

pathfindR
In this section, we describe the pathfindR tool that serves as the bioF() function (Fig. 1) in the CogNet tool. Active-subnetwork-oriented KEGG pathway enrichment analysis of the proteins was conducted using the R package pathfindR (Ulgen, Ozisik & Sezerman, 2019). Active subnetworks are subnetworks within the PIN (BioGRID, by default) that have a locally maximal score (based on the provided significance values). Active subnetworks define distinct disease-associated sets of interacting genes, whether discovered through the original analysis or discovered because of being in interaction with a significant gene.
The workflow of pathfindR is presented in Fig. 2. After processing the input (to filter the differential expression results, the p value threshold was chosen as 0.2), pathfindR maps the input genes onto a PIN. Using the mapped genes, an active subnetwork search (with the greedy approach as default) is performed. The resulting active subnetworks are then filtered based on their scores and the number of significant genes they contain. This filtered list of active subnetworks is then used for enrichment analyses (overrepresentation analysis via hypergeometric-distribution-based tests), that is, using the genes in each of the active subnetworks, the significantly enriched pathways are identified. Enriched pathways with adjusted p values larger than the given threshold (0.2 was used) are discarded and the lowest adjusted p-value (overall active subnetworks) for each term is kept. This process of "active subnetwork search + enrichment analyses" is repeated for a selected number of iterations (default is 10), performed in parallel. Over all iterations, the lowest and the highest adjusted-p values, as well as the number of occurrences over all iterations are reported for each significantly enriched pathway.

The CogNet tool
The CogNet algorithm is based on the general approach of integrating biological information for grouping the genes as described in Fig. 1. Overall, the tool is composed of two components. The first component is the pathfindR step that is serving to generate the groups of genes, which are enriched KEGG pathways. Then, the second component is applied to rank those groups in terms of their contribution to separate the two-class data D. The workflow of the tool is presented in Fig. 3.
The CogNet starts by splitting the data into two parts. The first part is the training part that is used in order to rank the pathways produced by the tool pathfindR. The test part is utilized at the end of the algorithm in order to estimate the performance of the CogNet.
A list of genes and their p-values represented in a table is created to serve as input to pathfindR (Table 1). The list is computed using Student's t-test to assign for each gene its differential expression significance (p-value). The pathfindR tool is invoked to create n significant KEGG pathways, from now on, we refer to this as n groups of genes grp 1 ,grp 2 ,…, grp n . As a preprocess to the step of the ranking, each KEGG pathway is producing one data set out of the final data D that containing only genes of the specific pathway. Thus, n datasets D Ã i i = 1,2,…n, (with |grp i | as the number of genes) are created to be subject for the Rank step. Table 2 presents an example input table to Rank step. The Rank step is computed as a Monte Carlo cross-validation (MCCV), where for each D Ã i a score is assigned as the mean of computing accuracy of r iteration of splitting the data into two parts one for training and second for testing (for example, 80% training and 20% testing) applying random forest (or another ML algorithm such as SVM). The output of the stage is a sorted list of KEGG pathways according to the score assigned in the Rank step. Let refer to this list as grp Ã 1 , grp Ã 2 , …, grp Ã n . Table 3 presents an example of the output of Rank step.
The final step is to compute the performance of the CogNet by training the RF on top pathways and test on the main test data (from the first stage of the algorithm).
gene_list={} for i = 1 to n: a. gene_list = gene_list ∪ grp Ã n b. create Tr to be the data D Train with just genes belongs to gene_list.  One of the outputs of the CogNet is a table that reports the performance on top 1, 2,…, n. However, we have created a table only for the top 10. We should state that n is a variable and is dependent on the data. Sometimes, n may not reach 10 because pathfindR reports only the significant KEGG pathways. In the worst case, the output of pathfindR could be an empty list.
Another output is the significant list of KEGG pathways and also the significant genes. We run the CogNet algorithm multiple times (k times) where each time the data is split into 90% for training and 10% for testing. The CogNet keeps track of each KEGG pathway and the number of times appears on top of the list. Additionally, CogNet keeps track of the genes that belong to the pathway and how many times they were on the top. An example of input of top pathways and genes produced by CogNet is presented in Table 4.

Implementation
We have decided to use the free and open-source platform Knime (Berthold et al., 2009) due to its simplicity and very useful graphical representations. Additionally, Knime is a highly integrative tool. A script for performing analysis with pathfindR was imported as an R node to the Knime workflow.
The Knime workflow consists mainly of nodes where each node has its own functionality. Meta-node is created as a collection of nodes that has a specific task to perform.
The workflow Multi-File CogNet is presented in Fig. 4. It starts by uploading a list of the names of the datasets (URL) by the "List Files node". Then a loop over those datasets to read each data by the node "Table Reader" and send it as input to the CogNet tool (CogNet meta-node).
While Fig. 3 presents the flowchart of CogNet algorithm, Fig. 5 presents the implementation of CogNet as a Knime workflow with its meta-nodes. The input has two ports, where the first port is for the test data, the second port is for the training data. The training data is passed to the tool pathfindR (one of the meta-nodes) to process the data and get as an output a sorted table of significant pathways. The "RankPathWays" meta nodes perform the task of the ranking, while the flow between "Counting Loop start" and "Loop End" is for performing testing on top i pathways. It ranges from 1 to 10.
Additional task for the node "Loop End" is to collect all the results and send them out to be processed and save the results.

Gene expression data
A total of 13 human gene expression datasets were downloaded from the gene expression omnibus (Clough & Barrett, 2016) at NCBI. For all datasets disease (positive) and control (negative) data were available (Table 5). Those 13 datasets served to test the CogNet and for comparison with other two tools maTE and SVM-RCE.

Model performance evaluation
For each established model, we calculated a number of statistical measures like sensitivity, specificity, and accuracy to evaluate model performance. The following formulations were    (Bradley, 1997) is an estimate of the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
All reported performance measures refer to the average of 10-fold MCCV. Some of the data sets used by the classifier are imbalanced, which can influence the classifier to the advantage of the set with more samples. This is known as the problem of the imbalanced class distribution. We have applied an under-sampling approach that reduces the number of samples of the majority class to the minority class, thus reducing the bias in the size distribution of the data subsets. We choose to apply the under-sampling of ratio 1:2.

RESULTS
We have considered 13 gene expression data sets to test CogNet and for comparison with other similar tools. To our knowledge, no tools similar to CogNet exists. Nonetheless, we compare CogNet with tools that have similar merit of grouping and rankings, maTE and SVM-RCE. Although the purpose of the comparison is not to prove a higher performance, it outperforms maTE and gets similar performance to SVM-RCE with advantage of using a very smaller number of genes than SVM-RCE. Additionally, we have run CogNet with two versions, where the CogNet_SVM uses the SVM for the scoring and classification while the CogNet_RF uses Random Forest (RF) for Scoring and classification. The results for CogNet-SVM is provided in the Supplemental Data. Table 5 Description of the 10 data sets used in our study. The data sets are obtained from GEO. Each entry has the GEO code the name of the data, the number of samples and the classes of the data.

GEO Accession Title
For each tool other than SVM-RCE, we have obtained the performance over the top 1-10 groups that were ranked by the scoring stage. For SVM-RCE we obtained the performance starting by 1,000 genes and 100 clusters, then reduced 10% at each iteration. For comparing purposes, we consider the last 10 clusters. Table 6 presents in detail all the results obtained applying the CogNet_RF tool. The AUC measure is presented. The columns #G present the average of the number of genes over the 10 iterations was applied as a cross-validation. Figure 6 presents the average of all AUC results over the 13 datasets on the 10 clusters/ groups, for each tool (Avg AUC bar), while the Avg#Genes bar represent the average of the number of genes over the 13 datasets on the 10 clusters/groups.

Validation of the results
We have conducted further analysis using the datasets (GSE15573, GSE4107 and GSE55945) that was considered in the pathfindR study. GSE15573 consists of 33 samples corresponding to 18 Rheumatoid Arthritis (RA) Patients and 15 Controls. The aim of this study is to identify peripheral blood gene expression profiles for RA patients.
The study considers the standard statistical approaches in order to detect the significant genes by ANOVA with False Discovery Rate (FDR < 5%). The Gene Ontology (GO) in the PANTHER database is applied to identify biological processes. GSE4107 consists of 10 colonic mucosa of healthy control samples and 12 patient samples. Patients and controls were age-(50 or less), ethnicity-(Chinese) and tissuematched. The analysis to detect significant genes was based on T-test, hierarchical clustering, mean fold-change and principal component.
GSE55945 consists of 13 prostate cancer samples and 8 normal. The aim of the study is to compare the expression levels between malignant and benign prostate tissues.
We have run the CogNet tool on those three datasets obtaining the performance on top significant pathways. The AUC values are presented in Table 6. under study. Literature support for the top 10 pathways per each dataset are presented in Table 7.
For GSE15573, the rheumatoid arthritis dataset, 4 out the top 10 pathways were found to be supported by literature to be associated with rheumatoid arthritis biology. For GSE4107, comparing colorectal cancer patients to healthy controls, 5 out of the top 10 pathways were supported by literature to be associated with colorectal cancer. Finally, for GSE55945, the dataset comparing human prostate benign and malignant tissue, 6 out of the top 10 pathways were found to be associated with prostate cancer.
The CogNet results highlighted several new pathways that could contribute to the identification of innovative clinical biomarkers for diagnostic procedures and therapeutic intervention.

DISCUSSION AND CONCLUSIONS
We have presented a novel tool called CogNet that is based on ranking and classification and integration of biological knowledge. CogNet is developed on top of the tool pathfindR to add to its functionality the ability to perform classification using the enriched KEGG pathways as features. CogNet outputs to the user the performance and a list of significant KEGG pathways that we believe will contribute to a better and deep understanding of the data under investigation.
There is similarity between CogNet and maTE in that both rank a group of genes for classification tasks. However, CogNet is using the biological information of the KEGG pathways that was processed by an enrichment procedure to suggest a list of significant pathways, then CogNet ranks those pathways in terms of their contribution for separating the two-class of the given data (classification task). However, maTE uses prior information about microRNA and its target genes to group the genes. This grouping is not related to the expression of the genes as CogNet is considering. Moreover, SVM-RCE is using the clustering algorithm k-means in order to group the genes into clusters, that means that groups are related to the expression of the genes.
As a future work, we would develop CogNet to explore the effectiveness of different combinations of the KEGG pathways in the data, that means instead of ranking each pathway's genes individually, we will use a different approach to rank those groups simultaneously. In the current version, CogNet ranks each KEGG pathway individually by performing internal cross validation.
Additionally, we are working to integrate CogNet and maTE where the process of the rank will be applied on the groups generated by the biological functions used in each tool. As a result, the discovery of the significant pathways and microRNA targets genes will suggest to the biology researcher to explore the role of pathways and microRNA together in the same data.
The success of CogNet, maTE and SVM-RCE is suggesting that more computational approaches need to be developed based on the merit of integration of biological information into the machine learning algorithm.