Dissecting molecular network structures using a network subgraph approach

Biological processes are based on molecular networks, which exhibit biological functions through interactions of genetic elements or proteins. This study presents a graph-based method to characterize molecular networks by decomposing the networks into directed multigraphs: network subgraphs. Spectral graph theory, reciprocity and complexity measures were used to quantify the network subgraphs. Graph energy, reciprocity and cyclomatic complexity can optimally specify network subgraphs with some degree of degeneracy. Seventy-one molecular networks were analyzed from three network types: cancer networks, signal transduction networks, and cellular processes. Molecular networks are built from a finite number of subgraph patterns and subgraphs with large graph energies are not present, which implies a graph energy cutoff. In addition, certain subgraph patterns are absent from the three network types. Thus, the Shannon entropy of the subgraph frequency distribution is not maximal. Furthermore, frequently-observed subgraphs are irreducible graphs. These novel findings warrant further investigation and may lead to important applications. Finally, we observed that cancer-related cellular processes are enriched with subgraph-associated driver genes. Our study provides a systematic approach for dissecting biological networks and supports the conclusion that there are organizational principles underlying molecular networks.


INTRODUCTION
Molecular networks are the basis of biological processes, with biological functions emerging through interactions among the various genetic components. In network analysis, we often use hierarchical decomposition approaches (Freyre-González et al., 2008;Jothi et al., 2009) to simplify network complexity. Decomposition approaches assume that a complex network can be decomposed into small modules, providing explanations that can address increasing complexity. Based on this assumption, a network can be modeled by a collection of smaller modules known as network motifs. Each module is Network subgraphs (N-node subgraphs) vs. network motifs In this study, we hypothesized that network subgraphs are the fundamental building blocks of a network. In other words, subgraphs are treated as the core network components. This is similar to the work of Mowshowitz (Mowshowitz, 1968a), who proposed that a finite graph (V vertices and E edges) can be decomposed into equivalence classes (h classes). Each class contains v i vertices and a probability is assigned to each class; that is, p i = v i /V. The Shannon entropy associated with the graph can be viewed as a measure of the graph complexity relative to the given decomposition of its equivalence classes. Therefore, we propose that network properties are captured by subgraphs comprising N nodes, which are referred to as N-node subgraphs. To systematically characterize a large network, the 3-node subgraphs, 4-node subgraphs, and up to the N-node subgraphs embedded in the network are identified. These subgraph patterns are identical to the network motifs defined by (Alon, 2006).
We identify network subgraphs embedded within a molecular network by matching the network topology to the N-node subgraphs. However, we do not consider randomized versions of the studied network. Therefore, one question is whether the N-node subgraphs results differ from the N-node motif findings for a given network-that is, does the number of identified subgraph patterns match the number of motif patterns? We determined that the number of subgraph patterns identified in a network is the same as the number of motif patterns.
For a directed graph, a total of 2, 13,199,9,364 and 1,530,843 possible patterns can be defined for 2-node, 3-node, 4-node, 5-node and 6-node motifs, respectively (Harary & Palmer, 1973;Sloane & Plouffe, 1995). These sets of N-nodes motifs were used as our sets of N-node subgraphs. Since the problem of identifying N-node subgraphs in a large network is NP-complete (Kim, Diko & Rawson, 2013) we used 3-node and 4-node subgraphs only. Subgraphs composed of five or more nodes are neglected as a first approximation, which could provide useful insights into dissecting the design principles underlying molecular networks. Subgraphs composed of five nodes are currently under study.
An earlier work (Konagurthu & Lesk, 2008) showed that certain motifs do not appear significantly more frequently than motifs appearing in corresponding random networks. Nevertheless, those motifs still play functional roles. This justifies our approach because the present work identifies all possible 3-node and 4-node subgraphs, regardless of their occurrence frequency. In other words, we adopt the notion that N-node subgraphs are the basic building blocks, but do not necessarily occur frequently in a network. Adami et al. (2011) studied undirected colored graphs (where nodes are labeled with different colors) and showed that the relative frequency of the colored motifs can be used to define the information content of the network. Here, we consider subgraphs that are directed graphs which could contain cycles.

Spectral graph theory, reciprocity, complexity measures and information theory
Seminal works regarding the concepts of information content, topology and entropy in biology were carried out by Mowshowitz and Rashevsky (Mowshowitz, 1968a;Rashevsky, 1954Rashevsky, , 1955. In particular, Mowshowitz presented an entropy-based method to measure the complexity of a graph by decomposing it into equivalence classes (Mowshowitz, 1968a). To characterize network subgraphs, we utilized the following concepts: spectral graph theory (SGT), reciprocity, and complexity measures. SGT is a powerful approach that has been applied in many areas, including computer science and computational biology (Banerjee & Jost, 2009;Cvetković & Simić, 2011). The energy of a graph is an invariant (Gutman, 1992(Gutman, , 2001Li, Shi & Gutman, 2012) equal to the sum of the absolute values of the eigenvalues of the adjacency matrix A. Originally, the concept of graph energy introduced by Gutman was applied to study undirected graphs, and was used to estimate the ðelectron energy of hydrocarbons (Gutman, 1992).
The eigenvalues of the adjacency matrix A defined on a graph play an essential role in inferring the structural properties of the chemical graphs (Estrada & Benzi, 2017). Estrada and Benzi showed that the graph energy is a weighted sum of the traces of even powers of the adjacency matrix. Indeed, this finding can be used to obtain energy bounds for hydrocarbon molecules. The graph energy concept was extended from undirected graphs to digraphs by Brualdi (2010).
Molecular networks are directed graphs composed of feedback interactions. Reciprocity is a parameter that quantifies the degree of bidirectional connection of a network subgraph. Complexity arises from interactions among the constituent components. Many complexity measures have been proposed, but there is no standard or formal definition of complexity metrics that can be applied in all circumstances. Each complexity measure has strengths and weaknesses (Weyuker, 1988). Early work on defining complexity for directed graphs and infinite graphs can be traced back to Mowshowitz (1968b). The concept of graph complexity indices has been applied to infer the hierarchical order of chemical structures (Minoli, 1976). Given a network subgraph pattern, we used two commonly used complexity measures to characterize subgraphs.
It is possible that some network subgraphs are associated with the same graph energy and spectrum (the set of eigenvalues). In other words, different subgraphs with the same energy and spectrum are indistinguishable; that is, the cospectrality problem. Wilson & Zhu (2008) proposed to combine the spectra of two of the four graph matrices (the adjacency, Laplacian, signless Laplacian, and normalized Laplacian matrices) to reduce the cospectrality problem for undirected graphs and trees. Their method can effectively reduce the number of cospectral pairs of graphs, although the pairs are not completely distinguishable.
In addition, graph descriptors are a useful concept to classify complex networks (Mueller et al., 2011) In this study, we used a greedy algorithm to search for optimal parameters that maximize degenerate subgraph removal. The parameters we suggested include the subgraph spectrum, graph energy, reciprocity, and complexity measures.
The information-theoretic quantity, Shannon entropy, was proposed by Cover &Thomas (1991) andShannon (1948) to provide a precise definition of information randomness. Lower Shannon entropy indicates less information or more structure embedded within the data (i.e., a biased system). Shannon entropy has been extensively applied in cancer biology studies. For instance, a previous study reported a negative correlation between cancer protein-protein interaction network entropy and cancer aggressiveness (Conforte et al., 2019). In a recent study, Sen et al. (2019) studied the protein family sequences of four cancer types, and observed that the sequences tend to associate with higher Shannon entropy. Higher entropy implies a larger bias distribution of amino acid composition and protein structure disorder. Additionally, cancer networks exhibit high information entropy (Schramm, Kannabiran & König, 2010), increased network entropy (West et al., 2012), and signaling entropy (Teschendorff et al., 2015). Being motivated by the effectiveness of Shannon entropy in cancer studies, we used that quantity to measure the randomness of subgraphs frequency distributions of molecular networks.
In our previous work (Hsieh et al., 2015), we laid a foundation for the present study. In another recent work (Lee, Huang & Ng, 2016) we extended the previous work (Hsieh et al., 2015) by developing a subgraph-finding algorithm, PatternFinder, to identify 3-node and 4-node subgraphs in cancer networks, signal transduction networks, and cellular processes. Since the network size considered in this study is less than 160 nodes, computational time is not an issue. Figure 1 depicts the workflow of the present study.

Input data
Cancer is a highly heterogenous and complicated disease. The signal transduction networks are critically involved in modulating cellular processes of lung cancer cell , in cancer invasion (Di Domenico & Giordano, 2017), and in cellular invasion (Neth et al., 2007). We selected cancer-related molecular networks and analyzed their network structures. The KEGG database (Nakaya et al., 2013) provides a comprehensive collection of biochemical network information that were prepared in the KGML format (August 2017). Four families of networks were employed in the present study, including: (i) Environmental Information Processing, (ii) Cellular Processes, (iii) Organismal Systems, and (iv) Human Cancers.
Not every network recorded by KEGG was imported. After inspection, we disregarded networks composed of several disjoint components with repetitive regulatory structures ("Two-component system" and "MicroRNAs in cancer"), small networks with size less than 10 ("Chemical carcinogenesis" and "Viral carcinogenesis"). In addition, we collected the networks labeled with the name "signaling pathway", and called them "signal transduction networks (STNs)". We note that STNs range across different families in the KEGG classification, including "Signal transduction", "Immune system" and "Endocrine system".
In total, we collected 17 cancer networks, 45 STNs and nine cellular processes. We downloaded KEGG pathway KGML files and made use of the KEGGScape (Nishida et al., 2014) and KEGGparser (Arakelyan & Nersisyan, 2013) packages to visualize and save the node and edge information for each network.

Adjacency matrix
By analyzing the connectivity of each gene, one constructs an adjacency matrix A, to represent the interaction network. In total, there are 13 3-node subgraphs and 199 4-node subgraphs (Alon, 2006;Shen-Orr et al., 2002). In Supplemental File 1- Table S1 summarizes the nodes, edges, and subgraph-associated node information for the 17 cancer networks. The complete list of node and edge information of the 45 STNs and nine cellular processes can be found in Supplemental File 1-Tables S2 and S3, respectively. Real world molecular networks compose of thousands of genes, which is larger than the networks we analyzed; however, the regulatory and feedback interaction information among thousands of genes are not available in KEGG yet, and it can be considered if the data are available. Each subgraph can be represented by a decimal, the graphical representation of the 3-node subgraphs and 4-node subgraphs are depicted in Supplemental File 2.
It is possible that some subgraphs are subgraphs of other subgraphs (structural subgraphs). In a previous work (Sporns & Kötter, 2004), such subgraphs are called functional motifs. In a brain network, a structural motif and functional motif represent an anatomical building block and the elementary processing mode of a network, respectively.
We have developed an algorithm named PatternFinder to enumerate all possible subgraphs embedded in the 3-node subgraphs and 4-node subgraphs. Details about PatternFinder are given in Supplemental File 1- Table S4. As we highlighted in "Network subgraphs (N-node subgraphs) versus network motifs", we addressed the question whether the number of subgraph patterns identified in a network is the same as the number of motif patterns or not. Firstly, we selected three cancer networks, that is, acute myeloid leukemia (AML), breast cancer and colorectal cancer, obtained from the KEGG database. Secondly, we extracted the subgraph patterns and the motif patterns from the three networks by using the PatternFinder, LoTo (Martin et al., 2017) and acc-Motif (Meira et al., 2014) algorithms respectively. The number of randomized networks is set equal to 1,000 times in the acc-Motif experiment. The acc-Motif algorithm identified network motifs with size up to five nodes, and the algorithm was improved to find motifs up to six nodes (Meira et al., 2014).
Characterization of network subgraphs: graph energy, reciprocity and graph complexity The adjacency matrix A can be expressed in terms of its eigenvectors and eigenvalues. Since A is a nonsymmetric matrix in general, its eigenvalues may be complex and all of its eigenvectors are nonorthogonal. Let n, e and d i denote the number of nodes, number of Figure 1 The workflow of the present study. Molecular network information were obtained from the KEGG database (August 2017). Network subgraphs were identified using PatternFinder and then network subgraphs were characterized using: graph energy, reciprocity and graph complexity. A code was developed to determine the minimal set of parameters required to label network subgraphs based on a greedy strategy. The Shannon entropy for 3-node subgraphs and 4-node subgraphs of 71 molecular networks were computed.
Full-size  DOI: 10.7717/peerj.9556/ fig-1 edges and degree of the ith node of graph G, respectively; G is called an (n, e)-graph. The energy of a graph G, E(G), is defined by where a i denotes the ith eigenvalue of A. The sum of all of the eigenvalues is always equal to zero. Assume that the graph energy eigenvalues are labeled in descending order: that is, a 1 ≥ a 2 ≥ … ≥ a n , while the whole spectrum is denoted by Sp(G) = [a 1 , a 2 , … a n ]. The largest eigenvalue is referred to as the spectral radius of graph G (Cvetkovic, Doob & Sachs, 1980).
In spectral graph theory, there are two other matrices-Laplacian (Chung & Graham, 1997) and signless Laplacian (Cvetkovic & Simic, 2009, 2011) that can be defined to characterize graphs. The Laplacian matrix L and signless Laplacian matrix Q of a graph G are defined as L = D -A and Q = D + A respectively, where D is a diagonal matrix in which the diagonal elements are the node degrees. The Laplacian energy of a graph G, LE (G), is defined by where |β i | denotes the absolute value of the ith eigenvalue of L. There is an analogy between the properties of E(G) and LE(G), but some significant differences remain between these two quantities (Gutman & Zhou, 2006). The signless Laplacian energy of graph G, QE(G), is defined by where jg i j denotes the absolute value of the ith eigenvalue of Q. A more general definition of graph energy was suggested by Nikiforov (Nikiforov, 2007(Nikiforov, , 2016. Let M be an n × n real matrix and the singular values be denoted by s 1 , s 2 , … s n .
The singular values of M are equal to the positive square roots of the eigenvalues of MM t , where t denotes matrix transpose. Let M equals A, L, or Q and consider the eigenvalues of AA t , LL t , and QQ t . The total energy, ME, obtained from M, is defined by is called generalized energy. We extend the definition to consider matrix products of the form MN t , and therefore define three additional energies: AL t , AQ t , and LQ t . We call these asymmetric generalized energies. The sums of the absolute values of the eigenvalues of MM t and M t M are the same. This also holds for MN t and NM t . Therefore, one needs to compute MM t and NM t only. The advantages of using asymmetric generalized energies will be demonstrated later in this article. To the best of our knowledge, no (or few) previous studies have made use of asymmetric generalized energies to characterize network subgraphs. In total, we have devised nine graph energies to describe the subgraphs. We also note that Adiga, Balakrishnan & So (2010) proposed a parameter named skew energy, obtained from the skew-adjacency matrix, to characterize directed graphs; however, this parameter does not apply to graphs consisting of multiple arcs (multigraphs).
Several studies (Garlaschelli & Loffredo, 2004;Squartini et al., 2013) have suggested that reciprocal links in directed networks play an important role in dynamical processes and network growth. The traditional definition of reciprocity is R = L $ =L, where L $ and L denote the number of "edges pointing in both directions" and the total number of edges respectively. This definition of reciprocity was modified by Garlaschelli and Loffredo (McCabe, 1976), who defined reciprocity r as the correlation coefficient between the entries of the adjacency matrix A, given by where r equals one if there is an edge from node i to node j; the average, a, is defined by A positive value of r indicates that the subgraph has bidirectional connections, whereas a negative r implies that the subgraph has either an in-connection or out-connection.
To further understand the connectivity structure of network subgraphs, we seek metrics that can be used to measure graph complexity. In software engineering, the cyclomatic complexity (CC) is a metric developed by (McCabe, 1976) to measure the complexity of a program by using its control flow graph. CC is defined by the expression CC = e -N + 2P, where e and N denote the number of edges and number of nodes of the graph, and P denotes the number of predicate/exit nodes (McCabe, 1976;Weyuker, 1988). Node and edge denote a program unit and the execution order of the program. CC depends only on the global decision structure (the number of edges and nodes) of a program. In addition to CC, we utilize the algorithmic complexity measure, the Kolmogorov complexity (KC), to characterize graph complexity. Essentially, the KC of a bit string is given by the smallest computer program that can generate the string. Given the adjacency matrix (or the equivalent bit string), we use the block decomposition method (BDM) (Soler-Toscano et al., 2014) to determine the KC for both 3-node (Zenil, Kiani & Tegner, 2016) and 4-node subgraphs. A bit string with a high KC has a higher degree of randomness, contains more information, and is less compressible. A complete graph has a smaller KC value, whereas a random graph has higher KC and is less compressible.

Unique identifiers for network subgraphs
Every 3-node subgraph and 4-node subgraph has a different KC value, so the KC can be used as a unique identifier. However, given the graph energy, asymmetric graph energies, graph energy spectrum, reciprocity, and CC, we seek to determine a minimal set of parameters that can serve as a label of the network subgraphs. This set of parameters describes certain aspects of the subgraphs differently than the algorithmic complexity measure. To the best of our knowledge, the concept of using energy, reciprocity, and CC in labeling network subgraphs is novel. The pseudocode for determining the minimal set of parameters is based on a greedy strategy and is described in Supplemental File 1- Table S5.
Frequently found subgraphs, network entropy Given a molecular network, PatternFinder identifies both the sets of 3-node subgraphs and 4-node subgraphs. Two subgraphs with the same ID may partially embed the same genetic element(s); these two subgraphs are counted twice in our calculations. We expect that certain subgraph patterns that occur with higher probabilities are the dominant underlying network structure. Let p k ð Þ 3 denote the frequency (probability) distribution of a 3-node network subgraph, where k denotes one of the 13 patterns. The Shannon entropy for 3-node subgraphs and 4-node subgraphs, H 3 and H 4 , of a molecular network are computed. The normalized Shannon entropies for the 3-node subgraphs and 4-node subgraphs are given by H 3R = H 3 /log 2 (13) and H 4R = H 4 /log 2 (199), respectively.

Association of network subgraphs and driver genes
In this part of calculation, we propose to examine the association of network subgraphs and driver genes for cancer networks, STN and cellular processes. It is commonly believed that driver genes are genes that give selective advantage to cancer development, whereas passenger mutation genes do not alter selectivity pressure, they contribute indirectly to cancer formation (Bozic et al., 2010;Greenman et al., 2007).
We noted that most of the studies are focus on mutation driver genes prediction (Tokheim et al., 2016). Also, it was found that certain motif positions; such as, the source nodes and target nodes, are enriched in cancer-associated genes (Awan et al., 2007;Carson et al., 2015). There is no or relative few works on estimating how often is the driver genes embedded within network subgraphs. In this study, we conducted an analysis to determine whether driver genes are enriched or depleted in network subgraphs or not. The level of enrichment or depletion was evaluated by using odds ratio. Driver gene data were collected from the Cancer Gene Census (CGC) database (Futreal et al., 2004). The CGC resource provides list of genes known to be involved in cancer. CGC has documented the mutation information of cancer driver genes which are supported by rather extensive evidence in the literature (both from mutation studies and activity measurements). In the present CGC release, the database has collected 576 driver genes.
Given a molecular network, consider the 2 × 2 contingency table (Table 1), which depicts the statistics of driver genes and non-driver genes embedded in "subgraph module" and "non-subgraph module". Subgraph module denotes the collection of the 3-node subgraphs and 4-node subgraphs found by PatterFinder; whereas, non-subgraph module denotes otherwise. We use the odds ratio (OR) to estimate the level of enrichment.
OR measures the relative odds of finding driver genes embedded in network subgraph modules relative to non-subgraph modules. The OR is defined by OR ¼ pðdriver gene j subgraph moduleÞ 1 À pðdriver gene j subgraph moduleÞ pðdriver gene j non subgraph moduleÞ 1 À pðdriver gene j non subgraph moduleÞ If OR is greater than one, it means that network subgraphs are enriched with driver genes.

RESULTS
The results of subgraph and motif patterns identified by PatternFinder, acc-Motif, and LoTo are summarized in Table 2. We set the number of randomized networks equal to 1,000 when executing acc-Motif. Note that the tool LoTo identifies motifs composed of three nodes only. The number of subgraphs and motifs identified by the three algorithms are similar. However, there are two major differences. First, not every motif identified by acc-Motif is associated with a significant p-value. For instance, the number of 3-node subgraphs identified for AML is four. None of the identified motifs is significant according to acc-Motif (Table 2). In addition, both PatternFinder and LoTo identified four 3-node subgraphs for the breast cancer network, but acc-Motif identified only one significant 3-node motif. Second, the results for the 4-node subgraphs and motifs are given in the bottom half of Table 2. Clearly, PatternFinder can identify the complete set of subgraphs, but acc-Motif identified relatively few network motifs with p < 0.05. Furthermore, acc-Motif did identify two 4-node subgraphs, "id_456" and "id_2186", for AML and breast cancer, respectively. This suggests that acc-Motif has certain limitations and cannot enumerate all possible network substructures. Essentially, the results suggest that the subgraph-based approach delivers similar results to the motif-based approach. The graphical and ID representation of the 3-node and 4-node motifs defined by acc-Motif are depicted in Supplemental File 3. This can help the reader to relate subgraph IDs to the IDs defined by acc-Motif.

Subgraph module
Non-subgraph module Total Driver genes a b a + b Non-driver genes c d c + d Total a + c b + d a + b+ c + d Table 2 The results of the subgraph patterns and motif patterns identified by   there are eight subgraphs that are not composed of any 4-node functional subgraphs: subgraph "id_14" (SIM), subgraph "id_28," subgraph "id_74," subgraph "id_76" (MIM), subgraph "id_280," subgraph "id_328" (cascade), subgraph "id_392," and subgraph "id_2184". Thus, these eight subgraphs exhibit irreducibility. However, each of the eight subgraphs is embedded with exactly one 3-node functional subgraph. In other words, given the 4-node subgraphs, the irreducible property does not apply if we consider subgraphs composed of three nodes. Supplemental File 4 summarizes the 3-node functional subgraphs embedded in the 3-node subgraphs. Supplemental File 5 summarizes the 3-node and 4-node functional subgraphs embedded within the 4-node subgraphs. In these files, integers "1" and "0" denote the presence and absence of a functional subgraph, respectively. Table 3 summarizes the results of the nine graph energies and edge information for the 3-node subgraphs. First, since some matrices, such as L and Q, are asymmetric, their eigenvalues are complex. In fact, among the 3-node subgraphs, subgraph "id_98" has a pair of complex conjugate eigenvalues, and their associated eigenvectors are composed of complex components. Second, graph energy is correlated with the number of subgraph edges. For instance, the graph energies of fully-connected 3-node and 4-node subgraphs are maximal, despite having different energy definitions. Third, it is common for certain subgraphs to have the same graph energy. That is, energy-degenerated subgraphs are common. Two subgraphs are said to be equienergetic if they have the same total energy. For instance, two subgraph pairs ("id_6" and "id_36," and "id_14" and "id_74") are equienergetic, regardless of the graph energy definition. The 3-node and 4-node subgraph Table 3 The results of the nine graph energies for the 3-node subgraphs. E is the graph energy, LE is the Laplacian energy, QE is the signless Laplacian energy. Matrix product of the bilinear form MM t is the so-called generalized energy, where M and N = A, L and Q. t denotes matrix transpose. The asymmetric generalized energy matrix is denoted by NM t . energy and eigenvector results are given in Supplemental File 6. Fourth, although the results of the nine graph energies are quite similar, there are differences among them. For instance, the energy level multiplicity is somewhat different. For the 3-node subgraphs, the multiplicities of graph energy E, 0, 2 and 2.83 are 4, 4 and 1, respectively. For QE, there are three energy values: 2.67, 4.00 and 5.33, which are associated with the multiplicity of 3, 3 and 3, respectively. Fifth, energy-degenerated subgraphs may or may not have identical spectra, Sp(G). This suggests that Sp(G) could allow for further distinction between subgraphs. More details are given below in the "Unique identifiers for network subgraphs" section. In Supplemental File 1- Table S6 summarizes the lower (E min ) and upper (E max ) graph energy bounds and ratios for the 3-node and 4-node subgraphs. For the 3-node subgraphs, the ratios (E max /E min ) are bounded between 2 and 4.91. The ratios are slightly larger for 4-node subgraphs. They are bounded between 3.00 and 6.88. We observed that most molecular biological networks are not composed of subgraphs with large graph energies. Therefore, the maximum ratio cannot be achieved. Details are reported below in the "Network subgraphs absent from the network" section. Table 4 depicts the traditional reciprocity R, reciprocity r, and a for the 3-node subgraphs.

Subgraph reciprocity
Most R values are zero, which indicates that no edges point in both directions. Positive and negative values of r denote the presence of cycles. Of the 13 reciprocity values, nine are negative, meaning that most 3-node subgraphs have only in-or out-connections. We note that subgraphs containing one or two cycles can still have negative reciprocity values. The complete sets of R, r, edges, and a values of the 4-node subgraphs are given in Supplemental File 7.

Cyclomatic complexity and Kolmogorov complexity
For the 3-node subgraphs, Table 5  A network subgraph with a large CC value suggests a more complex decision structure. From Table 5, it is apparent that KC can serve as a parameter for distinguishing subgraph patterns without any degeneracy. In other words, no two subgraphs have the same KC. This is also true for 4-node subgraphs. Subgraph "id_238" is a complete graph described by the binary string "011101110," which corresponds to lower algorithmic complexity (fifth rank).
Next, we examined the correlations between the two complexity measures. We ranked CC and KC in ascending order and computed their Spearman Rank Correlation Coefficients (SRCC). We observed that the correlation is not perfect. For example, subgraph "id_238" is associated with the largest CC value (rank), but this is not the case for KC (fifth rank). CC and KC show a relatively weak correlation-0.083 and 0.381-at the 3-node and 4-node levels, respectively. This is because CC and KC have different meanings: CC measures the complexity of the subgraph decision structure (the number of independent gene regulation paths), while KC is an algorithmic measure which characterizes the randomness and compressibility of a bit string. Finally, we investigated the relationship between graph energy and complexity (Supplemental File 1- Table S7). KC exhibits a modest correlation with all the graph energies at the 3-node and 4-node levels. In contrast, CC exhibits relatively weak and modest correlations with graph energy at the 3-node and 4-node subgraph levels.
Supplemental File 1- Table S8 summarizes the results of strength of SRCC (including minimum, maximum, and ranges) between graph complexity and energy for 3-node and 4-node subgraphs. Our results suggest that there are relatively weak (3-node CC) and modest correlations (3-node subgraph KC and 4-node subgraph C. and KC) between graph complexity and graph energy.

Unique identifiers for network subgraphs
This section reports the results of determining an optimal parameter combination that maximizes the removal of degenerated subgraphs. As shown in Table 6, three cases are considered. "Case A" uses graph energy only, "Case B" utilizes graph energy, r and CC, and "Case C" employs energy, r, CC, and the energy spectrum. After including r and CC, we can distinguish more subgraphs. Using AL t , r, CC, and energy spectrum can fully distinguish the 3-node subgraphs. For 4-node subgraphs, the use of LL t , QQ t and LQ t achieves the best level of distinguishability: 136 out of 199 (68.3%). Compared with E. LE, and QE, both symmetric and asymmetric generalized energies serve as superior measures for distinguishing different subgraph patterns.

Network subgraphs found with high occurrence frequency
Among the 17 cancer networks, 45 STNs, and nine cellular processes, there are 15 (88.2%), 40 (87.0%) and seven (77.8%) networks, respectively, where more than 70% of nodes are embedded in both 3-node subgraphs and 4-node subgraphs. Therefore, subgraphassociated nodes account for a major portion of each network. We enumerated all possible Table 6 The number of distinguishable subgraphs using optimal combinations of graph energy, r, CC and energy spectrum. Note: "Case A" uses graph energy only, "Case B" uses graph energy, r and CC and "Case C" uses graph energy, r, CC, and graph energy spectrum.
To determine frequently-occurring subgraphs, we tabulated the occurrence frequency of each subgraph pattern, and normalized the frequency to one. Figure 2 summarizes the normalized frequency distribution of the 3-node subgraphs for the cancer networks, STNs, and cellular processes. Subgraphs id_6, id_12 and id_36 dominate the three classes of molecular networks.
For cancer networks, the three most frequently found 3-node subgraphs are id_6, id_12 and id_36 (Supplemental File 1- Table S9). By examining the top three subgraphs, we observed the following common features: (1) they do not contain any subgraphs (irreducible); (2) they are composed of a minimal edge number (N-1 edges for a N-node subgraph); (3) the reciprocity r values are negative (−1/2) and those subgraphs have no cycles; (4) they account for more than 15% of the frequency counts; and (5) they are associated with the lowest or the second lowest graph energies, regardless of the graph energy definition.
The subgraphs ranked 4 to 7 ("id_38", "id_74", "id_14" and "id_98") have three edges. Subgraph "id_38" is the so-called feed-forward loop (FFL), which does not contain cycles, Figure 2 The plots of the normalized frequency of the thirteen 3-node subgraphs for (A) the cancer networks, (B) the Signal transduction networks and (C) the cellular processes. Certain subgraphs are never present in the three molecular network types. In other words, while networks are built from a finite number of subgraph patterns, certain subgraphs associated with large graph energies are not present.
Full-size  DOI: 10.7717/peerj.9556/ fig-2 whereas both "id_74" and "id_14" contain cycles. The seventh-ranked subgraph ("id_98") is the so-called 3-cycle. Subgraph "id_12" is a subgraph of "id_38", "id_74" and "id_98", while "id_36" (MIM) is a subgraph of FFL, and SIM is a subgraph of "id_14". In other words, the frequently-occurring subgraphs are the simplest and are subgraphs of more complex subgraphs. For 4-node subgraphs, the above features (1-4 but not feature 5) are also valid for the top seven most frequently occurring subgraphs. Interestingly, the irreducible and negative reciprocity value (−1/3) features hold at the 4-node level. In addition, feature (5) holds if we consider graph energies E. LE and QE, but not the other six graph energy definitions. Furthermore, the above five features also hold for STNs and cellular processes. Supplemental File 1-Tables S9-S11 summarize the top seven most frequently occurring 3node subgraphs and 4-node subgraphs in cancer, STNs and cellular processes, respectively.
We observed that the normalized frequency of subgraphs id_6 and id_12 is higher than id_36 for cancer networks and STNs (Fig. 2), which suggests that the underlying network architecture is highly similar. In addition, the tails of the normalized distribution of the three types of networks are zero or nearly zero, indicating that molecular networks are composed of a finite number of subgraph patterns-approximately seven patterns.
We observed that only two of the cellular processes composed of 3-node subgraph patterns are associated with large graph energies. The first is the "adherens junction" network which consists of a 3-node subgraph (id_110) composed of three genes: CTNNA1, ACTB and AFDN. The second is the "Signaling pathways regulating pluripotency of stem cells" network. We identified a fully-connected 3-node subgraph (id_238) with three feedback loops connecting three genes: Oct4, Sox2, and Nanog. Including LIN28, these four genes can reprogram human somatic cells into pluripotent stem cells (Yu et al., 2007). Figure 3 summarizes the normalized frequency distribution of the 4-node subgraphs for the cancer networks, STNs and cellular processes. Eight subgraphs (id_14, id_28, id_74, id_76, id_280, id_328, id_392 and id_2184) dominate the three molecular network classes. The normalized frequency from the first 120 subgraphs is shown (Fig. 3). The rest of the subgraphs have zero or nearly-zero normalized frequency. Again, these results indicate that molecular networks are composed of a finite number of subgraph patterns.
Next, we examined the association of frequently-identified subgraphs and complexity measures. We observed that frequently-identified subgraphs have a lower KC ranking (smaller KC value) ( Table 5). A smaller KC value implies a lower degree of randomness, less information, and higher compressibility. However, this observation is not exact at the 4-node level. Indeed, there are three instances where the rank of KC is larger. For instance, the KC rank is as high as 16, 24 and 33 for id_2184, id_280 and id_392, respectively. No obvious association exists between frequently-occurring subgraphs and CC measures. The ranking information can be found in Supplemental File 8.
Network subgraphs that are absent from the networks It is interesting that certain subgraphs are never present in the three molecular network types. In other words, while networks are built from a finite number of subgraph patterns, certain subgraphs associated with large graph energies are not present. In Table 7, we summarized the results of the 3-node and 4-node subgraph patterns present in the 17 cancer networks, 45 STNs, and 9 cellular processes. Among the 13 possible 3-node subgraph patterns, only 7 patterns can be identified in the 17 cancer networks. The absent patterns are: id_46, id_78, id_102, id_108, id_110 and id_238. These six missing patterns are shown in Fig. 2 as boxes with zero height. For the cellular processes, the complete set of 3-node subgraphs were observed. In the case of 4-node subgraphs, only certain patterns were identified in the three network types. If there are subgraphs absent in a network, there is an associated graph energy cutoff. Hence, the network is characterized by the graph energy cutoff. To the best of our knowledge, we are the first to discover this  Table 7 The results of the number of possible 3-node subgraph patterns and 4-node subgraph patterns found in the 17 cancer networks, 45 STNs and nine cellular processes.
Cancer network STN Cellular process 3-node subgraph 7/13 11/13 13/13 4-node subgraph 38/199 88/199 77/199 Note: The maximal number of possible 3-node subgraphs and 4-node subgraphs are thirteen and 199 respectively. feature. We depict the results of the energy cutoffs (cutoff), maximum graph energies (max), and ratios (cutoff/max) for the cancer networks in Supplemental File 1- Table S12. Among the nine graph energy definitions, the ratios may be as high as 0.750 and 0.667 for 3-node subgraphs and 4-node subgraphs, respectively. However, they can also be as low as 0.536 (LQ t energy) and 0.481 (AQ t energy) for the 3-node subgraphs and 4-node subgraphs, respectively. The results of the STN and cellular process graph energy cutoffs and ratios are given in Supplemental File 1-Tables S13 and S14. For cellular processes, the cutoff ratio may be as high as 1.00 for the 3-node subgraphs, because we identified a fully connected 3-node subgraph (id_238). At the 3-node level, two of the cellular processes ("adherens junction" and the "Signaling pathways regulating pluripotency of stem cells") exhibit peculiar network structures. This findings may worth for further investigation in future study. Our results suggest that there is an energy cutoff or ratio that constrains the presence of certain subgraphs embedded within a molecular network. In addition, the data indicate that the ratio for 3-node subgraphs is slightly higher than for 4-node subgraphs. Furthermore, the subgraph-normalized frequency distribution and the graph energy obeys an inverse relationship, with smaller probability associated with higher graph energy.

Characterizing the subgraph frequency distributions
We utilized the entropy-based quantity, normalized Shannon entropy, H R , to quantify the frequency distributions of the subgraph occurrence for the cancer networks. For a randomized distribution, H achieves the maximal values, 3.700 (log 2 (13)) and 7.637 (log 2 (199)), for 3-node subgraphs and 4-node subgraphs, respectively. Figs. 4A-4C depict the H R plots for 3-node subgraphs (H 3R ) and 4-node subgraphs (H 4R ) for the cancer networks, STNs (Supplemental File 1-Tables S15) and cellular processes (Supplemental File 1- Table S16). For the networks we studied, the normalized frequency distributions are not uniformly distributed among the subgraph patterns. Therefore, H 3R and H 4R are different from one. We also note that the HIF-1 signaling pathway and PPAR signaling pathway have relatively small H 3R and H 4R values. This is likely because the transcription factors HIF-1 and PPAR function as master regulators of many genes. Therefore, the SIM subgraph is the dominant subgraph at both 3-node and 4-node levels. The median H 3R and H 4R values for the three network types range from 0.35 to 0.46 (Supplemental File 1- Table S17). This result suggests that the network subgraph distributions are not uniform, and that certain subgraph patterns have a higher occurrence probability.

Association of network subgraphs and driver genes
The results of the association of network subgraphs and driver genes for cancer networks, STNs and cellular processes are given in Supplemental File 9. The odds ratios are listed in the last column. An odds ratio greater than one indicates that driver genes are enriched in the subgraph module (consisting of 3-node and 4-node subgraph genes, but the subgraphs are not necessarily interconnected). Among the 17 cancer networks, seven have an odds ratio greater than 1. For the 45 STNs, 21 have an odds ratio larger than 1. Among the nine cellular processes, six have an odds ratio greater than 1.
We noted that there are four networks have unusually large odds ratios (Supplemental File 9). These four networks are: "Calcium signaling pathway", "Fc epsilon RI signaling pathway", "Signaling pathways regulating pluripotency of stem cells" and "Regulation of actin cytoskeleton" (5.263, 5.000, 5.700 and 2.933, respectively). Large odds ratios indicate that the four networks are highly enriched with subgraph-associated driver genes. Calcium signaling proteins are related to cell migration, metastasis, and cancer progression (Faris et al., 2018;White, 2017). According to the "Oncology Drug Pipeline Project", the Fc epsilon RI signaling pathway is a major target for cancer drug development (https://www.prnewswire.com/news-releases/fc-epsilon-ri-signaling-pathway-inoncology-drug-pipeline-update-2015-300114859.html). Also, this pathway is the most significant interaction pathway in sorafenib-treated hepatocellular carcinoma (Liu et al., 2015). Regulation of pluripotent stem cells is critically related to oncogenesis (Katoh, 2011;Klein et al., 2018). A proposed anticancer therapy strategy involves killing cancer cells by suppressing the activity of pluripotency transcription factors (Liu, Yu & Liu, 2013).

DISCUSSION
Network subgraphs play an important role in biological networks. We used a rigorous mathematical and systematic approach-spectral graph theory-to characterize 3-node and 4-node network subgraphs. Further, we introduced nine graph energies and reciprocity to characterize network subgraphs. We stated in "Network subgraphs (N-node subgraphs) versus network motifs" that our analysis does not consider randomized version of subgraphs analysis. In case of using the randomization approach, different results were obtained. Description of the randomization steps and the results of the graph energies and reciprocity were given in Supplemental File 10. In addition, we characterized network complexity using two widely-accepted complexity measures, CC and KC. Chemical molecule complexity indices are useful measures that have been used in predictive pharmacology and toxicology (Basak, 1987) and to characterize the structural features of chemical structures (Randic & Plavsic, 2002;Yamaguchi, Aoki & Mamitsuka, 2003). Similar in spirit to Minoli (1976) and the previous three references, a possible application of CC and KC measures is to infer the hierarchical order of molecular network structures. The concept of a unique identifier or descriptor was introduced to label network subgraphs. This novel idea combines four parameters-graph energy, reciprocity, CC, and eigenvalue spectrum-to characterize a network subgraph. Such molecular descriptors have been utilized to design inhibitors against Alzheimer's disease (Gurung et al., 2017) and kidney cancer (Feng et al., 2010). A foreseeable application of our unique descriptor is to examine the transition between subgraphs and mutated subgraphs found in normal and disease networks. In disease states, the regulatory interactions among genetic elements may be disrupted or activated because of genetic mutation or epigenetic modification, resulting in different interactions among the nodes. In addition, driver mutations are likely enriched or depleted in certain subgraph positions, such as the source node of a subgraph (Awan et al., 2007;Carson et al., 2015). A source node is a node that has only outgoing edges. In other words, a mutated driver gene acts as an upstream regulator. Previous studies reported that certain subgraph positions, such as source and target nodes, are enriched in cancer-associated genes. To examine this issue, cancer-specific gene mutation data (provided by the GDC-TCGA database) can be combined with subgraph node identities. Such work will be performed in our next study.
Some 3-node and 4-node subgraphs are interconnected through shared genetic elements. These modules are reported in our previous work (Hsieh et al., 2015). Interconnected subgraphs can be merged to form higher-order network structures. Indeed, coupled subgraphs perform specific functions. For instance, coupled FBLs form dynamic motifs that are often found in cellular networks (Kim, Yoon & Cho, 2008) and show oscillatory behavior (Tsai et al., 2008).
We extended our algorithm, PatternFinder, to identify complete sets of 3-node and 4-node subgraphs for 17 cancer networks, 45 STNs and nine cellular processes. Except for a few networks, 3-node and 4-node subgraphs account for more than 70% of the nodes in the studied networks. Additionally, our study revealed the following features: (1) the relative entropies of the subgraph distributions are not equal or close to one, indicating that the identified subgraphs are not uniformly distributed among the 13 and 199 patterns; (2) molecular biological networks are built from a finite number of subgraph patterns and certain subgraphs with large graph energies are not present. Hence, each network can be characterized by a graph energy cutoff; (3) irreducible subgraph patterns are the most frequently-observed subgraphs. For instance, the cascade pattern is the most frequently found subgraph, followed by the SIM and the MIM subgraphs; and (4) the three network families exhibit the above features, suggesting that there is a universal organization principle determining the underlying network architecture.

CONCLUSIONS
In conclusion, this study provides a systematic and rigorous approach to dissecting the underlying structure of biological molecular networks. SGT serves as a powerful approach to distinguish different subgraph topologies or connectivity patterns, which inspired our hypothesis that network structures can be understood in terms of 3-node and 4-node subgraphs. The next step is to test our hypothesis by analyzing 5-node subgraphs (Zaenudin et al., 2019 ). We expect that our efforts will further elucidate the complex nature of molecular networks.