Predictive models for stage and risk classification in head and neck squamous cell carcinoma (HNSCC)

Machine learning techniques are increasingly used in the analysis of high throughput genome sequencing data to better understand the disease process and design of therapeutic modalities. In the current study, we have applied state of the art machine learning (ML) algorithms (Random Forest (RF), Support Vector Machine Radial Kernel (svmR), Adaptive Boost (AdaBoost), averaged Neural Network (avNNet), and Gradient Boosting Machine (GBM)) to stratify the HNSCC patients in early and late clinical stages (TNM) and to predict the risk using miRNAs expression profiles. A six miRNA signature was identified that can stratify patients in the early and late stages. The mean accuracy, sensitivity, specificity, and area under the curve (AUC) was found to be 0.84, 0.87, 0.78, and 0.82, respectively indicating the robust performance of the generated model. The prognostic signature of eight miRNAs was identified using LASSO (least absolute shrinkage and selection operator) penalized regression. These miRNAs were found to be significantly associated with overall survival of the patients. The pathway and functional enrichment analysis of the identified biomarkers revealed their involvement in important cancer pathways such as GP6 signalling, Wnt signalling, p53 signalling, granulocyte adhesion, and dipedesis. To the best of our knowledge, this is the first such study and we hope that these signature miRNAs will be useful for the risk stratification of patients and the design of therapeutic modalities.


INTRODUCTION
Head and neck squamous cell carcinoma (HNSCC) is the 6th most common type of cancer (Bray et al., 2018) and is associated with 650,000 new cases and ∼330,000 deaths annually worldwide (Boscolo-Rizzo et al., 2018;Kapoor & Kumar, 2019). The majority of the HNSCC cases are Oral Squamous Cell Carcinomas (OSCC) (Deshpande & Wong, 2008). More than 90% of HNSCC are associated with OSCC patients (Vigneswaran & Williams, 2014). The incidence rates (mainly OSCC (Collaboration, 2019)) are higher in South Asian countries such as India (Poddar et al., 2019), Bangladesh (Collaboration, 2019), and Pakistan (Akhtar et al., 2016) as compared to other parts of the world. There are several known risk factors of HNSCC such as chewing tobacco, smoking cigarettes, excessive alcohol consumption (Stenson, Brockstein & Ross, 2016) and oncogenic virus such as Human papillomavirus (HPV) (Marur et al., 2010). Additionally, epigenetic regulation, mutation, copy number variation (CNV) and immune host response also play a key role in carcinogenesis (Network, 2015;Leemans, Snijders & Brakenhoff, 2018). Despite current advancement in cancer diagnosis and treatment, the overall 5-year survival rate is less than ∼50% in HNSCC due to a lack of proper diagnostic markers and targeted therapies (Wiegand et al., 2015).
It has been well documented that detection in early stages leads to higher survival as compared to late stages in various cancers including HNSCC. The American Joint Committee of Cancer staging (TNM) describes an early stage, primary tumor as ∼2-4 cm in diameter without lymph node proliferation and metastasis (TNM stage I and II). The tumor is considered advanced (late stage), if it is larger (>5 cm) and has spread into either nearby lymph nodes only (TNM stage III) or has metastasized to other parts of the body also (TNM stage IV) (https://www.cancer.org/treatment/understanding-yourdiagnosis/staging.html).
A lot of new research have happened in HNSCC in the past few decades, however without clinically meaningful discoveries. While there are some biomarkers (e.g., HPV +ve and −ve), however, they lack important features, such as high specificity and sensitivity, low cost, and short turnaround time. A quick and accurate diagnosis would have numerous benefits for the patients such as proper treatment resulting in reduced morbidity as well as and improve treatment outcomes. Unfortunately, there is no such universally accepted biomarker for HNSCC accepted for clinical use. There is an urgent need for more effective therapies, and clinically relevant biomarkers to stratify patients in HNSCC.
The micro-RNAs (miRNAs) are ∼18-25 nucleotide long non-coding RNAs. They can regulate mRNA expression by interacting with the 3 untranslated regions (UTR) leading to mRNA degradation. These miRNAs by virtue of their control over mRNA expression have important regulatory roles such as regulation of cell division, cell maturation, angiogenesis, proliferation, migration, invasion, metastasis, autophagy, and apoptosis (MacFarlane & Murphy, 2010). However, in various diseases especially cancers, these miRNAs themselves can get dysregulated leading to pathological conditions (Esteller, 2011). A large number of miRNAs have been quite well characterized for their biological function in cancer and their ability to regulate the expression of different cancer pathways (Iorio & Croce, 2012). It is also known that the changes in miRNA expression profile can be detected even before the appearance of clinical symptoms in some cancers (Bianchi et al., 2011). These miRNA due to their stability and ease of detection (in tissues as well as biological fluids) offer a rational approach for the development of excellent biomarkers (Mostert et al., 2011). The analysis of miRNA expression profiles may also offer an insight into underlying tumor progression and/or identifying new therapeutic targets.
Mathematical modeling has been widely used in disease modeling, classification and molecular function prediction for long. The advancement of next generation sequencing (NGS) technology and the availability of massive sequencing data have opened new avenues in disease process understanding and monitoring by machine learning. The machine learning techniques are generally used for risk stratification, mutational frequency prediction, CNV, and new target identification. It has been suggested that machine learning (ML) techniques can be utilized for diagnosis and prognosis in cancers (Obermeyer & Emanuel, 2016). For instance, the ML techniques have been used for the detection of Rat Sarcoma (RAS) activation pathways in cancers utilizing expression data, SNPs (single nucleotide polymorphisms) and CNVs (copy number variations) (Way et al., 2018). A multi-parametric Decision Support System (DSS) with a multitude of heterogenic data (clinical information, genomics, and imaging data) has been used to predict the OSCC progression and potential relapse (local or metastatic) (Komura & Ishikawa, 2018). A study by Avissar et al. (2009) for identification of the miRNAs to predict the presence of HNSCC identified miR-221 and miR375 to be predictors. However, the study was done without considering stages/grade. Recently, a machine learning study used neural networks to predict recurrences in tongue cancer (Alabi et al., 2019). Kim et al. (2019) developed predictive models for survival prediction in oral cancer patients. There is no report to date that stratifies the clinical stages in HNSCC patients using miRNA expression profiles. The identification of late stages often involves cumbersome examination and invasive tests; therefore, such studies could be immensely useful in accomplishing better diagnosis and clinical outcomes. Such strategy can also be extended to the cancers that are difficult to reach and detect such as pancreatic cancer.
In the current study, we have analyzed the miRNA expression patterns in HNSCC patients in order to identify miRNA signatures that can distinguish stages (early and late stage) using machine learning approaches. It has to be remembered that the clinical stage is known to play an important role in the overall survival of the patients. These studies have led to the identification of signatures that can efficiently stratify the HNSCC patients into early and late clinical stages as well as to stratify patient's risk that may help in the decision of the treatment regimens e.g., whether it is effective to use less aggressive treatment than highly toxic therapies.

MATERIALS AND METHODS
The overall strategy involves the following steps. (1) data processing (2) data distribution into training (70%) and test set (30%) (3) model building (4) 10-fold repeated internal cross validation (5) independent external cross-validation. The detailed workflow has been provided in Fig. 1. The studies were divided into three parts (i) Identification of diagnostic signature: Involving the identification of signature miRNAs for patient's classification into early and late stage (ii) Identification of prognostic signature: evaluation of independent prognostic marker using L1-Cox Proportional Hazards Model with penalized regression to predict the risk to the patient (iii) Functional analysis: The identification of pathways and processes regulated/influenced by the identified biomarkers.

Data retrieval and pre-processing
The normalized reads of miRNAs (Illumina HiSeq 2000 platform, miRgene-level, Normalized), mRNAs (llumina HiSeq 2000 platform, Gene-level) expression were retrieved from the TCGA (https://cancergenome.nih.gov). A total of 528 samples (484 primary tumors and 44 Normal Adjacent Tissue (NAT)) with associated clinical information such Figure 1 The schematic representation of workflow. The study was divided into four parts. (A) The expression data (mRNA and miRNA) along with clinical information of HNSCC patients was obtained from the TCGA and was preprocessed by Z -score normalization and balancing (ROSE/SMOTE) prior to model building. The data was then systematically divided into training (70%) and test (30%) sets. (B) Identification of signature miRNA for early and late stage classification was accomplished using the miRNA expression data. The training set was used for feature selection and model building using machine learning techniques viz. Random Forest (RF), Support Vector Machine using radial kernel (svmR), adaptive boosting (adaBoost), Average neural network (avNNet), and Gradient Boosting Machine (GBM). The test set was used as an external test set to rigorously and independently assess the performance of the generated model. as TNM stage, status, gender, and age etc. (Table S1) were selected whereas the samples with unknown stage information were removed. Finally, the data contained 27, 74, 78 and 274 samples from stage I-IV respectively (total 453 samples). The stages I and II were treated as early stage (total 101 samples) and stage III and IV as later stage (total 352 samples). Further, miRNAs and mRNAs expression profiles were normalized using the limma-voom (Law et al., 2014) library package in R3.3.5.

Data scaling
In the current study the data was scaled using Z-scaling. The following equation was used for the scaling of data set.
Here, x − x, represented the expression variance. While σ is the standard deviation of miRNAs expression. The x is the normalized miRNAs expression. The 'Z-score' method represent that data scaled to a mean of zero and a standard deviation of 1.

ML model building for stage classification
The ''caret'' library of R package (Kuhn, 2012) that contains several machine learning, complex regression and classification methods was utilized to develop various ML algorithm based classification models. The machine learning techniques were employed for building the models for patient classification into clinical stages using miRNA expression profiles. In this study, four widely used machine learning (ML) methods namely RF, svmR, AdaBoost, and avNNet were used. The random forest (RF) method commonly uses non-linear regression models and has been applied in a variety of computational studies. It is a simple, interpretive, and flexible method which allows for a large number of predictor variables. Additionally, it also predicts well in the small sample sizes and high genetic heterogeneity (Alexopoulos, 2010). It comprises building trees from bootstrapped training data in such a way that each split is based on a random sample of m variables from a full set of n attributes (Liaw & Wiener, 2002). (2) Here IG is Gini impurity of a node n is 1-the sum of overall predictor as j of the fraction of each pi square Support Vector Machine (SVM) is popular ML technique in medical research for molecular function prediction, genomics variation, histopathology image classifications and subtyping of diseases (Bianchi et al., 2011). It is a non-probabilistic algorithm classifier based on the hyperplane line to maximize the maximum margin to separate two classes based on the two vectors points to achieve the best fit classification (Huang et al., 2018).
Here, function (x) is the kernel function which calculate the similarity between training and predicted x,x i and α i represent the parameter corresponding to each training and predictor variable. β 0 isthe constant.
The AdaBoost is an algorithm for producing strong classifiers from the weak classifier. It is based on the penalized weighted matrix of each instance and voting them according to their weight scheme. It is a popular machine learning method for both balance and unbalanced data sets (Zemel & Pitassi, 2001). The AdaBoost equation is following Where, h t (x) is the out of weak classifier of t for input x and α t is weight given to model.
Here, E is based on the error rate of model.
The avNNeT is a kind of neural network capable of learning nearly infinite number of mapping functions. The neural network (NN) behaves like a natural human neuron. Every predictor variables (inputs) connects to an output response variable (output), either directly or through backward and forward propagation of single and several of hidden nodes to calculate units (neurons). The depth of an NN corresponds to the number of hidden layers. The calculation performed in NN are performed in a hidden layer called deep network (Huang, Zhu & Siew, 2004). The NN is commonly used for calculation of gene expression, CNV and clinical data for prognosis prediction of complex diseases such as cancer (Mirza et al., 2019).
Here, a l j of the jth neuron in the lth layer is related to the activation in the (l − 1)th layer. The σ is vectorising function. The W l is weight matrix of each layer, l. The jth row and kth column is w l jk . Also, for each layer l as define j a bias vector b l j . The GBM algorithm is used to convert weak learners into strong learners. It is the tree-decision boosting based classification model where each of instances assigned an equal weight at the start. After the examination of first steps, then increase the weight of those observation that are difficult to classify and lower the weight for those that are easy to classify. This is the iterative process to optimized the best fit for classification (Friedman, 2001). The general equation used for the GBM is given below Here, m-th step fit the decision tree hm (x). jm number leaves in each steps. The b jm is the value predicted in the region of R jm .
Data balancing : A challenge in biomedical studies is working with imbalanced data sets i.e., unequal number of normal and disease samples. Unbalanced predictive variable ratio do not meet the assumptions of the machine learning models and its predict biased. Therefore, SMOTE algorithm (Chawla et al., 2002) was used to balance the data using the 'DMwR' library in R.3.3.5 package. The SMOTE is a popular algorithm to deal with unbalanced data (Lusa, 2013;Ramezankhani et al., 2016). There are two types of sampling strategies commonly employed in machine learning viz. oversampling and under-sampling for unbiased classification. In under-sampling, the samples are reduced based on k nearest neighbor (kNN) clustering centroid distance while in oversampling minority classes are amplified to balance both predictors. In our data set, the number of late stage patient samples was reduced so as to match the number of early stage patient samples (N = 101) (Fig. S1A) using the under-sampling method and in oversampling (Fig. S1B), early stage data was amplified (N = 352) to balanced early and later-stage samples.
Thereafter, the datasets (under and over sampled) were systematically distributed into training (70%) and test set (30%) separately. In under-sampled data, 141 and 61 samples were used as training and test set respectively whereas in oversampled data 478 and 206 samples were used as training and test set respectively. First, the model was trained using only training set the test set was kept aside for independent external cross-validation. A 10-fold repeated internal cross-validation with 10-time iterations to randomized data set was done to avoid generation of over/under fitting model. The cost functions were optimized (100 to 3,000 with 100 steps per iteration) to achieve accurate classification. The performance of the generated model was examined based on sensitivity, specificity, accuracy, and AUC (ROC), precision and MCC by external independent data. Equations are given below: Where TP is true positive; TN is true negative; FP false positive; and FN is false negative, MCC is Matthews correlation coefficient.

Patient's risk assessment based on signature miRNA expression
Further, we performed L-1 Penalized Estimation in the Cox Proportional Hazards Model using optimal cross-validated likelihood to identify prognostic markers based on the survival status (alive/dead), survival time and normalized miRNAs expression (Acharya et al., 2017). The Kaplan-Meier curve and the Log-rank method were used to estimate the significance of median miRNA expression on the survival of the patients. It evaluates the independent survival of the patient between the two risk groups (low/high). The patient's clinical parameters such as age, survival status, gender, metastasis, TNM stages, and median expression of significant miRNAs were evaluated for their association with patient's survival using cox-regression for a univariate and multivariate analysis.
The risk score for individual patients was calculated using L1-penalized (LASSO) likelihood Cox-regression (Goeman, 2010). As earlier, the data was partitioned into training (70%) and test (30%) to assess the performance of the generated models. The prognostic value of individual miRNAs was first calculated by the L1-penalised Cox univariate proportional hazard regression in R tool using a 'penalized' package. The significant miRNAs (p-value < 0.05) were used in subsequent Cox multivariate analysis. The prognostic model was built by the linear combination of the expression level of significant miRNAs multiplied with the Cox regression coefficient (β). The standard formula was as follows: Risk Score = X 1 β 1 + X 2 β 2 + ··· + X n βn + C.
Using the median risk score the patients were divided into high-and low-risk groups. The time-dependent receiver-operating characteristic (ROC) curve was created using ''survivalROC'' package in R to evaluate the performance (specificity and sensitivity) of the generated miRNA prognostic signature (Chawla et al., 2002).

The mRNAs target identification for identified miRNAs
The mRNAs targets of the identified biomarkers (diagnostic and prognostic) were predicted using the IPA (Ingenuity Pathway Analysis). The IPA predict miRNAs target based on three evidence (i) experimental (ii) high confidence (Kozomara & Griffiths-Jones, 2014) and (iii) low confidence. The experimental and high confidence interactions are curated by experts. In the current studies only experimentally validated and high confidence miRNA-mRNA interactions were used. The miRNAs can have a broad range of mRNAs targets and many of them may not be relevant to the disease under consideration. Therefore, the predicted target mRNAs were considered only if they are also significantly differentially expressed in the HNSCC. As a result, 2 sets (set 1 with 230 and set 2 with 262 mRNAs) were finally selected for diagnosis and prognostic miRNA targets. The miRNA-mRNAs interactions were further visualized for better understanding using Cytoscape3.7 (Shannon et al., 2003).

Pathway enrichment analysis
The previously identified genes (230) of set and 262 were further used for the identification of pathways and other molecular functions, network and toxicity enrichment analysis in the IPA (Ingenuity Pathway analysis). The IPA can analyze enriched biological pathways, disease and functions, upstream and downstream interactions, molecular function and toxicity for a given gene set. It also gives the statistical significance for each enriched process.

Gene ontology (GO) enrichment
The Gene Ontology is a vocabulary developed to compare and identify genes based on the processes in which they are involved. The GO is defined into three categories viz. Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). In the current studies, the enrichment analysis was performed using PANTHER (Mi et al., 2019) (http://www.pantherdb.org/geneListAnalysis.do) and enrichR (https: //maayanlab.cloud/Enrichr/) with p-value < 0.001 for identification of significant GO terms.

RESULTS
One of the goals of this study was to develop an efficient miRNA signature to correctly classify the early and late stage patients using miRNAs expression profiles using the machine learning algorithms. Further, a least absolute shrinkage and selection operator (LASSO) penalized regression model was employed to identify signature miRNAs for their prognostic potential. The network and pathways analysis revealed that several target genes involving various tumor suppressor and oncogenes in cancer regulatory pathways may be affected by the aberrant expression of the identified miRNAs. The important results are given below. ). An analysis of the common and stage specific miRNAs revealed that there are 148 miRNAs common to early and late stages, whereas 23 and 26 miRNAs are specific to early and late stages respectively (Fig. 3A) (Table S3). Similar to the DE-miRNAs analysis, the DE-mRNAs of 101 early stage with 44 normal adjacent tissues (NAT) samples were resulted in the 3831 differentially expressed mRNAs (2407 over and 1424 under-expressed) with (|log2FC ≥ 1| and p-value < 0.05) (Fig. S3).

Identification of miRNA signature for classification of early and late stages
An important objective of this study was to identify the miRNAs signature for the early and late (TNM stage) using the miRNAs expression profiles. The signature miRNAs were identified using the ensemble machine learning feature selection method (Random Forest (RF), Support Vector Machine Radial Kernel (svmR), Adaptive Boost (AdaBoost), averaged Neural Network (avNNet), and Gradient Boosting Machine (GBM)). A detailed method for feature selection by minimum Redundancy Maximum Relevance (mRMR) algorithm is given in Fig. 3B. The top 20 predictor miRNAs were identified using each of the machine learning methods (RF, svmR, AdaBoost, avNNet and GBM) and the common among them were chosen for further model building and evaluation (Fig. 3C, Table S4). The ensemble ML based mRMR feature selection method reduce the bias that might be introduced by one or two methods. As a result, a 6 miRNA (let-7c, miR29c, miR-96, miR-133a, miR-196a, miR-519a) with average importance ranging from 90.03 to 78.27 for classification were selected for further studies. The mean importance of these miRNAs is shown in Fig. 3D. Among them, the miR-96 and miR-196a are over-expressed while the rest (miR-133b, miR-29c, miR-519a and let-7c) are under-expressed in normal vs primary tumor samples (Fig. S4). It is interesting to note that miR-519a is significantly differentially expressed in early stage only.

The performance of generated models for stage classification
The 6 miRNAs were selected by all five machine learning methods. The model was built using a training set and an independent test set (70:30) ratio was used for model prediction.  avNNet and GBM was 0.87 ± 0.021, 0.86 ± 0.032, 0.87 ± 0.035, 0.85 ± 0.032, and 0.80 ± 0.038 respectively for oversampled data (early and late = 352 samples) The details of the internal top 10 models of ML is given in the Table S5. The detailed performance of each ML algorithm is given in Figs. 4A & 4B. The final model performance was evaluated by the independent test set using accuracy, sensitivity, specificity, AUC, precision, and MCC. The accuracy for RF, svmR, AdaBoost, avNNet, GBM was found 0.87, 0.87, 0.87, 0.75, and 0.84 for test set generated using oversampled data. Whereas the accuracy was 0.79, 0.76 0.78, 0.68, 0.77 for the above five methods for the test set generated using under-sampled data . The MCC (0.65, 0.62, 0.64, 0.59, and 0.68 was for RF, svmR, AdaBoost, avNNet and GBM) gave better model performance over other confusion matrix categories. The robustness of model was evaluated using the area under curve (AUC) of Receiver Operating Characteristics (ROC) given in Figs. 4C & 4D for over and under sampled data. The RF and AdaBoost found the subsequently higher prediction accuracy in the under and over sampling methods and avNNet was the least prediction accuracy. The overall classification result showed, the generated 6 miRNAs signature can stratify the HNSCC patients on clinical TNM stages with reasonable accuracy based on expression profile.

Risk score calculation
To assess the prognostic performance of the generated model the time-dependent ROC curve was considered for risk score (low and high) calculation. The overall 5year risk score was calculated of the normalized expression with the coefficient of the LASSO cox-coefficient. As stated in methodology, the risk score can be used to assess the risk for a patient based on the median value expression. The risk score was calculated using the risk score equation as given below and plotted using Kaplan-Meir plot Fig. 5. Kaplan-Meir plot for each of the signature miRNAs is given in Fig. S5. Risk Score = (0.022 × let − 7 expression ) + (−1.996 × miR-96 expression ) + (6.593 × miR-9 expression ) + (1.361 × miR-143 expression ) + (−2.938 × miR-379 expression ) + (−1.249 × miR-545 expression ) + (−1.969 × miR-658 expression ) + (−1.446 × miR-3926 expression ) − 1.505.
The patient samples were systematically divided into training (368 samples) and test sets (160 samples). The 5 risk score (high and low) prediction was evaluated by AUC. The AUC for the training set and test set was found to be ∼0.86 and ∼0.79, respectively which indicated good performance of the identified signature (Figs. 6A & 6B). This time-dependent risk prediction analysis clearly showed that these miRNAs can be used for risk assessment of HNSCC patients.

Univariate and multivariate analysis for prognostic evaluation
The identified miRNA signature was evaluated against various clinical variables to check their effect on survival using the standard Cox-proportional hazards regression (Table 1). It indicates that advancing pathological stages may have a significant adverse impact on the survival of the patients. The Cox univariate analysis showed that pathological stage, T stage, N stage, metastasis and signature miRNAs were significantly associated with the prognosis. While in multivariate analysis pathological T stage (p-value = 0.017) and signature mRNA (p-value = 0.023) showed significant p-values associated with the poor prognosis (Table S6).

The microRNA-mRNA interactions
It was imperative to understand the role of the identified miRNA signatures in different biological processes. A mapping of the mRNA targets of the identified signatures (6 miRNAs for stage classification and 8 for prognosis) resulted in the identification of 1532 and 1446 mRNAs, respectively. Further, the analysis was done to select those mRNAs that get differentially expressed in various stages of HNSCC. It was found that for the identified diagnostic and prognostic miRNA signatures, there are 2 set of mRNA' target (set 1 with 230 and set 2 with 262) mRNA that were differentially expressed respectively in the HNSCC patients (Tables S7 & S8).

Biological pathways, diseases functions and network enrichment analysis
As indicated earlier, the pathways enrichment analysis was performed by the Ingenuity Pathway Analysis (IPA) separately for diagnostic targets (set 1) and prognostic targets (set 2). Analysis of set 1: The top enriched pathways for diagnostic targets were (i) GP6 signalling (ii) Neuroinflammation signalling (iii) Intrinsic Prothrombin Activation Pathway (iv) granulocyte adhesion and diapedisis (Fig. 7). Other enriched pathways are given in Table S9. The KEGG pathways analysis also revealed that majority of the genes were associated with pathways such as cytokine-cytokine receptor interaction, small cell lung cancer, transcriptional misregulation in cancer, ECM-receptor, and glutathione metabolism (Table S10). While WikiPathways enrichment showed, nuclear receptors meta-pathways (ID: WP2882), TGF-beta signalling pathways (WP366), PI3K-Akt-mTOR-signalling pathway (WP4172) ( Table S11). The pathway analysis revealed that these miRNAs targets are highly enriched with cancer associated pathways. The top disease and disorders associated with the set 1 are organismal injury and abnormalities, cancer, connective tissue disorder, and skeletal muscle disorder. Top molecular functions and cellular functions were linked with cellular movement, cellular assembly and organization, cellular function and maintenance, cell death & survival and cell development (Fig. 8) (Table S12). The top upstream regulators were miR-29c, TGFB1, estrogen receptor, histone h4, and CREB which are associated with cell growth and proliferation. The miR-29, a potent tumor suppressor, is found under-expressed in this study. Interestingly, the top analysis of the upstream regulatory genes showed that miR-29 directly targets 11 genes (9 over-expressed and 2 under-expressed) associated with head and neck cancer (Fig. S6). A thorough look at these mRNAs revealed that many of them are transcription factors (total 31) and part of cancer gene panels (total 19) (Sondka et al., 2018). These mRNAs were used in the subsequent functional analysis step. It was imperative to note that many of the transcription factors and cancer censes gene are targeted by the let-7c which is well known for its role in a variety of cancers (Fig. S7). The toxicity enrichment analysis revealed increased levels of alkaline phosphatase (ALP). Interestingly, the serum ALP level have been found to be significantly higher in OSCC patients (Acharya et al., 2017).
Analysis of set 2: The pathway enrichment analysis by IPA showed that the top enriched pathways are neuroinflammation signalling, small cell lung cancer, MAPK signalling, and cell cycle: G2/M DNA damage checkpoints ( Fig. 9) (Table S13). Top prognostics signature miRNAs are directly targeting several key cancer regulatory pathways such as p53 signalling, ATM signalling, EMT pathways etc., (Fig. 10). The KEGG pathways analysis also revealed that majority of the genes were associated with cancer progression such as protein digestion and absorption, ECM-receptor, transcriptional misregulation in cancer, and focal adhesion.
In case of WikiPathways, the ECM and membrane receptors (ID: WP291), miR-509-3p alteration of YAP1/ECM axis (WP3967), and focal adhesion-PI3K-Akt-mTOR-signalling pathway (WP3932) were found enriched (Table S14). The prognostics signature target different carcinogenic gene and pathways. The molecular target of these eight signature miRNA were predominantly associated with tumor-suppressors and oncogenes in different pathways. Most of the miRNA targets are directly associated with HNSCC progression and proliferation. These miRNAs also targeting different cancer regulatory pathways i.e., p53 signalling, ATM signalling, regulation of the EMT pathways, EIF2 signalling, HOTAIR Regulatory pathways, and molecular mechanisms of cancer. The pink marked mRNAs are known to be associated with head and neck cancer. Full-size DOI: 10.7717/peerj.9656/ fig-10

DISCUSSION
The HNSCC is associated with high mortality and morbidity. It is highly heterogeneous and identification of robust, potential and reproducible biomarker is a major challenge.
The TCGA provides detailed transcriptomic data with clinical information that can be gainfully used for better risk prediction and disease management. The ML techniques, riding the recent advancement in software, hardware, and sequencing technologies, are being used to identify patterns or factors (mutation/variation) for personalized diagnostics, prognostics, and therapeutics (Kann et al., 2019;Topol, 2019).
The alteration in miRNA expression in tissue as well as in biological fluids has been reported to be an important biomarker in various cancers. Interestingly, during the course of this work, we came across the announcement by Toshiba company (https://www.toshiba.co.jp/rdc/rd/detail_e/e1911_06.html) about a device that can predict the presence of 13 different cancers using the miRNA expression in blood alone with high (>99%) accuracy. Thus proving beyond doubt that miRNA expression data can be utilized for the development of a simple yet highly accurate screening tool for various cancers. The etiology and aggressive behavior of HNSCC are not well understood at the molecular level. It calls for the studies to improve the molecular understanding leading to overall better clinical outcomes. Therefore, the main purpose of this study was to identify a set of signature miRNA using ML techniques to correctly classify the early and late-stage along with other prognostic biomarkers. The main findings of the study are outlined below: The machine learning techniques identified biomarkers with good accuracy: The analysis resulted in the identification of 6 miRNA signature which predicted the early and late stages with good accuracy on external data set. The good accuracy in classification of the samples in the external set further validated the good classification ability of the generated models. It is pertinent to note that the stage is also a significant prognostic indicator.
The evaluation of the identified prognostic signature indicated that these miRNAs are significantly associated with poor survival of the patient and thus it can be a good prognostics biomarker. The IPA, KEGG and WiKipathways analysis also revealed that target of these miRNAs are highly associated with the cancer initiation and progression. The molecular functional analysis of the identified miRNA also indicated that they directly or indirectly regulate several oncogenes, tumor suppressor genes and transcription factors.
The data balancing by enhanced sampling can improve accuracy: The small cohort and imbalanced data highlight the major challenges in this study. The SMOTE algorithm was therefore utilized to generate a balanced dataset for classification of stages (Vafaee et al., 2018). It was interesting to note that the models generated with oversampling of the data showed better accuracy as compared to the undersampled data. Perhaps, the undersampling of the data might have resulted in loss of important information and thus reduced accuracy.
The identified miRNAs are detectable in biological fluids: A detailed literature survey suggested that the identified miRNAs are related to diagnostics and prognostics of many cancers. Interestingly, they are also found in biological fluids (plasma, serum, etc.) in several cancers including HNSCC. The miR-29 and let-7 family miRNAs were significantly downregulated (p < 0.001) in the serum of patients with high-risk oral lesions. The miR-196a was found up-regulated in the plasma of HNSCC patients. The miR-486 was found downregulated in blood samples of non-small lung cancer patients. The reduced expression of circulating miRNA-133b was also associated with the clinical stage, metastasis, and survival of breast cancer patients. The upregulation of miR-96 is associated with various cancers such as breast (Zhang et al., 2017), hepatic (Musaddaq et al., 2019, and colorectal (Brunet Vega et al., 2013). The deregulation of miR-519 is linked with the prognosis and diagnosis of lung cancer and it's found to be in circulating in plasma (Wang et al., 2019). The overall literature survey suggested that the identified miRNAs were found dysregulated in biological fluids of patients of various cancers. They can also be useful for the detection and prognosis of HNSCC in biological fluids as well, though experimental validation would be needed to prove this assumption.

CONCLUSION
In this study we have tried to identify miRNA expression patterns for classification of TNM stage (early and late stage) in HNSCC patients using machine learning algorithms (RF, svmR, adaBoost, avNNet and GBM). The dissection of miRNA expression with the help of machine learning tools provided us with miRNA signatures that can distinguish between early and late-stage tumor samples and risk to the patient with good accuracy. The five-year overall survival analysis revealed that the dysregulation of the identified miRNAs is significantly correlated with poor prognosis. We have further identified the functional roles of the identified miRNAs, their mRNA targets, pathways and process that may get perturbed in the HNSCC carcinogenesis and progression. Very interestingly, the identified miRNAs were also found to be differentially expressed in the biological fluids of patients of HNSCC patients. This opens a possibility that these miRNAs can also be proposed as non-invasive biomarkers for HNSCC, although it will need a large number of patient samples and experimental validation. We hope that these studies will help in the development of potential biomarkers and miRNA based therapies in cancer.