CESCProg: a compact prognostic model and nomogram for cervical cancer based on miRNA biomarkers

Sangeetha Muthamilselvan; Ashok Palaniappan

doi:10.7717/peerj.15912

Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, Tamil Nadu, India

Published: 2023-09-27
Accepted: 2023-07-26
Received: 2023-04-13

Academic Editor: Nagarajan Raju

Subject Areas: Bioinformatics, Gynecology and Obstetrics, Oncology, Women’s Health, Data Mining and Machine Learning
Keywords: Cervical cancer, miRNA biomarkers, Differential expression, Overall survival, Prognostic nomogram, Risk score modeling, Cancer stages, Multivariate Cox survival analysis, Data mining, Univariate Cox regression

Copyright: © 2023 Muthamilselvan et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

Cite this article: Muthamilselvan S, Palaniappan A. 2023. CESCProg: a compact prognostic model and nomogram for cervical cancer based on miRNA biomarkers. PeerJ 11:e15912 https://doi.org/10.7717/peerj.15912

The authors have chosen to make the review history of this article public.

Abstract

Cervical squamous cell carcinoma, more commonly cervical cancer, is the fourth most common cancer among women worldwide, with a substantial burden of disease, and less-invasive, reliable and effective methods for its prognosis are necessary today. Micro-RNAs are increasingly recognized as viable alternative biomarkers for direct diagnosis and prognosis of disease conditions, including various cancers. In this work, we addressed the problem of systematically developing an miRNA-based nomogram for the reliable prognosis of cervical cancer. Towards this, we preprocessed public-domain miRNA -omics data from cervical cancer patients, and applied a cascade of filters in the following sequence: (i) differential expression criteria with respect to controls; (ii) significance with univariate survival analysis; (iii) passage through dimensionality reduction algorithms; and (iv) stepwise backward selection with multivariate Cox modeling. This workflow yielded a compact prognostic DEmiR signature of three miRNAs, namely hsa-miR-625-5p, hsa-miR-95-3p, and hsa-miR-330-3p, which were used to construct a risk-score model for the classification of cervical cancer patients into high-risk and low-risk groups. The risk-score model was subjected to evaluation on an unseen test dataset, yielding a one-year AUROC of 0.84 and five-year AUROC of 0.71. The model was validated on an out-of-domain, external dataset yielding significantly worse prognosis for high-risk patients. The risk-score was combined with significant features of the clinical profile to establish a predictive prognostic nomogram. Both the miRNA-based risk score model and the integrated nomogram are freely available for academic and not-for-profit use at CESCProg, a web-app (https://apalania.shinyapps.io/cescprog).

Introduction

Cervical cancer (cervical squamous cell carcinoma; CESC) ranks fourth globally among cancers in women, and second among women of reproductive age. Due to unequal implementation of invasive screening techniques, the morbidity and mortality rate of cervical cancer continues to rise in countries like India, where it accounted for 9.4% of all cancers and 18.3% of new cases in 2020 (Sung et al., 2021). Multiple etiological factors contribute to its incidence, including persistent infection of human papilloma virus (HPV) (Bruni et al., 2019), and known lifestyle factors such as excessive smoking and use of contraceptive pills. Cervical cancer tends to be refractory to treatment unless detected early, and its prognosis is vital to quality-of-life expectations. Late diagnoses in the advanced stages of cervical cancer require expensive and complex treatment, with concomitant poor prognoses (Mehrotra & Yadav, 2021). Many gaps remain with respect to cervical cancer screening, diagnosis and prognosis (Jiang et al., 2020), and biomarkers with high specificity and sensitivity are necessary.

MiRNAs exert key control over regulation of gene expression (Li & Rana, 2014), by inducing specific translational repression via target mRNA 3′ UTR deadenylation and decapping. MiRNAs are known to target ∼60% of the transcriptome, thus modulating biological processes (Pedroza-Torres et al., 2016). Their aberrant, differential expression is implicated in various cancers, where they act as either oncogenes (oncomirs) or tumor suppressor genes (mirsupps), regulating tumorigenic processes such as cell maturation, cell proliferation, migration, invasion, apoptosis, and metastasis (Reddy, 2015). MiRNA biomarkers from the serum or cervical mucus could potentially augment systems for early diagnosis, prediction of disease progression, and outcome improvement, in addition to providing prognostic information, with respect to cervical cancer. The US National Cancer Institute launched the Cancer Genome Atlas (TCGA) to characterize different tumor types using -omics platforms, and make raw and processed data available to all researchers (Chandran et al., 2016). In this study, we used the TCGA CESC miRNA -omics dataset to build a validated prognostic risk model and predictive nomogram based on a minimal miRNA signature and the clinical profile. The developed models have been deployed as a freely-available web-app service for non-commercial use at CESCProg (https://apalania.shinyapps.io/cescprog).

Materials and Methods

Portions of this text were previously published as a preprint (Muthamilselvan & Palaniappan, 2023). The workflow is summarised in Fig. 1, and discussed in detail below. R version ‘3.6.3’ (http://www.r-project.org) was used in the study.

Figure 1: The workflow used in this study for the development of a compact validated risk model for cervical cancer prognosis.
The predictive prognostic nomogram was re-built with the full dataset prior to deployment at CESCProg (https://apalania.shinyapps.io/cescprog).

DOI: 10.7717/peerj.15912/fig-1

Data preprocessing

Normalized and log₂-transformed Illumina HiSeq miRSeq data preprocessed with the TCGA miRNA analysis pipeline were obtained from the firebrowse.org portal. The patient barcode of each sample was parsed to annotate the samples as ‘normal’ or ‘cancer’. The corresponding clinical metadata were also retrieved from firebrowse.org (CESC.Merge_Clinical.Level_1.2016012800.0.0.tar) and used to annotate the stage information (encoded in the ‘patient.stage_event.clinical_stage’ variable) of the tumor samples, and then merged with the expression data. The clinical stage is essentially the surgical stage prior to any treatment received, from the biopsy obtained at the time of surgery. Collapsing possible substages (A, B, C) in each stage yielded the four-class macro-progression of stages (I, II, III, IV). Certain demographic and clinical factors in the metadata, including age, HPV status, smoking history, pregnancies, histologic grade and vital status, were retained. The survival data for each sample were derived from the attributes ‘vital status’ and ‘days_to_follow_up’ in the clinical metadata. The merged dataset was then pruned as follows: (i) samples without stage information were removed; and (ii) miRNAs with negligible change in expression across samples (defined as standard deviation of miRNA expression across samples, σ < 1) were also removed.
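The three preprocessing rules described above (barcode-based annotation, substage collapsing, and the σ filter) can be illustrated with a minimal Python sketch. This is not the authors' R code. It assumes the standard TCGA barcode layout, in which the fourth hyphen-separated field begins with a two-digit sample-type code (01 to 09 for tumor, 10 to 19 for normal), and it assumes stage strings of the form ‘stage ib1’; the function names are hypothetical.

```python
import re
import statistics

def sample_type(barcode):
    """Classify a TCGA barcode by the two-digit sample-type code in its 4th field."""
    code = int(barcode.split("-")[3][:2])
    # Assumed TCGA convention: 01-09 are tumor samples, 10-19 are normal samples.
    return "cancer" if code < 10 else "normal"

def macro_stage(clinical_stage):
    """Collapse substages, e.g. 'stage ib1' -> 'I'; return None if stage is missing."""
    if not clinical_stage:
        return None
    # Longer numerals are listed first so that 'iii' is not matched as 'i'.
    m = re.match(r"(?:stage\s+)?(iv|iii|ii|i)", clinical_stage.strip().lower())
    return m.group(1).upper() if m else None

def variable_mirnas(expr, min_sd=1.0):
    """Keep miRNAs whose standard deviation across samples is at least min_sd."""
    return {name: vals for name, vals in expr.items()
            if statistics.stdev(vals) >= min_sd}
```

Samples for which `macro_stage` returns `None` would be dropped, mirroring pruning step (i), and `variable_mirnas` mirrors step (ii).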

Linear modelling

The miRNA expression analyses of cancer stages relative to the normal tissue (controls) were performed using the limma version ‘3.42.2’ package in R (Ritchie et al., 2015). The workflow was essentially adapted from protocols developed in our lab (Sarathi & Palaniappan, 2019). To recapitulate, a linear model of gene expression was fit using controls as intercept and sample stages as indicator variables, according to the following relationship:

$$y = \alpha + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} + \beta_{4} X_{4} \qquad (1)$$

The fit model was adjusted with empirical Bayes to obtain moderated t-statistics (McCarthy & Smyth, 2009). Correction for multiple hypothesis testing was applied using the method of Benjamini and Hochberg to control the false discovery rate, yielding adjusted p-values of the F-statistic of the linear fit (Hochberg & Benjamini, 1990). Based on the fold change (FC) in the expression of individual miRNAs across conditions, miRNAs with |log₂(FC)| > 1.5 and adj. p-value < 0.05 were considered significantly differentially expressed miRNAs (DEmiRs). The preprocessed dataset was then split into train and test datasets in the ratio 0.8:0.2. The test dataset was kept hidden from the model development process and used only for the performance evaluation of the final model.
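As an illustration of the filtering step, the following Python sketch implements the Benjamini-Hochberg adjustment and the two DEmiR thresholds. It is a sketch under stated assumptions and does not reproduce limma's moderated statistics: the p-values are taken as given, and the rule that a miRNA passes if any stage exceeds the fold-change cutoff is an assumption, since the text does not specify how the four stage-wise fold changes are combined.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def is_demir(stage_lfcs, adj_p, lfc_cut=1.5, alpha=0.05):
    """Apply |log2(FC)| > lfc_cut and adj. p-value < alpha.

    Assumption: one stage exceeding the cutoff is sufficient.
    """
    return adj_p < alpha and any(abs(lfc) > lfc_cut for lfc in stage_lfcs)
```

For example, the stage-wise values reported for hsa-miR-21-5p in Table 3 (about 2.63, 2.75, 2.65 and 2.67, with adj. p-value 2.50E−12) satisfy both criteria under this rule.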

Development of compact miRNA signature

Univariate Cox models (Clark et al., 2003) were used to screen the DEmiRs by significance, and only DEmiRs with p-value < 0.05 were retained for further analysis. Two robust feature selection methods, namely least absolute shrinkage and selection operator (LASSO) Cox regression (Tibshirani, 1997) and support vector machine - recursive feature elimination (SVM-RFE) (Adorada et al., 2018), were used in combination to reduce the dimensionality of the prognostic DEmiRs. LASSO, a form of ‘penalized’ regression with L1 penalty, was implemented using R-glmnet version ‘4.1’ (Friedman, Hastie & Tibshirani, 2010), whereas SVM-RFE, which computes ranking weights for all features and then iteratively performs backward selection, was implemented using R-e1071 version ‘1.7.4’ (Dimitriadou et al., 2009). A union of the features selected from these two implementations was taken forward and used in a stepwise multivariate Cox regression (Bradburn et al., 2003) for establishing the prognostic DEmiR signature of cervical cancer.

Prognostic risk model

A risk model was formulated based on the identified prognostic DEmiR signature and used to evaluate the survival risk of each patient. The risk with respect to a unit increase in a given biomarker level is given by the exponential of the miRNA coefficient in the following multivariate Cox model:

$$\text{miRNA\_Risk\_score} = \beta_{1} \times \text{miRNA}_{1} + \beta_{2} \times \text{miRNA}_{2} + \dots + \beta_{n} \times \text{miRNA}_{n} \qquad (2)$$

where n is the size of the prognostic DEmiR signature, miRNA_i denotes the expression level of the ith miRNA, and β_i denotes the effect-size (or weight) of the ith miRNA. Applying the optimal cut-point (i.e., median) given by the maxstat (maximally selected rank) statistic from R-survminer version ‘0.4.8’ (Kassambara, Kosinski & Biecek, 2020) to the risk score distribution, we categorized (binarized) patients with CESC into high-risk and low-risk groups. Kaplan–Meier curves and AUROC were used to compare the overall survival (OS) probabilities of the high-risk and low-risk groups using R survival version ‘3.2.7’ (Therneau & Grambsch, 2000) and R survivalROC version ‘1.0.3’ (Heagerty, Lumley & Pepe, 2000), respectively. The test dataset and an additional external dataset for blind validation were used to evaluate the prognostic value of the developed model.
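For readers unfamiliar with the Kaplan–Meier estimator used to compare the two risk groups, a minimal pure-Python sketch of the product-limit calculation is given here. It is illustrative only and is not the R survival implementation used in the study; the function name and the input encoding (1 for death, 0 for censoring) are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t) at each distinct death time.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) pairs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for tt, _ in data if tt == t)
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= n_at_t  # deaths and censored cases both leave the risk set
        i += n_at_t
    return curve
```

Running the function separately on the high-risk and low-risk groups yields the two curves that a log-rank test would then compare.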

Nomogram construction

Since miRNA-based risk score was unlikely to be the only prognostic predictor for overall survival, the clinical profile was also considered. Both univariate and multivariate Cox regression analyses were performed with some clinical features, namely age, pregnancies, smoking_history, grade, stage, and HPV_status. Only the variables that stayed significant in both the models were selected, and used with the miRNA-based risk score to build an integrated risk model and then a nomogram map that tabulates the probabilities of one-year and five-year OS of CESC. The discrimination was quantified using Harrell’s concordance index (C-index), and calibration performed using bootstrap with 1,000 resamples.
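To make the discrimination measure concrete, the sketch below computes Harrell's C-index by direct pair counting. It is a simplified illustration (ties in survival time are treated as non-comparable), not the routine used by the authors, and the function name is hypothetical.

```python
def harrell_c(times, events, scores):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when patient i died strictly before time j.
    It is concordant when the patient who died earlier has the higher risk score;
    tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.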

Results

The TCGA expression data consisted of expression values of 2,589 miRNAs in 312 samples enrolled in this study, including 309 cervical cancer tissues and three matched normal tissues. The dataset was pruned in accordance with the pre-processing steps outlined in the Methods section, yielding an expression dataset consisting of 467 miRNAs across 303 samples with stage annotation. The distribution of samples according to the AJCC staging system (Amin et al., 2017) is shown in Table 1. The demographic features and clinical characteristics considered, namely age, smoking history, vital status, pregnancies, HPV status, and histologic grade, are summarized in Table 2. Fitting the linear model and applying the filter criteria yielded a total of 101 differentially expressed miRNAs between cervical cancer tissues and matched normal tissues, provided in File S1. Most of the top-ranked miRNAs are overexpressed (e.g., hsa-miR-200c-3p, hsa-miR-141, hsa-miR-200a, hsa-miR-21-5p), suggesting oncomir function with increased epigenetic suppression of target tumor-suppressor gene expression. Table 3 shows the top ten miRNAs with their stage-wise log₂FC and linear model significance.

Table 1:

Distribution of cases by stage.

AJCC staging is represented by the TNM (Tumor-Node-Metastasis) code. Control refers to matched normal samples, and ‘NA’ denotes cases with unavailable stage information.

TCGA stage	TNM classification	#Cases	Stage total
1	T1N0M0	5	163
1A	T1aN0M0	1
1A1	T1a1N0M0	1
1A2	T1a2N0M0	1
1B	T1bN0M0	38
1B1	T1b1N0M0	78
1B2	T1b2N0M0	39
2	T2N0M0	5	70
2A	T2aN0M0	9
2A1	T2a1N0M0	5
2A2	T2a2N0M0	7
2B	T2bN0M0	44
3	T3N0M0	1	46
3A	T3aN0M0	3
3B	T3bN(any)M0	42
4A	T4N(any)M0	9	21
4B	T(any)N(any)M1	12
Control	–	3
NA	–	7

DOI: 10.7717/peerj.15912/table-1

Table 2:

Clinical profile of cervical cancer patients.

Summary of key clinical/demographic features of the dataset. For ordinal/continuous variables (age, smoking_history, and pregnancies), the mean ± standard deviation is given. Histologic grade refers to the degree of differentiation in the cancer sample. It is seen that most cervical cancer patients present with HPV+ status.

Characteristic		Stage I	Stage II	Stage III	Stage IV	‘NA’	Overall
Number of samples		163	70	46	21	7	307
Age (years)		45.9 ± 13.2	49.1 ± 14.2	51.2 ± 13.4	53.3 ± 12.6	58.8 ± 18.8	48.3 ± 13.8
HPV status	Positive	152	63	44	18	7	284
	Negative	11	6	2	3	–	22
	Indeterminate	–	1	–	–	–	1
Smoking history		1.8 ± 1.1	1.7 ± 1.2	1.9 ± 1.1	1.7 ± 1.1	2.7 ± 2.1	1.8 ± 1.2
Pregnancies		3.3 ± 2.1	3.9 ± 3.1	4.1 ± 2.8	3.7 ± 2.4	2.5 ± 2.1	3.6 ± 2.6
Vital status	Alive	135	61	36	8	7	247
Vital status	Dead	28	9	10	13	–	60
Histologic Grade	G I/II	84	34	23	11	2	154
Histologic Grade	G III/IV	65	27	20	5	4	121

DOI: 10.7717/peerj.15912/table-2

Table 3:

Top 10 miRNAs of the linear model.

The log-fold change expression of the miRNA in each stage relative to the controls is given, followed by p-value adjusted for multiple hypothesis testing.

miRNA	Stage I	Stage II	Stage III	Stage IV	adj.P-val
hsa-miR-200c-3p	6.482045	6.481021	6.368438	6.531557	1.96E−20
hsa-miR-141-5p	7.198326	7.176844	6.987161	6.984235	1.96E−20
hsa-miR-141-3p	7.255806	7.393614	7.107695	7.661661	6.69E−20
hsa-miR-200b-5p	6.894887	6.722141	6.676681	6.800084	1.65E−18
hsa-miR-200a-5p	6.6007	6.427489	6.32197	6.630056	3.47E−17
hsa-miR-429	6.457317	6.198339	6.260309	6.738229	1.90E−14
hsa-miR-183-5p	6.65119	6.37214	6.573401	6.790348	1.13E−12
hsa-miR-200a-3p	6.366769	6.32042	6.045745	6.690082	2.50E−12
hsa-miR-21-5p	2.627225	2.74718	2.653042	2.670944	2.50E−12
hsa-miR-182-5p	5.965592	5.618077	5.735565	5.76765	3.67E−12

DOI: 10.7717/peerj.15912/table-3

Each DEmiR was subjected to univariate Cox modeling to evaluate its prognostic significance. This process identified only 52 miRNAs as significantly associated with overall survival, based on p-value < 0.05 (data presented in File S2). To optimize the dimensions of the prognostic miRNA biomarker panel, we applied Lasso-penalized Cox regression on the 52 miRNAs to obtain five miRNAs: hsa-miR-625-5p, hsa-miR-3934-5p, hsa-miR-330-3p, hsa-miR-642a-5p, and hsa-miR-95-3p. Only one miRNA, hsa-miR-616-5p, survived the SVM-RFE feature selection process. Figure 2 shows the union of these results (i.e., the six miRNAs). All the prognostic DEmiRs that passed feature selection were upregulated, but none was an outlier DEmiR.

Figure 2: Volcano plot of the expression distribution of the miRNAs with non-trivial expression patterns in cancer samples relative to controls, highlighting the upregulated and downregulated DEmiRs, and the prognostic DEmiRs post the feature selection process.
Any potential outliers would be located in the top-right space. X-axis denotes log₂(FC) of expression with respect to control, and the Y-axis denotes the −log₁₀ transformation of the p-value significance of the linear model for the respective miRNA.

DOI: 10.7717/peerj.15912/fig-2

The six miRNAs were taken forward for multivariate survival analysis, and subjected to a stepwise backward-selection process, to further compact the miRNA signature. This process yielded an optimal signature of three miRNAs namely hsa-miR-625-5p, hsa-miR-330-3p, and hsa-miR-95-3p, with model p-value <0.002 (Table 4), for construction of the CESC prognostic risk model.

Table 4:

Summary of the Cox survival analysis.

It is seen that hsa-miR-625-5p has a significant protective effect on CESC OS, in contrast with hsa-miR-95-3p and hsa-miR-330-3p. The overall multivariate model is very significant with p-value < 0.002. HR denotes hazard ratio, and CI confidence interval.

Variables	Analysis	Coefficient	HR (95% CI)	P-value
hsa-miR-95-3p	Univariate	−0.84	0.43 (0.24–0.79)	0.0063
hsa-miR-95-3p	Multivariate	0.30	1.35 (1.05–1.73)	0.0197
hsa-miR-625-5p	Univariate	1.4	4.2 (1.3–14)	0.0180
hsa-miR-625-5p	Multivariate	−0.52	0.59 (0.43–0.83)	0.0020
hsa-miR-330-3p	Univariate	−0.68	0.51 (0.28–0.93)	0.0290
hsa-miR-330-3p	Multivariate	0.35	1.42 (0.98–2.03)	0.0608

DOI: 10.7717/peerj.15912/table-4

Notes:

HR: hazard ratio
CI: confidence interval

The CESC prognostic risk model, given by Eq. (2), was then parameterized using the expression of these three miRNAs:

$$\text{miRNA\_Risk\_score} = 0.30 \times \text{hsa-miR-95-3p} + 0.35 \times \text{hsa-miR-330-3p} - 0.52 \times \text{hsa-miR-625-5p} \qquad (3)$$
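As a worked illustration of Eq. (3), the Python sketch below evaluates the risk score for one sample and recovers the multivariate hazard ratios of Table 4 as the exponentials of the coefficients. The expression values in the usage note and the cut-point argument are hypothetical; the maxstat cut-point used in the study is not reported in the text, so it is left as a parameter.

```python
import math

# Multivariate Cox coefficients from Table 4 / Eq. (3).
COEF = {
    "hsa-miR-95-3p": 0.30,
    "hsa-miR-330-3p": 0.35,
    "hsa-miR-625-5p": -0.52,
}

def mirna_risk_score(expr):
    """Linear predictor of Eq. (3) for a dict of miRNA expression values."""
    return sum(beta * expr[name] for name, beta in COEF.items())

def hazard_ratios():
    """exp(coefficient) per miRNA, rounded to two decimals as in Table 4."""
    return {name: round(math.exp(beta), 2) for name, beta in COEF.items()}

def risk_group(score, cutoff):
    """Binarize a score; 'cutoff' is a user-supplied (hypothetical) threshold."""
    return "high-risk" if score > cutoff else "low-risk"
```

For hypothetical expression values of 2.0, 1.0 and 3.0 for hsa-miR-95-3p, hsa-miR-330-3p and hsa-miR-625-5p, the score is 0.30 × 2.0 + 0.35 × 1.0 − 0.52 × 3.0 = −0.61. A coefficient below zero gives a hazard ratio below one, which is why hsa-miR-625-5p lowers the score as its expression rises.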

It is seen that hsa-miR-625-5p has a significant protective effect on CESC OS, whereas the expression of hsa-miR-95-3p and hsa-miR-330-3p elevates the risk. Based on this model, we computed the risk score for each patient in the train dataset, and used the maxstat of the resulting risk-score distribution to separate patients into high- and low-risk groups (Fig. 3A). The Kaplan–Meier survival curve of this distribution revealed significantly worse prognosis in the high-risk group (p-value <1E-4) (Fig. 3B). Time-dependent ROC analysis of the risk-score model on the train dataset for 1-, 2-, 3-, and 5-year overall survival yielded prognostic AUC values of 0.71, 0.72, 0.74 and 0.73, respectively (Fig. 3C). These results encouraged validation of the CESC-related prognostic signature on the test dataset, whose risk-score distribution is shown in Fig. 4A. The following outcomes validated the results: (i) the Kaplan–Meier survival curve showed significantly worse prognosis in the high-risk group (p-value <1E-4) (Fig. 4B); and (ii) time-dependent AUROC values of 0.84, 0.79, 0.71 and 0.71 were obtained for 1-, 2-, 3-, and 5-year overall survival, respectively (Fig. 4C).

Figure 3: Performance of the constructed risk-score model on train dataset.
(A) This panel shows the risk-score value (top), survival status (middle), and expression of the three prognostic miRNAs (bottom) for each patient, sorted by the risk-score distribution. Patients were stratified into low-risk (blue) and high-risk (red) groups according to the risk-score value. The patterns in the expression profiles accord with the signed risk of the respective miRNAs. (B) Kaplan–Meier survival curves based on the three-miRNA prognostic signature showing significant difference between the two groups. (C) Time-dependent ROC curves for 1-, 2-, 3-, and 5-year overall survival predictions using the given model.

DOI: 10.7717/peerj.15912/fig-3

Figure 4: Performance evaluation of the constructed risk-score model on unseen test dataset.
(A) This panel shows the risk-score value (top), survival status (middle), and expression of the three prognostic miRNAs (bottom) for each patient, sorted by the risk-score distribution. Patients were stratified into low-risk (blue) and high-risk (red) groups according to the median risk-score value. (B) Kaplan–Meier survival curves based on the three-miRNA prognostic signature showing significant difference between the two groups. (C) Time-dependent ROC curves for 1-, 2-, 3-, and 5-year overall survival predictions using the given model.

DOI: 10.7717/peerj.15912/fig-4

To validate the prognostic value of the model on an external, out-of-domain dataset, we used the study results of How et al. (2015). This study used a TaqMan Low Density Array (TLDA) to measure expression in formalin-fixed paraffin-embedded (FFPE) cervix samples. Two datasets from the study were used:

(i) Normalized and log₂-transformed miRNA expression data of 87 FFPE cervix samples used for validation, available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4399941. The clinical information was used to annotate the samples, and the expression subset corresponding to the three miRNAs in the optimal risk model (i.e., Eq. (3)) was extracted. Since the miRNA arm information (-3p or -5p) was missing for hsa-miR-625 and hsa-miR-95, the arm-neutral expression values for both these miRNAs were used. The risk score for each sample was calculated based on Eq. (3), and the resulting risk score distribution was stratified into high-risk and low-risk patient groups based on the maxstat statistic computed by R survminer. The survival curves were visualized using Kaplan–Meier analysis, yielding significantly worse prognosis (P = 0.032) in the high-risk patient group relative to the low-risk group (Fig. 5).

Figure 5: Kaplan–Meier survival curves for the validation dataset, showing significantly worse prognosis for the high-risk patient group relative to the low-risk group.
95% confidence bands for the risk groups are also shown.

DOI: 10.7717/peerj.15912/fig-5

Certain clinical features, namely age, HPV_status, pregnancies, smoking_history, histologic_grade, and stage, could boost the prognostic predictive value, and hence were examined for candidate inclusion in the risk model. Each clinical feature was subjected to univariate Cox survival analysis. Surprisingly, neither the patient’s HPV-status nor tumor grade was significant to CESC OS, and moreover only one clinical feature turned out significant, namely the ‘Stage’ representing the distinction between early clinical stage (Stage: I/II) and late clinical stage (Stage: III/IV) cancers. This ‘Stage’ constitutes a risk factor independent of the miRNA-based risk-score, and was used with the miRNA_risk_score to construct an integrated multivariate Cox regression model. In this model, the factor levels of both variables were significant, and the overall multivariate model was extremely significant (p-value ∼4E-05) (Fig. 6). The integrated CESC prognostic risk model was then parameterized as:

$$\text{Integrated\_risk\_score} = 0.64 \times \text{miRNA\_Risk\_score} + 0.74 \times \text{Stage} \qquad (4)$$
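Eq. (4) can be evaluated as in the short Python sketch below. The numeric encoding of ‘Stage’ is an assumption made for illustration (0 for early stage I/II, 1 for late stage III/IV); the text describes the early/late distinction but does not state the coding, so results from the deployed web-app may differ.

```python
LATE_STAGES = {"III", "IV"}
EARLY_STAGES = {"I", "II"}

def integrated_risk_score(mirna_score, macro_stage):
    """Eq. (4): 0.64 * miRNA_Risk_score + 0.74 * Stage.

    Assumption: Stage is coded 0 for early (I/II) and 1 for late (III/IV).
    """
    if macro_stage in LATE_STAGES:
        stage = 1
    elif macro_stage in EARLY_STAGES:
        stage = 0
    else:
        raise ValueError(f"unknown stage: {macro_stage!r}")
    return 0.64 * mirna_score + 0.74 * stage
```

Under this coding, two patients with the same miRNA risk score of −0.61 would receive integrated scores of about −0.39 (early stage) and about 0.35 (late stage).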

Figure 6: Univariate and multivariate Cox regression analyses of patient clinical profile, with respect to CESC OS.

Based on the risk models developed, a nomogram was built to predict one-year and five-year survival probabilities (Fig. 7). The nomogram C-index of the integrated model was estimated as 0.7136 ± 0.047, compared with a C-index of 0.6399 ± 0.044 for the miRNA-based risk model alone, thus differentiating the models on the basis of discrimination. Furthermore, the nomogram calibration plots of the integrated model for one-year and five-year OS probabilities based on bootstrap resampling showed consistency between the predicted and actual survival probabilities (Fig. 8).

Figure 7: Nomogram for reading the overall survival of a CESC sample, according to miRNA_risk_score (Eq. (2)) and clinical stage.

DOI: 10.7717/peerj.15912/fig-7

Figure 8: Nomogram calibration curves.
(A) 1-year OS probability; (B) 5-year OS probability. The four sub-cohorts of the dataset are visualized, and the corresponding x represents the bootstrap-corrected estimates of the nomogram performance along with the standard error. The solid line compares the nomogram performance with the reference truth.

DOI: 10.7717/peerj.15912/fig-8

Discussion

MiRNAs add a layer of critical regulatory control over genomic expression, and aberrations in their expression could lead to the development of cancer hallmarks (Iorio & Croce, 2012). MiRNAs can be detected in the serum, and hold valuable potential as diagnostic and prognostic biomarkers of various cancers, including cervical cancer (Pisarska & Baldy-Chudzik, 2020). Several prognostic miRNAs for CESC have been reported, including miR-31 (Wang et al., 2014), miR-155 (Fang et al., 2016), and miR-425-5p (Sun et al., 2017). However, a systematic hypothesis-free scan for comprehensive miRNA signatures remains missing in the literature. In this study, we have attempted to fill this void with an integrated multi-layered bioinformatics approach to the detection of a reliable prognostic DEmiR biomarker signature. The study has yielded three prognostic miRNAs, namely hsa-miR-625-5p, hsa-miR-95-3p, and hsa-miR-330-3p. Downregulation of hsa-miR-625-5p has been documented in many cancers including non-small cell lung cancer (Dao et al., 2020), hepatocellular carcinoma (Zhou et al., 2015), melanoma (Zou et al., 2019) and cervical cancer (Wang et al., 2020). A causal mechanism relating miR-625-5p expression to inhibition of cervical cancer cell growth via suppression of NF-κB signaling has been reported (Li et al., 2020), consistent with its mirsupp identity disclosed here. Sponging miR-625-5p in turn is likely to drive cervical cancer progression, and this has been demonstrated recently (Li et al., 2023). Jafarzadeh et al. suggested that miR-330-3p promoted pro-tumorigenic events in various cancers like those of the lung, pancreas, bladder, and cervix, and that its downregulation could stall tumor development (Jafarzadeh et al., 2022), both observations consistent with its oncomir identity disclosed here. Further, miR-95-3p has been implicated in activating the Wnt/β-catenin pathway in prostate cancer tissues (Xi et al., 2019), thereby promoting cell proliferation, migration and invasion, consistent with its oncomir identity disclosed here.

To examine the network-level effects of these miRNAs, we retrieved the RNA-Seq transcriptome for each patient in our dataset from firebrowse.org, and correlated these data with the expression of the three miRNAs of interest to infer potential target genes. Target genes with substantial inverse correlation in expression (defined as Pearson ρ or Spearman ρ or Kendall τ < −0.3) were identified, and the consensus with multiMiR version ‘1.8.0’ (Ru et al., 2014) predictions for each of the three miRNAs was investigated. This yielded three consensus target genes with respect to hsa-miR-95-3p, namely NXPH3, BOC, and EID1; two consensus target genes with respect to hsa-miR-625-5p, namely SIN3B and TPRG1L; and two consensus target genes with respect to hsa-miR-330-3p, namely THRA and DYRK2. Functional enrichment analysis of the consensus genes conducted with miRNet (Chang et al., 2020) on GO and KEGG databases yielded significance for cancer pathways and cell cycle regulation. We also used the miR2Trait server (Babu & Palaniappan, 2022) to investigate the diseasome of this three-miRNA signature, and found significance for ‘uterine cervical neoplasm’ (p-value ∼1.5E-3), ‘squamous cell carcinoma’ (p-value ∼7.7E-3), and ‘cervical intraepithelial neoplasia’ (p-value ∼2.2E-2). Detailed results of the above investigations are presented in File S3.
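The inverse-correlation screen described above can be sketched in pure Python as follows. This is an illustrative simplification rather than the authors' pipeline: it checks only the Pearson and Spearman coefficients (Kendall τ is omitted for brevity), the gene identifiers in the usage note are placeholders, and the subsequent intersection with multiMiR predictions is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def candidate_targets(mirna_expr, gene_expr, cutoff=-0.3):
    """Genes whose expression is inversely correlated with the miRNA."""
    return sorted(
        gene for gene, expr in gene_expr.items()
        if pearson(mirna_expr, expr) < cutoff or spearman(mirna_expr, expr) < cutoff
    )
```

With a miRNA profile of `[1, 2, 3, 4, 5]` across five samples, a placeholder gene with profile `[5, 4, 3, 2, 1]` would be flagged as a candidate target, whereas one with profile `[1, 2, 3, 4, 5]` would not.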

Though all stages of cervical cancer were sufficiently represented in the dataset we have used, it is noted that the dataset was imbalanced with respect to controls. To investigate whether the observations in our study generalize when more controls are included in the analysis, we obtained an miRNA-based GEO (Edgar, Domrachev & Lash, 2002) study GSE104758 (Snoek et al., 2019), containing 32 controls and 24 carcinoma in situ CIN3 samples, and concerned with identifying biomarkers for detecting precancerous CIN3 lesions. The Sequence Read Archive data of the 32 controls were imported into Galaxy (The Galaxy Community, 2022), checked for quality with FastQC, and trimmed for adapters. The reads were then mapped to the reference human genome hg_38 to yield genome-wide miRNA quantification data. This dataset was batch-corrected using ComBat from SVA version ‘3.35.2’ (Leek et al., 2012), and used to augment the normals in the CESC TCGA cohort, yielding 35 normal and 300 cancer samples. The corresponding expression matrix was then modelled according to Eq. (1) in limma, as outlined in the Methods section. Using an adj. p-value threshold of ∼0.05 yielded a total of 544 DE miRNAs, including all the 101 original DEmiRs and specifically the three miRNAs of the prognostic miRNA_Risk_score model derived above. The detailed results of the linear model obtained with this controls-augmented dataset are provided in File S4. Additional supporting evidence for our results was then obtained as follows: (i) A contrast between the normals and CIN3 samples of GSE104758 yielded 64 DEmiRs with p-value < 0.05. This was compared with the nine miRNAs in ref. 44 obtained by applying group-regularised ridge regression. Except hsa-let-7b, the remaining eight miRNAs were common to both the analyses, viz. hsa-miR-9-5p, hsa-miR-15b-3p, hsa-miR-20a-5p, hsa-miR-31-5p, hsa-miR-93-5p, hsa-miR-183-5p, hsa-miR-184, and hsa-miR-222-3p. The GSE104758 study further narrowed down to a biomarker signature of just five miRNAs (viz. hsa-miR-93-5p, hsa-miR-15b-3p, and hsa-miR-222-3p, and hsa-let-7b) via multivariate logistic regression with backward selection. Application of SVM-RFE to the 64 DEmiRs obtained from our contrast analysis identified 24 DEmiRs, including all the miRNAs in the GSE104758 biomarker panel except hsa-let-7b. It is notable that the process also yielded hsa-miR-330-3p, ranked ninth among the 24 selected features. (ii) Univariate Cox analysis with the batch-corrected cancer samples in the augmented dataset yielded 241 significant miRNAs, including hsa-miR-95-3p and hsa-miR-625-5p.

Nomograms are widely used for simplifying the task of interpretation from models, and have been constructed with miRNAs for cervical cancer screening (Kotani et al., 2022), prognosis (Liu et al., 2020), and recurrence risk (Bogani et al., 2019). To facilitate the ready prognosis of cervical cancer patients, the models developed in this work were re-built with the full (train + test) dataset, and served as a web-app named CESCProg, deployed at: https://apalania.shinyapps.io/cescprog/ for non-commercial uses. The user provides the form inputs, namely the expression values of the three prognostic DEmiRs and, optionally, the sample staging information. Based on the user request, the app proceeds to classify the risk of the sample, and compute a risk-score based on Eq. (2) or Eq. (4). The calculated risk-score is then looked up against the back-end nomogram to estimate the one-year and five-year survival probabilities. Serum-based or cervical mucus-based miRNAs are minimally invasive, and could be detected and quantified using a range of techniques (e.g., see Baabu et al., 2022).

Conclusions

MiRNA biomarkers are an emerging diagnostic and prognostic aid to the management of disease, especially cancers. Here we present CESCProg, an miRNA-based prognostic model for cervical cancer developed by applying a sequence of purifying filters to the TCGA CESC dataset. All three miRNAs in the panel, namely hsa-miR-95-3p, hsa-miR-330-3p and hsa-miR-625-5p, show upregulation in cervical cancer relative to controls, suggesting feasibility for detection as biomarkers. In the miRNA risk model, hsa-miR-625-5p exhibits a protective effect on OS, while the other two miRNAs elevate the risk. The miRNA risk model was effective and highly significant in stratifying CESC OS on the test dataset. A second risk model was developed with the inclusion of clinical features to maximize nomogram discrimination. This yielded a C-index of 0.7136 ± 0.047. The models have been deployed as a web service to serve as a possible aid to medical decision-making. They are available for non-profit use at: https://apalania.shinyapps.io/cescprog.
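For readers unfamiliar with the C-index reported above, the following minimal sketch computes Harrell's concordance index on toy data. It assumes the simplest form of the statistic, in which a pair is comparable only if the subject with the shorter time experienced the event, and it ignores tied survival times. The data are invented for illustration, and the function name is hypothetical.

```python
def concordance_index(times, events, scores):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had the event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0  # the higher risk score failed earlier
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable


# Toy data: follow-up times, event indicators (1 = death, 0 = censored),
# and risk scores for five hypothetical patients.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.7, 0.3, 0.1]

c_index = concordance_index(times, events, scores)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk, so a C-index near 0.71 indicates that the model orders roughly seven in ten comparable patient pairs correctly.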

Supplemental Information

Linear model DEmiRs from the TCGA CESC dataset applying the lfc and significance criteria

DOI: 10.7717/peerj.15912/supp-1

Significant DEmiRs from Univariate Cox regression analysis based on survival outcomes

DOI: 10.7717/peerj.15912/supp-2

Enrichment and network analyses of DEmiRs using various tools and databases

DOI: 10.7717/peerj.15912/supp-3

Significant Linear Model miRNAs from the controls-augmented TCGA CESC dataset

DOI: 10.7717/peerj.15912/supp-4

[1] Adorada A, Permatasari R, Wirawan PW, Wibowo A, Sujiwo A. 2018. Support Vector Machine—Recursive Feature Elimination (SVM - RFE) for selection of microrna expression features of breast cancer. In: 2018 2nd International Conference on Informatics and Computational Sciences (ICICoS). 1-4

[2] Amin MB, Greene FL, Edge SB, Compton CC, Gershenwald JE, Brookland RK, Meyer L, Gress DM, Byrd DR, Winchester DP. 2017. The Eighth Edition AJCC Cancer Staging Manual: continuing to build a bridge from a population-based to a more personalized approach to cancer staging. CA: A Cancer Journal for Clinicians 67:93-99

[3] Baabu PRS, Srinivasan S, Nagarajan S, Muthamilselvan S, Selvi T, Suresh RR, Palaniappan A. 2022. End-to-end computational approach to the design of RNA biosensors for detecting miRNA biomarkers of cervical cancer. Synthetic and Systems Biotechnology 7:802-814

[4] Babu P, Palaniappan A. 2022. miR2Trait: an integrated resource for investigating miRNA-disease associations. PeerJ 10:e14146

[5] Bogani G, Tagliabue E, Ferla S, Martinelli F, Ditto A, Chiappa V, Leone Roberti Maggiore U, Taverna F, Lombardo C, Lorusso D, Raspagliesi F. 2019. Nomogram-based prediction of cervical dysplasia persistence/recurrence. European Journal of Cancer Prevention 28:435-440

[6] Bradburn MJ, Clark TG, Love SB, Altman DG. 2003. Survival analysis part II: multivariate data analysis—an introduction to concepts and methods. British Journal of Cancer 89:431-436

[7] Bruni L, Barrionuevo-Rosas L, Albero G, Serrano B, Mena M, Gómez D, Muñoz J, Bosch F, De Sanjosé S, Cancer. 2019. Human papillomavirus and related diseases report.

[8] Chandran UR, Medvedeva OP, Barmada MM, Blood PD, Chakka A, Luthra S, Ferreira A, Wong KF, Lee AV, Zhang Z, Budden R, Scott JR, Berndt A, Berg JM, Jacobson RS. 2016. TCGA expedition: a data acquisition and management system for TCGA data. PLOS ONE 11:e0165395

[9] Chang L, Zhou G, Soufan O, Xia J. 2020. miRNet 2.0: network-based visual analytics for miRNA functional analysis and systems biology. Nucleic Acids Research 48:W244-W251

[10] Clark TG, Bradburn MJ, Love SB, Altman DG. 2003. Survival analysis part I: basic concepts and first analyses. British Journal of Cancer 89:232-238

[11] Dao R, Wudu M, Hui L, Jiang J, Xu Y, Ren H, Qiu X. 2020. Knockdown of lncRNA MIR503HG suppresses proliferation and promotes apoptosis of non-small cell lung cancer cells by regulating miR-489-3p and miR-625-5p. Pathology—Research and Practice 216:152823

[12] Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A. 2009. E1071: misc functions of the department of statistics (E1071), TU Wien.

[13] Edgar R, Domrachev M, Lash AE. 2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30:207-210

[14] Fang H, Shuang D, Yi Z, Sheng H, Liu Y. 2016. Up-regulated microRNA-155 expression is associated with poor prognosis in cervical cancer patients. Biomedicine & Pharmacotherapy 83:64-69

[15] Friedman J, Hastie T, Tibshirani R. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33:1-22

[16] Heagerty PJ, Lumley T, Pepe MS. 2000. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56:337-344

[17] Hochberg Y, Benjamini Y. 1990. More powerful procedures for multiple significance testing. Statistics in Medicine 9:811-818

[18] How C, Pintilie M, Bruce JP, Hui AB, Clarke BA, Wong P, Yin S, Yan R, Waggott D, Boutros PC, Fyles A, Hedley DW, Hill RP, Milosevic M, Liu FF. 2015. Developing a prognostic micro-RNA signature for human cervical carcinoma. PLOS ONE 10:e0123946

[19] Iorio MV, Croce CM. 2012. MicroRNA dysregulation in cancer: diagnostics, monitoring and therapeutics. A comprehensive review. EMBO Molecular Medicine 4:143-159

[20] Jafarzadeh A, Paknahad MH, Nemati M, Jafarzadeh S, Mahjoubin-Tehran M, Rajabi A, Shojaie L, Mirzaei H. 2022. Dysregulated expression and functions of microRNA-330 in cancers: a potential therapeutic target. Biomedicine & Pharmacotherapy 146:112600

[21] Jiang Y, Hu Z, Zuo Z, Li Y, Pu F, Wang B, Tang Y, Guo Y, Tao H. 2020. Identification of circulating MicroRNAs as a promising diagnostic biomarker for cervical intraepithelial neoplasia and early cancer: a meta-analysis. BioMed Research International 2020:4947381

[22] Kassambara A, Kosinski M, Biecek P. 2020. Drawing Survival Curves using ’ggplot2’. R package survminer version 0.4.8 software

[23] Kotani K, Iwata A, Kukimoto I, Nishio E, Mitani T, Tsukamoto T, Ichikawa R, Nomura H, Fujii T. 2022. Nomogram for predicted probability of cervical cancer and its precursor lesions using miRNA in cervical mucus, HPV genotype and age. Scientific Reports 12:16231

[24] Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. 2012. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28:882-883

[25] Li H, Zheng S, Wan T, Yang X, Ouyang Y, Xia H, Wang X. 2023. Circular RNA circ_0000212 accelerates cervical cancer progression by acting as a miR-625-5p sponge to upregulate PTP4A1. Anticancer Drugs 34:659-668

[26] Li Y, Ding Y, Ding N, Zhang H, Lu M, Cui X, Yu X. 2020. MicroRNA-625-5p sponges lncRNA MALAT1 to inhibit cervical carcinoma cell growth by suppressing NF-κB signaling. Cell Biochemistry and Biophysics 78:217-225

[27] Li Z, Rana TM. 2014. Therapeutic targeting of microRNAs: current status and future challenges. Nature Reviews Drug Discovery 13:622-638

[28] Liu J, Yang J, Gao F, Li S, Nie S, Meng H, Sun R, Wan Y, Jiang Y, Ma X, Cheng W. 2020. A microRNA-Messenger RNA regulatory network and its prognostic value in cervical cancer. DNA and Cell Biology 39:1328-1346

[29] McCarthy DJ, Smyth GK. 2009. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics 25:765-771

[30] Mehrotra R, Yadav K. 2021. Cervical cancer: formulation and implementation of govt of india guidelines for screening and management. Indian Journal of Gynecologic Oncology 20:4

[31] Muthamilselvan S, Palaniappan A. 2023. CESCProg: a compact prognostic model and nomogram for cervical cancer based on miRNA biomarkers. medRxiv

[32] Pedroza-Torres A, Fernández-Retana J, Peralta-Zaragoza O, Jacobo-Herrera N, Cantú de León D, Cerna-Cortés JF, Lopez-Camarillo C, Pérez-Plasencia C. 2016. A microRNA expression signature for clinical response in locally advanced cervical cancer. Gynecologic Oncology 142:557-565

[33] Pisarska J, Baldy-Chudzik K. 2020. MicroRNA-based fingerprinting of cervical lesions and cancer. Journal of Clinical Medicine 9:3668

[34] Reddy KB. 2015. MicroRNA (miRNA) in cancer. Cancer Cell International 15:38

[35] Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. 2015. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43:e47

[36] Ru Y, Kechris KJ, Tabakoff B, Hoffman P, Radcliffe RA, Bowler R, Mahaffey S, Rossi S, Calin GA, Bemis L, Theodorescu D. 2014. The multiMiR R package and database: integration of microRNA-target interactions along with their disease and drug associations. Nucleic Acids Research 42:e133

[37] Sarathi A, Palaniappan A. 2019. Novel significant stage-specific differentially expressed genes in hepatocellular carcinoma. BMC Cancer 19:663

[38] Snoek BC, Verlaat W, Babion I, Novianti PW, Van de Wiel MA, Wilting SM, Van Trommel NE, Bleeker MCG, Massuger L, Melchers WJG, Sie D, Heideman DAM, Snijders PJF, Meijer C, Steenbergen RDM. 2019. Genome-wide microRNA analysis of HPV-positive self-samples yields novel triage markers for early detection of cervical cancer. International Journal of Cancer 144:372-379

[39] Sun L, Jiang R, Li J, Wang B, Ma C, Lv Y, Mu N. 2017. MicroRNA-425-5p is a potential prognostic biomarker for cervical cancer. Annals of Clinical Biochemistry 54:127-133

[40] Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. 2021. Global Cancer Statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians 71:209-249

[41] The Galaxy Community. 2022. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Research 50:W345-W351

[42] Therneau TM, Grambsch PM. 2000. Modeling survival data: extending the cox model. New York: Springer.

[43] Tibshirani R. 1997. The lasso method for variable selection in the Cox model. Statistics in Medicine 16:385-395

[44] Wang L, Zhong Y, Yang B, Zhu Y, Zhu X, Xia Z, Xu J, Xu L. 2020. LINC00958 facilitates cervical cancer cell proliferation and metastasis by sponging miR-625-5p to upregulate LRRC8E expression. Journal of Cellular Biochemistry 121:2500-2509