Predicting cancer probability using habit-based data: A regression approach with Random Forest
Abstract
Cancer remains a leading cause of death worldwide, with around 10 million fatalities in 2022, or nearly one in six deaths. About 30% of these deaths are attributed to lifestyle factors such as alcohol consumption and smoking, which highlights the potential of preventive interventions. Nevertheless, most predictive models rely on binary classification and ignore graded levels of risk. This study employs a Random Forest Regressor to predict a continuous cancer probability (0-1) from habits such as drinking, smoking, riding, walking, and jogging. Using a Kaggle dataset, we performed preprocessing (cleaning, label encoding, standardization), exploratory data analysis (distributions, correlations), and model training with hyperparameter tuning. The model achieved low evaluation error: MAE = 0.0828, MSE = 0.0091, and RMSE = 0.0955. Feature importance analysis identified smoking and drinking as the most significant predictors, accounting for approximately 35% and 30% of total feature importance, respectively. Because the output is continuous, the regression approach enables risk stratification (e.g., low <0.2; high >0.8), which supports tailored prevention and may reduce incidence by 20–30% through targeted lifestyle modifications. Integration with national registries could enhance screening in regions such as the UAE/MENA, where tobacco use has increased by 15% since the COVID-19 pandemic. Future work includes extending the model with multi-omics data and adding SHAP-based interpretability.
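The evaluation metrics and the risk bands mentioned above can be illustrated with a minimal sketch. It is not the study's code: the function names `regression_errors` and `risk_band`, the sample values, and the "moderate" label for the band between the two stated thresholds are assumptions made for illustration. Only the thresholds 0.2 and 0.8 come from the abstract.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return mae, mse, math.sqrt(mse)  # RMSE is the square root of MSE


def risk_band(probability, low=0.2, high=0.8):
    """Map a predicted probability in [0, 1] to a risk band.

    The thresholds follow the abstract (low <0.2; high >0.8).
    The "moderate" label for the middle band is an assumption.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < low:
        return "low"
    if probability > high:
        return "high"
    return "moderate"


# Hypothetical values for illustration only, not data from the study.
y_true = [0.10, 0.50, 0.90, 0.30]
y_pred = [0.15, 0.40, 0.85, 0.30]

mae, mse, rmse = regression_errors(y_true, y_pred)
bands = [risk_band(p) for p in y_pred]
print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f}")
print(bands)
```

The same relationship serves as a consistency check on the reported figures: the square root of the reported MSE of 0.0091 is about 0.0954, which agrees with the reported RMSE of 0.0955 up to rounding of the MSE.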