A benchmarking study of responsible machine learning models in educational institutions
Abstract
Universities, schools, and other educational institutions are increasingly adopting Machine Learning (ML) for tasks such as student performance prediction and dropout analysis. However, model evaluation is often limited to predictive accuracy, overlooking deployment-critical factors such as interpretability (the ability to understand how and why a model makes decisions), scalability (how well a model handles increasing amounts of data or users), and stability (the consistency of results across different data samples). This study proposes a multidimensional benchmarking framework to evaluate five major educational prediction tasks: Student Retention Prediction, Student Performance Prediction, Budget Prediction, University Ranking Prediction, and Student Enrollment Prediction. We use publicly available datasets and multiple ML models, including Logistic Regression, Linear Regression, Random Forest, XGBoost, and LightGBM. Models are assessed with traditional performance metrics, such as R² (a measure of how well predicted values fit the actual data), RMSE (Root Mean Square Error, a measure of prediction error magnitude), MAE (Mean Absolute Error, the average absolute prediction error), and accuracy (the percentage of correct predictions). We also evaluate SHAP-based interpretability (SHapley Additive exPlanations, a method for explaining model outputs), as well as stability, scalability, and data quality. We develop the Educational AI Readiness Index, which unifies these dimensions into a single score for real-world deployment readiness. Our results show that no single model excels universally; the optimal model varies by task. XGBoost performs best for Retention and Enrollment prediction, Linear Regression for Performance prediction owing to its scalability, and LightGBM scores highest for Ranking and Budget Prediction. SHAP-based analysis shows that high accuracy does not guarantee stable feature attributions, exposing deployment risks that accuracy-only assessments miss.
The University Ranking dataset is the most deployment-ready, while the Budget Prediction dataset remains challenging. These insights underscore the need for multidimensional evaluation frameworks for responsible AI deployment in education.