Economic insights into credit risk: a deep learning model for predicting loan repayment capacity

PeerJ Computer Science

Introduction

Predicting loan repayment capacity is a critical issue in financial management, particularly in corporate financial management. A borrower’s ability to repay a loan is influenced by a multitude of factors, including socioeconomic, educational, and behavioral elements. These factors vary significantly depending on the context and type of enterprise or business activity. Numerous studies in the academic literature have explored these aspects, contributing to the development of models that aim to predict repayment capacity more accurately.

Specifically, several socioeconomic factors have been found to impact loan repayment capacity. For example, age, education level, marital status, and household characteristics were identified as significant factors influencing the repayment behavior of microfinance clients in Ethiopia. Older borrowers generally demonstrated better repayment capacity due to accumulated experience and more stable sources of income (Abdulahi, Charkos & Haji, 2022). Moreover, household size was found to have a negative impact on repayment ability, suggesting that larger families may have more strained financial resources (Ojiako, 2012). A similar pattern was also observed in studies on cooperative farmers, where factors such as net farm income and the number of years of cooperative membership positively correlated with repayment capacity, while age and level of education showed negative influences (Ilavbarhe, Alufohai & Keyagha, 2017).

In the business environment, income diversification is considered a key factor in enhancing loan repayment capacity: multiple income streams from both agricultural and non-agricultural activities can serve as a financial buffer against shocks, thereby improving borrowers’ ability to adhere to repayment schedules. The study by Okonta, Achoja & Okuma (2023) on poultry, fish, and arable crop farmers in Nigeria demonstrated a clear hierarchy in repayment capacity based on income diversity and the effectiveness of management practices. Microfinance institutions also emphasize the importance of integrating credit assessment mechanisms to evaluate a borrower’s financial health, particularly their ability to generate stable cash flows from multiple income sources (Okero & Waweru, 2023). Credit risk assessment is another crucial factor in predicting loan repayment, involving an evaluation of the borrower’s credit history and financial condition, typically reflected in credit scores obtained from credit reference bureaus (Okero & Waweru, 2023). A systematic evaluation of these elements enables better insight into default likelihood, allowing financial institutions to adjust their lending strategies accordingly.

Computer-based models employing advanced statistical methods, such as logistic regression and machine learning algorithms, have also been applied to improve the accuracy of loan repayment prediction (Abdulahi, Charkos & Haji, 2022). Large-scale modelling has also proven effective for policy-relevant forecasting in other domains, reinforcing the importance of principled model design and validation for risk prediction (Nguyen et al., 2021). Cultural and institutional contexts also significantly shape repayment behavior. The influence of social capital and group lending models has been recognized as instrumental in facilitating repayment success. In Indonesia, community lending models involving collaborative participation among borrowers showed higher repayment rates, largely due to social pressure from group members to fulfill loan obligations (Indriani et al., 2023; Abdirashid & Jagongo, 2019). Additionally, geographical and economic conditions, including credit accessibility, also play a vital role in assessing repayment capacity within microfinance frameworks (Dang et al., 2022; Enimu, Eyo & Ajah, 2017).

In a comparative analysis of different contexts, subtle differences in repayment performance between cooperative members and microfinance beneficiaries highlight the importance of contextual adaptation in financial decision-making (Gudde Jote, 2018). Studies show that when institutions adjust loan terms based on local practices and borrower histories, repayment rates can improve significantly, emphasizing the need for tailored approaches in lending strategies. Moreover, mechanisms for monitoring loan repayments also play a key role in predicting and enhancing repayment capacity. Flexible repayment schedules, as opposed to rigid ones, can facilitate better compliance with repayment obligations. The study by Field & Pande (2008) indicates that while repayment frequency itself may not significantly affect default rates, flexibility in scheduling can promote more sustainable financial behavior. Furthermore, promoting financial literacy among borrowers has been shown to improve their ability to manage loans effectively, leading to better repayment outcomes (Kariuki & Namusonge, 2024).

With increasing demand for credit risk assessment, we propose a deep learning-based approach that integrates diverse information such as socioeconomic characteristics, credit history, behavioral factors, and income structure to provide more accurate predictions of loan repayment capacity. Our design is informed by advances in multimodal representation learning, e.g., unified frameworks that couple sequence encodings with predicted structural views via contrastive objectives to boost generalization (Nguyen et al., 2024), and we analogously integrate heterogeneous borrower signals (demographics, credit history, behavioural traces) within a single architecture. The proposed model not only contributes academically by offering deeper insights into borrower behavior, but also has strong practical value in supporting financial institutions to build more precise and context-adaptive credit risk assessment systems. This represents an important step toward developing sustainable lending strategies, minimizing credit risk, and promoting long-term financial stability.

Beyond its academic contributions, this study also carries significant practical implications. For commercial banks and credit institutions, a more accurate prediction of loan repayment capacity enables the development of risk-sensitive lending policies, reducing default rates and improving portfolio quality. By integrating diverse borrower information into a unified predictive framework, financial institutions can refine credit scoring systems, tailor loan products to specific client segments, and design more flexible repayment plans that align with borrower conditions. Furthermore, in the broader context of financial risk management, such predictive models help institutions comply with regulatory requirements, strengthen early-warning mechanisms, and enhance resilience against systemic risks, thereby fostering a more sustainable and inclusive financial ecosystem.

Related work

The rapidly growing fields of machine learning and deep learning have witnessed numerous important applications in predicting loan repayment capacity. Recent studies emphasize the effectiveness of various algorithms in enhancing prediction accuracy, thereby improving decision-making processes in loan approvals. A prominent study by Abdulahi, Charkos & Haji (2022) demonstrated that the implementation of machine learning techniques, particularly the Random Forest algorithm, significantly improved the accuracy of predicting the repayment capability of small and medium-sized enterprises in Malaysian financial institutions. This aligns with the findings of Addo, Guegan & Hassani (2018), who asserted that machine learning and deep learning models have considerably enhanced the credit risk analysis framework through advanced data-driven predictive capabilities. The integration of these models has enabled the effective utilization of large datasets, allowing for more accurate assessments of default probabilities compared to traditional methods. Furthermore, the study by Wu (2022) highlights the increasing reliance of banks on machine learning to improve both the speed and accuracy of default predictions, signaling a pivotal shift from manual evaluations to automated systems. Jumaa, Saqib & Attar (2023) emphasized the potential of deep learning methods in refining credit risk assessment frameworks, asserting that consumer loan default prediction models developed using these techniques deliver higher accuracy and better identification of borrower risks. For example, research by Liu et al. (2023) demonstrated that Bayesian deep learning can significantly enhance predictive outcomes by identifying subtle variations in default indicators, factors that are critical for lenders. The optimal application of different machine learning models is also discussed by Zhou (2023), who compared the performance of models such as Logistic Regression, Decision Trees, and XGBoost, thereby illustrating their capabilities in contextualizing borrowers’ asset values and financial conditions.

Additionally, alternative approaches have emerged in recent literature. For instance, Netzer, Lemaire & Herzenstein (2019) focused on textual analysis of loan applications, showing that qualitative insights extracted from applicants’ textual data can substantially enhance the assessment of repayment behavior. This innovative intersection between natural language processing and financial forecasting reflects a trend of complementing traditional numerical analysis with qualitative indicators, providing a more holistic view of borrower risk. Similarly, Kvamme et al. (2018) integrated diverse methodological frameworks, employing convolutional neural networks (CNNs) to predict mortgage defaults with impressive results, showcasing the flexibility of deep learning models in handling complex datasets. The study by Xu, Lu & Xie (2021) on P2P lending emphasized the benefits of using machine learning methods in assessing repayment risks, effectively leveraging robust datasets to fine-tune prediction criteria. As the field continues to advance, blending machine learning with deep learning appears to be an emerging trend. For instance, Li, Tian & Zhuo (2023) explored blended models to improve the accuracy and stability of credit default predictions. Interest in ensemble methods is also on the rise, with Hamori et al. (2018) comparing ensemble learning methods with various neural network architectures. Their results indicate that although deep learning provides advanced predictive capabilities, ensemble methods like Random Forest still offer competitive accuracy and reliability. Such integrated approaches demonstrate how traditional algorithms can be enhanced by modern technologies, providing financial institutions with more powerful toolsets for predicting repayment capacity.

In practical applications, the research by Bature et al. (2023) illustrates the development of a credit default prediction system using machine learning, clearly showing how theoretical models can be operationalized into useful applications for financial institutions. This transition from model to real-world implementation is crucial, reflecting the growing trend of applying advanced analytics in day-to-day lending processes. However, challenges remain. The need for more robust feature extraction techniques in deep learning, as discussed by Aniceto, Barboza & Kimura (2020), highlights the inherent complexity of credit data. This complexity calls for ongoing research and innovation in feature selection and extraction as fundamental components in the development of reliable predictive models. As the landscape of credit risk assessment continues to evolve, deepening the integration of machine learning and deep learning technologies holds great promise for achieving higher predictive accuracy in loan repayment forecasting. Studies by Turiel & Aste (2019) recognize the potential of AI-based models to recalibrate lending practices, underscoring the importance of technological adaptation in modern financial services.

Material and methodology

Data description and pre-processing

We utilized a dataset from LendingClub (https://www.kaggle.com/datasets/jeandedieunyandwi/lending-club-dataset), which includes both approved and rejected loan applications. This dataset serves as a valuable resource for analyzing credit risk and predicting loan repayment behavior. It contains borrower demographic information, loan characteristics, and creditworthiness indicators such as FICO scores. Key features include loan amount, interest rate, loan term, borrower income, debt-to-income ratio, FICO score, loan purpose, and repayment status. This dataset supports various tasks such as credit risk modeling, loan approval prediction, and economic behavior analysis.

Given the complexity of the original data, we applied several preprocessing techniques before using it for model training and evaluation:

  • Feature integration: The original data sources consisted of multiple files containing complementary information. We merged these sources into a single dataset by matching records using a unique borrower identifier.

  • Handling missing and invalid values: The dataset contained a significant number of missing values (NaN) and invalid entries (e.g., abnormally large values). Missing values were imputed using the mean of the respective feature. For invalid entries, we applied a filtering strategy: if more than 30% of the records in a feature were invalid, the feature was discarded; otherwise, only the affected rows were removed.
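As a concrete illustration of this pipeline, the following is a minimal pandas sketch; the file names, the borrower_id key, and the validity threshold are hypothetical placeholders rather than the exact values used in our implementation.

```python
import pandas as pd

# Hypothetical file and column names for illustration only.
loans = pd.read_csv("loans.csv")
credit = pd.read_csv("credit_history.csv")

# Feature integration: merge complementary sources on a unique borrower ID.
df = loans.merge(credit, on="borrower_id", how="inner")

# Missing values: impute with the mean of the respective numeric feature.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Invalid entries: flag abnormally large values (the 99.9th percentile is
# an assumed threshold), drop any feature with >30% invalid records, and
# otherwise drop only the affected rows.
upper = df[num_cols].quantile(0.999)
invalid = df[num_cols].gt(upper, axis=1)
bad_cols = invalid.columns[invalid.mean() > 0.30]
df = df.drop(columns=bad_cols)
keep = [c for c in invalid.columns if c not in bad_cols]
df = df[~invalid[keep].any(axis=1)]
```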

After analyzing the processed dataset, we observed an imbalance between negative and positive labels. The number of data points with negative labels was significantly higher than those with positive labels. There are many potential issues and risks associated with using an imbalanced dataset. If we train a classifier on such data, the model may classify all minority class instances as belonging to the majority class, potentially resulting in high test accuracy. However, such a model could be meaningless, as it would lack the ability to recognize samples from the minority class.
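To make this risk concrete, the following sketch on synthetic labels with an assumed 90/10 class split shows a classifier that always predicts the majority class reaching roughly 90% accuracy while having no discriminative power, which balanced metrics such as F1 and MCC expose immediately.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.10).astype(int)  # 10% minority positives
y_pred = np.zeros_like(y_true)                    # always predict majority

print("ACC:", accuracy_score(y_true, y_pred))             # ~0.90
print("F1: ", f1_score(y_true, y_pred, zero_division=0))  # 0.0
print("MCC:", matthews_corrcoef(y_true, y_pred))          # 0.0
```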

After preprocessing the data, we randomly split the dataset into three subsets: training, validation, and test, with respective ratios of 0.6, 0.1, and 0.3. The purpose of this split is to train the model effectively while ensuring reliable evaluation and tuning. The training set is used to fit the model parameters, the validation set is used to optimize hyperparameters and prevent overfitting, and the test set is used to assess the final performance of the model on unseen data. The statistical distribution of the dataset across these subsets is illustrated in Fig. 1.
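A minimal sketch of this split follows, where X and y denote the processed features and labels; the variable names and the fixed random_state are illustrative.

```python
from sklearn.model_selection import train_test_split

# Carve off the 30% test set first, then split the remaining 70% so that
# 1/7 of it (i.e., 10% of the full data) becomes the validation set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.10 / 0.70, random_state=42)
```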

Figure 1: Data distribution.

Motivation behind model architecture

The task of predicting loan repayment capacity involves analyzing heterogeneous tabular data composed of both categorical and numerical features with complex interdependencies. Despite the advances surveyed above, prior research still exhibits several important limitations. Most existing studies focus on conventional machine learning or deep learning models such as Random Forests, eXtreme Gradient Boosting (XGBoost), CNNs, or ensemble methods, which often require extensive feature engineering and are limited in capturing the complex cross-feature dependencies inherent in tabular credit data. While some works explore innovative directions, such as textual analysis or blended modeling, they generally do not address the challenge of simultaneously integrating heterogeneous feature types within a unified framework. Moreover, the interpretability of deep learning approaches remains underexplored, restricting their adoption in regulatory and high-stakes financial contexts. These gaps motivate the design of our Gated Transformer architecture, which aims not only to achieve high predictive performance but also to enhance transparency and adaptability across diverse real-world credit datasets.

We select the proposed Gated Transformer architecture due to its inherent ability to model interactions among diverse feature types via multi-head self-attention mechanisms. Unlike conventional models, the Transformer-based approach processes features in a contextual manner, dynamically weighting their contributions based on learned dependencies. Furthermore, the introduction of a token-wise gating mechanism enhances interpretability and allows the model to selectively emphasize relevant features, which is crucial for economic applications where understanding feature importance is key. This architecture also effectively integrates categorical embeddings and normalized numerical inputs within a unified latent space, facilitating end-to-end learning without extensive preprocessing. Overall, the Gated Transformer provides a flexible and powerful framework to extract meaningful economic insights from complex credit data, improving predictive performance while maintaining interpretability necessary for responsible financial decision-making.

Proposed architecture

Our proposed method is a Transformer-based architecture tailored for tabular data, consisting of three main stages: Feature Encoding, Gated Transformer Processing, and a Prediction Module, as illustrated in Fig. 2. The design effectively integrates both categorical and numerical features into a unified embedding space and leverages gated self-attention mechanisms to capture complex feature interactions.

Figure 2: Model architecture visualization.

Encoding data

To encode the data before training the model, we construct a processing block called the Feature Encoder. Figure 3 illustrates the data encoding process. Categorical features $x_{\mathrm{cat}}$ are tokenized and mapped to embedding vectors via an embedding table:

$\tilde{e}_c = \mathrm{Embedding}(\mathrm{Tokenize}(x_{\mathrm{cat}})) \in \mathbb{R}^{d_{\mathrm{emb}}}.$

Figure 3: Details of feature encoder.

Numerical features $x_{\mathrm{num}}$, typically scalars or low-dimensional vectors, are first normalized and then projected into the embedding space through a shared linear layer:

$\tilde{e}_n = W_n\,\mathrm{LayerNorm}(x_{\mathrm{num}}) + b_n, \qquad W_n \in \mathbb{R}^{d_{\mathrm{emb}} \times d_{\mathrm{in}}}.$

All features are then combined via concatenation, along with a special embedding token $e_{[\mathrm{cls}]}$:

$E = \big[\tilde{e}_c \,\|\, \tilde{e}_n \,\|\, e_{[\mathrm{cls}]}\big] \in \mathbb{R}^{(n+m+1) \times d_{\mathrm{emb}}},$

where $n$ is the number of categorical features and $m$ is the number of numerical features.
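A PyTorch sketch of this encoder follows. It is a minimal reading of the equations above, assuming the shared linear layer maps each normalized numerical scalar to the embedding dimension; the class and argument names are our own.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Sketch of the feature encoder; assumes the shared linear layer
    projects each normalized numerical scalar to d_emb."""

    def __init__(self, cardinalities, n_num, d_emb):
        super().__init__()
        # One embedding table per tokenized categorical feature.
        self.cat_emb = nn.ModuleList(nn.Embedding(c, d_emb) for c in cardinalities)
        self.norm = nn.LayerNorm(n_num)        # normalize numerical features
        self.num_proj = nn.Linear(1, d_emb)    # shared projection (W_n, b_n)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_emb))  # [cls] token

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n) integer tokens; x_num: (batch, m) floats.
        e_c = torch.stack([emb(x_cat[:, i])
                           for i, emb in enumerate(self.cat_emb)], dim=1)
        e_n = self.num_proj(self.norm(x_num).unsqueeze(-1))
        cls = self.cls.expand(x_cat.size(0), -1, -1)
        return torch.cat([e_c, e_n, cls], dim=1)  # (batch, n+m+1, d_emb)
```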

Gated transformers

The Gated Transformer is a variant of the Transformer (Vaswani et al., 2017) architecture originally introduced in the field of natural language processing. It is modified and fine-tuned to better suit the characteristics of tabular data features. The Gated Transformer consists of two main components: multi-head self-attention layers and gated feedforward layers. The input representation Z is first processed by a multi-head self-attention mechanism to capture interactions among features:

$Z_{\mathrm{att}} = \mathrm{MultiHeadAttn}(Z) = [\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_h]\, W^{O},$

where

$\mathrm{head}_i = \mathrm{Attention}\big(Z W_i^{Q},\, Z W_i^{K},\, Z W_i^{V}\big).$

Here, $Z = E$ corresponds to the input embedding of tabular data described in Section Encoding Data, $W^{O} \in \mathbb{R}^{d \times d}$ is the output projection matrix, and $\{W_i^{Q}, W_i^{K}, W_i^{V}\} \subset \mathbb{R}^{d \times d_h}$ are the learnable projection matrices for the $i$-th attention head. The attention output $Z_{\mathrm{att}}$ is then modulated by a token-wise gating mechanism. Specifically, a gating vector $g$ is computed as:

$g = \sigma(Z_{\mathrm{att}}\, w_G),$

where $\sigma(\cdot)$ denotes the sigmoid function, $w_G \in \mathbb{R}^{d \times 1}$ is a learnable parameter, and $g \in [0,1]^{n}$ controls the activation strength of each token embedding. The gated output is then passed through two linear projections and combined as follows:

$Z_{\mathrm{out}} = \mathrm{Linear}(g \odot Z_{\mathrm{att}}) \oplus \mathrm{Linear}(Z_{\mathrm{att}}),$

where $\odot$ denotes element-wise multiplication and $\oplus$ indicates residual addition. The resulting output $Z_{\mathrm{out}} \in \mathbb{R}^{n \times d}$ serves as the input to the MLP prediction layer.
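The following PyTorch sketch captures one Gated Transformer layer as defined above; the class name and the use of nn.MultiheadAttention are our assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class GatedTransformerBlock(nn.Module):
    """One gated self-attention layer: multi-head attention, a token-wise
    sigmoid gate, and two linear projections combined by residual addition."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, 1, bias=False)  # w_G, no bias per Eq.
        self.proj_gated = nn.Linear(d_model, d_model)  # Linear(g ⊙ Z_att)
        self.proj_skip = nn.Linear(d_model, d_model)   # Linear(Z_att)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z_att, _ = self.attn(z, z, z)        # self-attention over tokens
        g = torch.sigmoid(self.gate(z_att))  # gating vector in [0, 1]
        return self.proj_gated(g * z_att) + self.proj_skip(z_att)
```

Stacking several such blocks over the encoder output $E$, with an MLP head applied to the resulting representation, completes the prediction pipeline.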

Experimental setup

Traditional ML models

We explore various machine learning models to predict borrower repayment abilities. The algorithms considered are Random Forests, Naive Bayes, Light Gradient Boosting Machine (LightGBM), K-Nearest Neighbors (KNN), XGBoost, and CatBoost. Some of these models were used in previous studies, while others were selected for their expected performance on this task. We justify the choice of each algorithm below and present the corresponding experimental results in the following sections.

  • Random Forests (Breiman, 2001) are ensemble learning methods particularly suited for handling imbalanced datasets. By aggregating predictions from a large number of randomized decision trees, Random Forests reduce model variance and mitigate overfitting. The final prediction is made through majority voting (classification) or averaging (regression), formally expressed as: $\hat{y} = \mathrm{majority\_vote}\{h_b(x)\}_{b=1}^{B}$ or $\hat{y} = \frac{1}{B}\sum_{b=1}^{B} h_b(x)$.

  • Naive Bayes (Webb, 2011), in contrast, adopts a probabilistic and generative perspective. Assuming conditional independence among features given the class label, it models the joint probability distribution $P(X, Y)$. According to Bayes’ theorem: $P(Y = y \mid X_1, \dots, X_n) \propto P(Y = y)\prod_{i=1}^{n} P(X_i \mid Y = y)$.

  • LightGBM (Ke et al., 2017) represents a new generation of gradient boosting algorithms, engineered for extreme efficiency and scalability. Unlike traditional Gradient Boosting Decision Trees that grow trees level-wise, LightGBM grows trees leaf-wise, selecting leaves that yield the largest loss reduction. This strategy substantially accelerates training time and reduces memory consumption, often achieving state-of-the-art results in multi-class classification and ranking problems.

  • K-Nearest Neighbors (Kramer, 2013) offers a fundamentally different approach. It is a non-parametric, instance-based learning algorithm that classifies a data point based on the majority label among its $k$ nearest neighbors. Distance between points is often measured using the Euclidean metric: $d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}$.

  • XGBoost (Chen & Guestrin, 2016) stands as a highly optimized implementation of gradient boosting, celebrated for its speed and predictive performance. Unlike basic boosting algorithms, XGBoost explicitly incorporates regularization terms to control model complexity. The objective function typically takes the form: $\mathcal{L}(\phi) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$, where $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$ penalizes the number of leaves $T$ and the leaf weights $w$.

  • CatBoost (Dorogush, Ershov & Gulin, 2018) is a gradient boosting algorithm on decision trees, specifically designed to handle categorical features efficiently without extensive preprocessing. Unlike traditional boosting methods, CatBoost employs an ordered boosting strategy to reduce prediction shift and prevent overfitting. Additionally, it uses target statistics with permutation-driven schemes for encoding categorical variables, ensuring unbiased estimation.

By evaluating these models, we aim to identify the most suitable algorithm for predicting borrower repayment behavior, balancing accuracy, interpretability, and computational efficiency.
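For reference, these baselines can be instantiated as in the sketch below; the hyperparameter values shown are placeholders (tuned ranges are given in Table 1), and X_train, y_train, X_val, y_val follow the split described earlier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

baselines = {
    "Random Forest": RandomForestClassifier(n_estimators=200, max_depth=20),
    "Naive Bayes": GaussianNB(),
    "LightGBM": LGBMClassifier(num_leaves=100, learning_rate=0.05),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "XGBoost": XGBClassifier(max_depth=6, learning_rate=0.1),
    "CatBoost": CatBoostClassifier(depth=6, iterations=100, verbose=0),
}

for name, model in baselines.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_val, y_val))  # validation accuracy
```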

Advanced tabular models

Recent advancements in deep learning have markedly improved predictive capabilities for tabular data tasks. In this study, we conduct a comparative analysis of several advanced tabular deep learning architectures, each addressing challenges inherent to tabular datasets, including sparsity and overfitting, while striving to maximize predictive fidelity. We implement these architectures on our loan repayment dataset and evaluate the performance of the proposed model against them.

  • Multilayer Perceptron (MLP) (Lippmann, 1994): As a widely adopted baseline for tabular data modeling, the MLP leverages fully connected layers with nonlinear activation functions to learn feature interactions. Despite their simplicity and computational efficiency, MLPs often struggle to capture complex dependencies between heterogeneous features compared to more advanced architectures. Nonetheless, they remain a strong reference point for benchmarking due to their robustness and ease of implementation.

  • TabNet (Arik & Pfister, 2019): TabNet leverages a sequential attention mechanism to dynamically prioritize relevant features at each decision step, offering an elegant trade-off between predictive power and interpretability. By enforcing sparse attention, TabNet remains resistant to overfitting and enables human-understandable model explanations. Critical hyperparameters such as learning rate and batch size play pivotal roles in governing its convergence behavior and generalization performance. The model’s architecture is particularly well-suited for tasks demanding feature importance transparency.

  • TabTransformer (Huang et al., 2020): In adapting transformer architectures to the domain of tabular data, TabTransformer introduces column embedding and self-attention layers to capture intricate feature dependencies more effectively than conventional MLP-based approaches.

  • TabPFN (Hollmann et al., 2022): Characterized by its meta-learned, probabilistic foundation, TabPFN stands out by offering rapid and competitive predictions with minimal hyperparameter tuning requirements. Particularly advantageous for small or noisy datasets, TabPFN circumvents overfitting tendencies commonly observed in deep models.

  • FT-Transformer (Gorishniy et al., 2021): FT-Transformer revisits the transformer paradigm with feature tokenization to process tabular data as sequences of tokens, each corresponding to an input feature. This allows the model to fully exploit the attention mechanism across heterogeneous feature types. Empirical studies demonstrate that FT-Transformer often matches or surpasses tree-based models like XGBoost on benchmark datasets.

  • SAINT (Somepalli et al., 2021): SAINT advances the field by introducing both self-attention and inter-sample attention mechanisms to better model feature interactions not only within a single instance but also across the dataset. This dual-attention framework enriches the learned representations and captures complex global dependencies, thereby enhancing predictive accuracy.

  • AutoInt (Song et al., 2018): AutoInt leverages the self-attention mechanism to automatically learn high-order feature interactions in tabular data. By stacking multiple interacting layers equipped with residual connections, the model adaptively captures both low- and high-order dependencies without extensive manual feature engineering. The attention-based design not only enhances interpretability by highlighting influential feature interactions but also demonstrates strong performance across diverse real-world tabular prediction tasks.

Tuning parameters

To ensure a fair comparison of model performance, we conducted hyperparameter tuning using the random search strategy. For each machine learning model, a range of critical hyperparameters was systematically explored to identify the optimal configuration. Table 1 summarizes the machine learning models employed in the loan repayment capacity prediction task, together with their key hyperparameters and corresponding tuning ranges. Additionally, Table 2 provides the hyperparameter tuning details for advanced tabular models applied to this task. The hyperparameters of these advanced models were selected based on insights from relevant research articles and official implementation source codes.
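As an illustration, the Random Forest search over the grid in Table 1 can be run with scikit-learn's RandomizedSearchCV; the iteration count, fold count, and AUROC scoring here are our assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {                        # grid from Table 1
    "n_estimators": [100, 200, 500],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10, cv=3, scoring="roc_auc", random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```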

Table 1:
Hyperparameters for machine learning models.
Model Hyperparameter Tuning values
Random Forest n_estimators 100, 200, 500
max_depth 10, 20, 30
min_samples_split 2, 5
Naive Bayes var_smoothing 1e−9, 1e−8, 1e−7, 1e−6
priors [0.5, 0.5]
LightGBM num_leaves 50, 100, 150
learning_rate 0.01, 0.05, 0.1, 0.2
max_depth −1, 10, 20, 30
feature_fraction 0.7, 0.8, 0.9, 1.0
K-Nearest Neighbors n_neighbors 3, 5, 7, 9
metric euclidean, manhattan
weights uniform, distance
XGBoost learning_rate 0.01, 0.1, 0.2, 0.3
max_depth 3, 6, 9, 12
subsample 0.5, 0.7, 1.0
CatBoost learning_rate 0.01, 0.1, 0.2, 0.3
depth 3, 6, 9, 12
iterations 50, 100, 150
DOI: 10.7717/peerj-cs.3339/table-1
Table 2:
Hyperparameters for advanced tabular models.
Model Hyperparameter Tuning values
MLP Learning rate 0.01–0.05
Dimension 256, 512
Number of layers 3, 5
Activation function ReLU, LeakyReLU
TabNet Learning rate 0.01–0.05
Batch size 512, 2048
Decision steps 3, 10
Sparsity regularization 1e−5, 1e−3
TabTransformer Number of attention heads 4–8
Hidden size 128, 512
Number of transformer layers 4, 8
Learning rate 1e−4, 1e−3
TabPFN Learning rate 1e−4, 1e−3
Hidden size 128, 512
Number of attention heads 4, 8
FT-Transformer Embedding dimension 32, 128
Number of transformer layers 2, 4
Number of attention heads 4, 8
Dropout rate 0.1, 0.3
SAINT Learning rate 1e−4–3e−4
Number of attention heads 4, 8
Dropout rate 0.1, 0.3
Number of layers 4, 8
AutoInt Learning rate 1e−4–3e−4
Number of attention heads 4, 8
Embedding dimension 128, 256
DOI: 10.7717/peerj-cs.3339/table-2

Evaluation metrics

In this section, we introduce multiple evaluation metrics to assess model performance, each providing different insights into prediction quality. The formulas of the metrics are described as follows.

  • Area Under the Receiver Operating Characteristic Curve (AUROC): $\mathrm{AUROC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR})\, d\mathrm{FPR}$

  • Area Under the Precision-Recall Curve (AUPRC): $\mathrm{AUPRC} = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall})\, d\mathrm{Recall}$

  • Accuracy (ACC): $\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$

  • Matthews Correlation Coefficient (MCC): $\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

  • F1-score: $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
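All five metrics can be computed with scikit-learn as in the sketch below, where y_true holds the test labels, y_score the predicted positive-class probabilities, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_pred = (y_score >= 0.5).astype(int)  # thresholded hard labels

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPRC:", average_precision_score(y_true, y_score))  # AUPRC estimate
print("ACC:  ", accuracy_score(y_true, y_pred))
print("MCC:  ", matthews_corrcoef(y_true, y_pred))
print("F1:   ", f1_score(y_true, y_pred))
```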

Computational environment

All experiments were conducted on a machine equipped with an Intel Core i9-12900K processor, 64 GB of RAM, and an NVIDIA RTX 3090 GPU. The operating system used was Ubuntu 20.04 LTS. The software environment consisted of Python 3.9, with key libraries including Scikit-learn (version 1.6) and PyTorch (version 2.0.0). Model training and evaluation were performed using Visual Studio Code with integrated support for Jupyter notebooks. The CUDA toolkit version 11.4 was employed to leverage GPU acceleration for deep learning tasks. On average, the training process for our model required approximately 40 min, while inference latency per sample was around 215 milliseconds, ensuring suitability for real-time applications. During training, RAM consumption remained stable at around 4 GB.

Results and discussion

Ablation study

To validate the contribution of each component in our proposed architecture, we performed a detailed ablation study. Specifically, we investigated the effect of three critical modules: (1) Feature Encoder, (2) the Multi-head Attention, and (3) the Gated Transformer layers. For each ablation setting, we systematically removed or replaced the corresponding component and evaluated the model under the same experimental setup as described earlier.

As presented in Table 3, removing the Feature Encoder and directly feeding raw features into the Gated Transformer resulted in a notable drop in AUROC (0.8291 vs. 0.8517) and ACC (0.9028 vs. 0.9238), confirming the necessity of embedding alignment across heterogeneous feature types. Excluding the Multi-head Attention from the Gated Transformer led to a significant performance degradation across all metrics, particularly MCC (0.6623 vs. 0.6942), indicating that attention plays an essential role in stabilizing feature interactions. Finally, replacing the Gated Transformer with a standard Transformer reduced both F1 (0.9471 vs. 0.9617) and AUPRC (0.9054 vs. 0.9213), suggesting its effectiveness in capturing global contextual information for robust classification. Overall, these results demonstrate that each component contributes meaningfully to the final performance, with the gated self-attention mechanism providing the largest performance gain. This highlights the importance of both architectural refinements and embedding strategies in achieving good results for tabular data modeling.

Table 3:
Ablation study on key components of the proposed model.
Model variant AUROC AUPRC ACC MCC F1
Without feature encoder 0.8291 0.9017 0.9028 0.6512 0.9462
Without multi-head attention 0.8356 0.9078 0.9102 0.6623 0.9501
Gated Transformer → Transformer 0.8402 0.9054 0.9157 0.6705 0.9471
Full model (Ours) 0.8517 0.9213 0.9238 0.6942 0.9617
DOI: 10.7717/peerj-cs.3339/table-3

Results of ML models comparison

Table 4 presents the comparative performance of various machine learning models against our proposed method across multiple evaluation metrics. Overall, our model consistently outperforms all baselines. For AUROC, our approach achieves a score of 0.8517, a significant improvement over the second-best model, CatBoost, which attains 0.7636. Similarly, for AUPRC, our model records the highest value of 0.9213, surpassing all other methods, with CatBoost again the closest at 0.9012, followed by Naive Bayes at 0.8929. In addition, our method also demonstrates superior classification performance, achieving the highest ACC (0.9238), MCC (0.6942), and F1 (0.9617). Notably, while models such as LightGBM and XGBoost show competitive results, particularly in ACC and F1, they still lag behind our model by a noticeable margin across all metrics. These results collectively highlight the robustness and effectiveness of our proposed method compared to existing baseline models.

Table 4:
Performance comparison between baseline models and our proposed method.
Model AUROC AUPRC ACC MCC F1
Random Forest 0.7241 0.8819 0.8889 0.6119 0.9351
Naive Bayes 0.7514 0.8929 0.8641 0.5412 0.9173
LightGBM 0.7289 0.8837 0.8899 0.6153 0.9356
K-Nearest Neighbors 0.7125 0.8778 0.8632 0.5128 0.9187
XGBoost 0.7345 0.8859 0.8891 0.6109 0.9348
CatBoost 0.7636 0.9012 0.9091 0.6445 0.9438
Ours 0.8517 0.9213 0.9238 0.6942 0.9617
DOI: 10.7717/peerj-cs.3339/table-4

Results of advanced models comparison

In order to further assess the effectiveness of our proposed method, we conducted a comprehensive comparison against several advanced models specifically designed for tabular data. As shown in Table 5, our model achieves an AUROC of 0.8517, outperforming the best-performing baseline, FT-Transformer (0.8294), by a margin of 2.7%. Similarly, for AUPRC, our method records 0.9213, compared to 0.9027 for SAINT and 0.8932 for AutoInt. Moreover, our model exhibits the highest ACC (0.9238), outperforming TabPFN (0.9056) and FT-Transformer (0.9031). For MCC and F1, two critical indicators of balanced performance in classification tasks, our method also yields the best results, indicating improved predictive stability and robustness. Overall, these comparisons highlight the superiority of our proposed approach over existing advanced techniques for tabular data modeling.

Table 5:
Performance comparison between advanced tabular models and our proposed method.
Model AUROC AUPRC ACC MCC F1
MLP 0.7513 0.8851 0.8725 0.6338 0.9072
TabNet 0.8023 0.8782 0.8921 0.6215 0.9382
TabTransformer 0.8157 0.8859 0.8973 0.6321 0.9425
TabPFN 0.8235 0.8927 0.9056 0.6587 0.9491
FT-Transformer 0.8294 0.8993 0.9031 0.6523 0.9478
SAINT 0.8171 0.9027 0.8992 0.6409 0.9447
AutoInt 0.8011 0.8932 0.8819 0.6320 0.9318
Ours 0.8517 0.9213 0.9238 0.6942 0.9617
DOI: 10.7717/peerj-cs.3339/table-5

Stability and robustness study

To assess the stability and reliability of our proposed method in predicting loan repayment capacity, we performed experiments across 15 different random seeds. Each seed represents a distinct initialization of the random number generator used in the model training process, ensuring that the evaluation captures the inherent variability and robustness of the model under different starting conditions. The results, presented in Table 6, show that the model’s performance metrics remain relatively stable across all trials. The mean values for key evaluation metrics including AUROC, AUPRC, ACC, MCC, and F1 are consistently high, with means of 0.8522, 0.9207, 0.9244, 0.6945, and 0.9610, respectively. This indicates that the model is robust to variations in initialization, as it maintains a strong predictive ability for loan repayment capacity regardless of the random seed used during training. The standard deviations for these metrics are also relatively low, ranging from 0.0042 to 0.0069. These small variations suggest that the model’s performance is not highly sensitive to small changes in initialization, further demonstrating its stability and reliability. Specifically, the low standard deviation in metrics such as AUROC and F1 is an important indicator of how consistently the model can discriminate between repaying and non-repaying loan applicants, which is a crucial aspect of predicting loan repayment capacity accurately.
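The protocol amounts to re-seeding every source of randomness before each run and aggregating the resulting test scores, as in this sketch (train_and_evaluate is a hypothetical helper wrapping the training pipeline described above):

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every source of randomness used during training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

scores = []
for seed in range(15):
    set_seed(seed)
    scores.append(train_and_evaluate())  # hypothetical: returns test AUROC

print(f"mean={np.mean(scores):.4f}, std={np.std(scores):.4f}")
```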

Table 6:
Stability study across 15 random seeds.
Trial AUROC AUPRC ACC MCC F1
1 0.8517 0.9213 0.9238 0.6942 0.9617
2 0.8425 0.9152 0.9291 0.6827 0.9543
3 0.8571 0.9254 0.9167 0.7023 0.9635
4 0.8493 0.9137 0.9278 0.6912 0.9587
5 0.8659 0.9294 0.9223 0.6985 0.9682
6 0.8481 0.9186 0.9312 0.6881 0.9594
7 0.8564 0.9247 0.9245 0.6967 0.9618
8 0.8432 0.9161 0.9193 0.6854 0.9537
9 0.8603 0.9282 0.9267 0.7002 0.9659
10 0.8518 0.9194 0.9232 0.6949 0.9608
11 0.8578 0.9227 0.9205 0.6973 0.9626
12 0.8475 0.9171 0.9286 0.6892 0.9575
13 0.8612 0.9268 0.9257 0.7015 0.9664
14 0.8504 0.9202 0.9221 0.6935 0.9591
15 0.8551 0.9238 0.9240 0.6957 0.9610
Mean 0.8522 0.9207 0.9244 0.6945 0.9610
Std 0.0069 0.0048 0.0045 0.0050 0.0042
DOI: 10.7717/peerj-cs.3339/table-6

In conclusion, the stability across different random seeds is particularly significant in the context of loan repayment prediction, as it implies that the model is not overfitting to particular subsets of the data or to specific random initialization conditions. This robustness ensures that the model is likely to perform well on new, unseen loan data, making it a dependable tool for financial institutions in assessing the repayment likelihood of loan applicants.

Limitations

While the proposed Gated Transformer model achieves strong predictive performance on the LendingClub dataset, its applicability has several constraints. First, the model’s training and evaluation on a single platform-specific dataset may limit its generalizability to other financial institutions, credit products, or regional contexts without further adaptation. Second, as with most real-world credit data, historical biases in lending decisions and socioeconomic disparities may be embedded in the features, posing risks of perpetuating unfair outcomes if bias detection and mitigation are not implemented. Third, although the token-wise gating mechanism improves interpretability relative to conventional deep learning models, the architecture remains more complex than traditional statistical approaches, which may hinder adoption in regulatory environments requiring clear, transparent decision-making.

Conclusion

In this article, we introduced a novel Gated Transformer-based deep learning model tailored for tabular data in the domain of credit risk assessment. By effectively integrating categorical and numerical features into a unified embedding space and leveraging multi-head self-attention with token-wise gating, our model captures complex feature interactions and improves interpretability, addressing key challenges in economic predictive modeling. Experimental results demonstrate that our approach outperforms advanced tabular data models across multiple metrics, including accuracy, AUPRC, MCC, and F1, highlighting its robustness and predictive power. This work provides a flexible and powerful framework for extracting meaningful economic insights from complex credit datasets, facilitating more informed and responsible financial decision-making. In addition, future research should systematically evaluate the model’s transferability across diverse datasets and financial contexts to address the current limitation of platform dependency. Another important avenue lies in developing robust bias detection and mitigation strategies to prevent the reinforcement of historical socioeconomic disparities. Finally, simplifying the model architecture while preserving predictive performance may enhance its acceptance in highly regulated financial environments.

Supplemental Information

Code and data used in the study.

DOI: 10.7717/peerj-cs.3339/supp-1