Proactive detection of anomalous behavior in Ethereum accounts using XAI-enabled ensemble stacking with Bayesian optimization

PeerJ Computer Science

Introduction

Major contributions in this article

  • To mitigate the class imbalance in the data samples, an oversampling technique is deployed (a minimal sketch follows this list).

  • A Bayesian optimization technique is implemented to select the control parameters (hyperparameters) of the ML models.

  • Implementation of XAI techniques (SHAP, LIME, and ELI5) to interpret feature importance, adding an explainability layer to the model and improving decision transparency.

  • An ensemble stacking model that combines XGBoost, random forest (RF), and a neural network (NN) to provide robust fraud detection capabilities.

  • Extensive experimental validation demonstrating a model accuracy of 99.6%, benchmarked against existing state-of-the-art solutions.
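As a concrete illustration of the imbalance-mitigation step, the following is a minimal sketch that applies oversampling to the training split only. The choice of imbalanced-learn's SMOTE, the synthetic stand-in data, and the split ratio are illustrative assumptions, not necessarily the exact configuration used in this study.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Ethereum account features;
# ~5% positives mimic the fraud-label imbalance.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample only the training split so no synthetic samples
# leak into the held-out evaluation data.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_train_bal))
```

Resampling after the split keeps the test set representative of the true class distribution, so reported metrics are not inflated by synthetic minority samples.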

Paper organization

Literature review

Materials and Methods

Dataset description and preprocessing

Exploratory data analysis and scaling

Hyperparameter tuning with Bayesian optimization

1) Initial sampling: Bayesian optimization starts with an initial set of hyperparameter configurations, often chosen randomly or using a simple heuristic.

2) Modeling the objective function: Bayesian optimization leverages a surrogate model, often a Gaussian process, to represent the objective function (such as validation accuracy or loss) as a function of the hyperparameters. The surrogate model provides predictions of the objective function together with their associated uncertainty.

Gaussian process (GP) regression is commonly used as the surrogate model in Bayesian optimization. Given a set of observed data points (xi, yi), where the xi are hyperparameter configurations and the yi are the corresponding objective function values, the GP model predicts the objective function f(x) at a new point x as a Gaussian distribution:

f(x) ~ N(μ(x), σ²(x))

where:

μ(x) is the mean function of the GP, representing the predicted objective function value at x.

σ²(x) is the variance function of the GP, representing the uncertainty or confidence in the prediction at x.

3) Acquisition function: Using the surrogate model, an acquisition function such as expected improvement or upper confidence bound is employed to select the next set of hyperparameters to evaluate. This function manages the trade-off between exploring new configurations and exploiting the most promising ones. Commonly used acquisition functions include:

(a) Expected improvement (EI):

EI(x) = E[max(0, f_min − f(x))] = (f_min − μ(x)) Φ(z) + σ(x) φ(z)

where f_min is the minimum observed objective function value, Φ(z) is the cumulative distribution function of the standard normal distribution, φ(z) is its probability density function, and z = (f_min − μ(x)) / σ(x) is the standardized predicted improvement.

(b) Upper confidence bound (UCB):

UCB(x) = μ(x) + β σ(x)

where β is a tunable parameter that balances exploration (higher values of β) against exploitation (lower values of β).

(c) Probability of improvement (PI):

PI(x) = Φ((f_min − μ(x) − ξ) / σ(x))

where ξ is a parameter that controls the trade-off between exploration and exploitation.

4) Evaluation: The selected hyperparameter configuration is evaluated with the actual objective function (e.g., training on a subset of the data and validating on a separate validation set).

5) Update surrogate model: The surrogate model is refined with the newly acquired data point, i.e., the hyperparameter configuration and its corresponding objective function value.

6) Iterate: Steps 3–5 are repeated until a predefined convergence criterion is satisfied, such as reaching a specified number of iterations. A minimal end-to-end sketch of this loop is given below.

Machine learning models

Random forest

eXtreme gradient boosting

Neural network

Ensemble stacking model
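As a minimal sketch of the stacking architecture outlined in the contributions (XGBoost, RF, and a neural network as base learners), the example below uses scikit-learn's StackingClassifier. The logistic-regression meta-learner, the placeholder hyperparameter values, and the synthetic data are assumptions for illustration; in this study the hyperparameter values would come from the Bayesian optimization stage.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Hypothetical stand-in for the preprocessed (and oversampled) features.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Base learners named in the paper; hyperparameter values here are
# placeholders for the Bayesian-optimized ones.
base_learners = [
    ("xgb", XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("nn", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42),
    )),
]

# The meta-learner combines cross-validated base predictions;
# logistic regression is an assumed choice here.
stack = StackingClassifier(
    estimators=base_learners, final_estimator=LogisticRegression(), cv=5
)
stack.fit(X_train, y_train)
print("held-out accuracy:", stack.score(X_test, y_test))
```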

Results and discussion

Computational cost analysis

Performance comparison with state-of-the-art methods

Limitations and future directions

Conclusion

Additional Information and Declarations

Competing Interests

Author Contributions

Data Availability

Funding

This work was supported by the Vellore Institute of Technology. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
