R-NTN: a robust detection method for Ethereum phishing attacks based on multi-dimensional transaction features

PeerJ Computer Science

Introduction

Blockchain technology, proposed by Satoshi Nakamoto in 2008, has garnered widespread attention due to its characteristics of decentralization, immutability, transparency, anonymity, security, and programmability (Nakamoto, 2008). Blockchain does not rely on any central authority or third-party trust; instead, it maintains a reliable database through a distributed ledger system, with all participants jointly maintaining the integrity of the database (Iansiti & Lakhani, 2017). It serves as the technological foundation for a variety of cryptocurrencies.

Ethereum is one of the largest cryptocurrency platforms built on blockchain technology. According to data released on the DeFi Llama website as of October 4, 2024, Ethereum’s market value is approximately $316.307 billion, with daily transaction counts reaching 12.3 million. However, phishing attacks targeting Ethereum pose a significant risk of economic losses. Within Ethereum, phishing differs from traditional financial fraud, typically involving lower costs, broader reach, and larger sums of money (Lin et al., 2023). For instance, as disclosed by Bitrace, a phishing incident in June 2024 led to a victim being defrauded of over $200,000 worth of Ether and Ethereum staking vouchers (Chen et al., 2020a). These phishing scams have caused enormous economic losses and have become a primary threat to the security of Ethereum transactions. Therefore, detecting phishing accounts is urgently needed to ensure the healthy development of Ethereum (Trozze et al., 2022; Anita & Vijayalakshmi, 2019). However, the anonymity of blockchain poses an obstacle to the detection of phishing accounts. Registration of Ethereum accounts generally does not require identity verification, providing phishing attackers with the opportunity to create and control numerous phishing accounts for fraudulent activities. Fortunately, Ethereum’s transaction records are public, allowing us to identify phishing accounts by analyzing the patterns in their transactions through Ethereum’s transaction data.

Identifying phishing accounts on Ethereum is currently a research hotspot. Some scholars have detected phishing accounts using traditional machine learning methods (Zhao, 2023), but these require manual feature selection, which relies heavily on the researcher’s experience and may neglect the topological characteristics of accounts because accounts are treated independently (Farrugia, Ellul & Azzopardi, 2020). Other scholars have implemented detection using graph embedding methods (Chen et al., 2020c; Wang et al., 2022; Chen et al., 2020b; Xia, Liu & Wu, 2022). These methods often model Ethereum transaction data as a directed graph, with accounts as nodes and transactions as directed edges; graph embedding techniques are then used to obtain embedding values for each node as features, and finally downstream classifiers are used for classification. However, these methods do not consider the transaction behavior patterns of nodes, which can easily lead to the loss of important information. Still other scholars use graph neural network (GNN) methods (Sun et al., 2024; Liu et al., 2024), which likewise model Ethereum transactions as the input of neural networks. GNN-based methods can achieve high classification performance, but extreme data imbalance leaves them lacking sufficient robustness. The number of accounts in the Ethereum transaction network is very large, while the number of ground-truth labeled phishing nodes is very limited and widely distributed, making it difficult to construct a complete strongly connected graph that includes all phishing nodes, and also challenging to maintain a reasonable ratio of normal accounts to phishing accounts.

To address the aforementioned challenges, we propose Robust, Node behavior, Transaction structure, and Network (R-NTN; the source code of R-NTN has been released at https://github.com/AuroraBorealis222/R-NTN), a robust phishing account detection framework. Based on Ethereum transaction data, R-NTN constructs a 2-hop ego graph through random walks. It then uses a combination of traditional machine learning and graph embedding methods to build features and hands them to downstream classifiers for phishing account identification. Specifically, we organize features along three dimensions: node behavior features, graph-based transaction features, and network features. The node behavior features and graph-based transaction features are extracted using prior knowledge, while the network features are obtained through graph embedding methods. By concatenating these features, we obtain features that capture global information, and finally use downstream classifiers for identification. R-NTN can effectively utilize information from different dimensions of Ethereum transaction data, thereby overcoming the problem of data sample imbalance and enhancing the robustness of the detection model. Additionally, the design of global features compensates for the shortcomings of traditional machine learning in capturing high-order information.

The main contributions of this article are summarized as follows:

  • Based on the characteristics of Ethereum transaction data, we extract features from three distinct dimensions: node behavior, transaction patterns, and network structure, thereby constructing more comprehensive node features.

  • We have designed a series of biased random walk strategies for feature augmentation, which can effectively circumvent the camouflage behavior of phishing nodes and uncover deep hidden information within transaction data.

  • We have conducted experimental validation on public Ethereum transaction data. The results demonstrate that our method outperforms baseline methods as well as state-of-the-art (SOTA) methods. Additionally, we performed experiments on small-scale datasets to evaluate robustness, and the results indicate that our method exhibits the best robustness among similar methods.

In the next section, we review the latest research advancements related to Ethereum phishing detection. ‘Research Preparations’ outlines the preparatory work for this study, while ‘Proposed Framework’ presents the framework of R-NTN, detailing specific metrics and algorithms. ‘Experiments’ describes the experimental process and the corresponding results. Finally, ‘Conclusion and Future Work’ concludes this article and discusses directions for future work.

Related work

Traditional machine learning-based methods

Traditional machine learning-based detection methods primarily identify Ethereum phishing accounts by extracting relevant features and applying downstream machine learning classifiers. Farrugia, Ellul & Azzopardi (2020) extracted 42 account features from collected transaction data and applied eXtreme Gradient Boosting (XGBoost) for classification. Using the XGBoost model, the authors assessed the most important features for distinguishing between normal and fraudulent accounts, such as the total duration of account usage, the average minimum amount of Ether sent, and the account balance. Bian et al. (2020) constructed features in two parts: one based on manually summarized features from transaction history, and the other consisting of statistical features generated with the automated feature construction tool Featuretools. They then applied various classifiers, including K-Nearest Neighbors (KNN), Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), Adaptive Boosting (AdaBoost), and Light Gradient Boosting (LightGBM) to detect fraudulent Ethereum accounts. Ibrahim, Elian & Ababneh (2021) used a feature selection protocol to highlight the most important features, ultimately obtaining a dataset with 26 features. They then applied three classification models—XGBoost, SVM, and logistic regression—to identify phishing nodes. Despite these methods achieving certain results, they fail to capture the full range of potential features within the Ethereum transaction network.

Graph embedding-based methods

The primary objective of graph embedding is to transform nodes in a graph into low-dimensional vector representations, thereby enabling the execution of downstream tasks. Common methods for graph embedding include matrix factorization, random walks, and deep learning techniques. On the Ethereum platform, random walks and deep learning methods are most frequently used for node embedding. Wu et al. (2020) conducted biased random walks based on transaction information, such as transaction amounts, timestamps, and their combinations. They then employed the Skip-gram model to learn node embeddings and used an SVM classifier for the downstream identification task. In Wang et al.’s (2022) work, embeddings for nodes, edges, and attributes were learned based on walk sequences, which proved particularly effective for heterogeneous networks. In contrast to graph embedding methods based on random walks, deep learning-based graph embeddings do not require walks; instead, they use deep learning techniques to directly learn embeddings. Chen et al. (2020b) proposed a detection method based on graph convolutional networks (GCN) and autoencoders. Specifically, the GCN acts as an encoder, and its output is used as a decoder to approximate the adjacency matrix, thereby generating node embeddings. These embeddings are then concatenated with the original node features for classification. Xia, Liu & Wu (2022) proposed an attribute-based ego graph embedding framework to detect phishing accounts on Ethereum by integrating both structural and attribute characteristics of Ethereum transaction records.

Graph neural networks-based methods

GNN-based methods first compute node features in the transaction graph and then input these features directly into the GNN for model training. Once the model is trained, it outputs the predicted results for the nodes. Sun et al. (2024) proposed the Attention-Based Graph Representation Learning (ABGRL) method, which utilizes multi-channel feature extraction for nodes and integrates these features through an adaptive attention mechanism, focusing on the most relevant information for the task. It also employs a self-supervised regression model to enhance the feature representation of nodes with low degrees, addressing the long-tail distribution problem and improving the accuracy of phishing account detection. Huang et al. (2024) proposed a novel framework called PEAE-GNN, which introduces a feature enhancement strategy based on structural features, transaction features, and interaction strength, to learn ego-graph representations through GNNs. Additionally, they proposed a graph-level representation method that sorts the updated node features in descending order and averages the top-n node features, thereby preserving key information while minimizing noise. Liu et al. (2024) introduced a heterogeneous GNN method based on neighbor relationship filtering, which uses random walks and reinforcement learning to assess the similarity between Ethereum transaction accounts and select the best neighbor nodes. By aggregating internal and external relationships to represent neighborhood connections and introducing an initial residual in the cross-relationship aggregation to prevent overfitting, this approach ensures the suitability of the neighbor aggregation process for deeper GNN architectures.

In addition to adaptive and heterogeneous GNN models, Jin & Yang (2024) explored phishing detection in Ethereum smart contracts by constructing a contract-level transaction graph. They applied Synthetic Minority Oversampling Technique–Edited Nearest Neighbors (SMOTE-ENN) to handle class imbalance and compared various backbone models, including Multi-Layer Perceptron (MLP) and GCN. Their results reveal that, under specific contract-level transaction structures, simpler models like MLP can surpass GCN in both efficiency and scalability, offering valuable insights into model selection for fraud detection in contract-based environments. Furthermore, Chang et al. (2024b) introduced an improved GAT-based model for detecting anomalous nodes in blockchain graphs. By integrating subtree-level attention and a bagging-stacking ensemble framework, their approach effectively captures multi-hop neighborhood dependencies and enhances detection robustness, especially under noisy or sparse transaction conditions.

Recently, contrastive learning has also been introduced into Ethereum fraud detection to enhance representation quality under limited supervision. Jin, Wang & Li (2024) proposed Meta-IFD, a generative-and-contrastive self-supervised learning framework that captures latent behavioral semantics by aligning node representations across multiple interaction views. The method leverages contrastive loss to distinguish phishing accounts from normal ones by maximizing mutual information between consistent node views while minimizing similarity with unrelated samples. Experimental results demonstrate that incorporating contrastive objectives significantly improves detection performance, especially under imbalanced or noisy data conditions.

Research preparations

Data acquisition

We utilized the blockchain dataset published on the XBlock website (data available at https://xblock.pro/#/dataset/13) and curated it throughout the research process. We refer to the dataset construction method proposed in the literature (Xiang et al., 2023). However, unlike a completely random approach, we aim to capture more nodes that have strong relationships with the source node. Specifically, we assume that nodes with higher transaction volumes with the source node are more closely related. Given a weighted transaction graph $G_0(V_0, E_0, \omega)$ as input (where $\omega: E_0 \to \mathbb{R}^+$ represents the transaction amount on each edge), we generate the target subgraph $G_1(V_1, E_1)$ using an improved weighted random walk strategy.

  1. Designing a probability-based sampling mechanism based on transaction amount proportions, where the probability of selecting node $u$ given node $v$ is defined as $$P(u \mid v) = \frac{\omega(v, u)}{\sum_{u' \in V_0} \omega(v, u')},$$ with the summation in the denominator running over all nodes $u' \in V_0$.

  2. Introducing a bidirectional edge-weight merging operator to ensure graph symmetry, formulated as $\omega_{sym}(u, v) = \omega(u \to v) + \omega(v \to u)$.

  3. Implementing a progressive neighborhood expansion under a dynamic path exclusion constraint.

The algorithm controls the maximum number of sampled nodes using parameter K and limits the walk depth with k. The final output is a refined subgraph that contains the source node and its highly weighted transaction-related nodes, providing a high-value subgraph construction method for phishing node detection. The detailed procedure is illustrated in Algorithm 1.

Algorithm 1:
Amount-aware Ethereum k-hop subgraph extraction.
Input: Weighted transaction graph G0(V0, E0, ω) where ω: E0 → ℝ⁺, source nodes S ⊆ V0, max nodes K ∈ ℕ⁺, hop threshold k ∈ ℕ⁺
Output: Pruned k-hop subgraph G1(V1, E1)
 1: procedure PruneGraph(G0, S, K, k)
 2:     G_un ← Symmetrize(G0)                              ▷ merges edge weights
 3:     Γ ← ∅                                              ▷ global neighborhood set
 4:     for each v ∈ S do
 5:         τ_v ← ∅                                        ▷ local accumulator
 6:         while |τ_v| < K do
 7:             u ← WeightedSample(N(v), ω(v, ·))          ▷ amount-proportional sampling
 8:             P ← {v → u}                                ▷ weighted walk path
 9:             for i = 2 to k do
10:                 w ← WeightedSample(N(u) \ P, ω(u, ·))
11:                 P ← P ∪ {u → w}
12:                 u ← w                                  ▷ propagate weighted walk
13:             end for
14:             τ_v ← τ_v ∪ P
15:         end while
16:         Γ ← Γ ∪ τ_v
17:     end for
18:     V1 ← S ∪ Γ
19:     E1 ← {(u, w) ∈ E0 | u, w ∈ V1}
20:     return G1(V1, E1)
21: end procedure
DOI: 10.7717/peerj-cs.3445/table-5
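As a rough illustration, Algorithm 1 can be sketched in plain Python over an adjacency dictionary of already-symmetrized edge weights. This is a hedged sketch, not the released implementation: the function name `extract_subgraph`, the attempt cap (which guards against source nodes whose reachable neighborhood is smaller than K), and the toy data layout are our own assumptions.

```python
import random

def extract_subgraph(adj, sources, K, k, seed=0):
    """Amount-aware k-hop subgraph extraction (sketch of Algorithm 1).

    adj     : dict mapping node -> {neighbor: merged transaction weight}
    sources : iterable of source nodes S
    K       : max sampled nodes per source; k: walk depth
    """
    rng = random.Random(seed)

    def weighted_sample(cands):
        # Amount-proportional sampling; uniform fallback when weights are all 0
        nodes, weights = zip(*cands.items())
        if sum(weights) <= 0:
            return rng.choice(nodes)
        return rng.choices(nodes, weights=weights, k=1)[0]

    gathered = set()
    for v in sources:
        local = set()          # local accumulator tau_v
        attempts = 0           # cap to avoid looping when neighborhood < K
        while len(local) < K and attempts < 10 * K:
            attempts += 1
            if not adj.get(v):
                break
            u = weighted_sample(adj[v])
            path = {v, u}      # dynamic path exclusion: never revisit a node
            for _ in range(2, k + 1):
                cands = {w: wt for w, wt in adj.get(u, {}).items()
                         if w not in path}
                if not cands:
                    break
                u = weighted_sample(cands)
                path.add(u)
            local |= path
        gathered |= local
    nodes = set(sources) | gathered
    # Induce edges of the original graph restricted to the sampled nodes
    edges = [(u, w) for u in nodes for w in adj.get(u, {}) if w in nodes]
    return nodes, edges
```

Here WeightedSample is realized with `random.choices`, and the induced edge set corresponds to line 19 of the pseudocode.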

Ethereum’s transaction structure

We visualized a subset of Ethereum’s transactions and nodes, where red nodes represent phishing nodes and green nodes represent normal nodes, as shown in Fig. 1A. The distribution of nodes in Ethereum exhibits the following characteristics: First, the network contains a large number of nodes, with low-degree nodes constituting 90% of the total, as illustrated in Fig. 1B. Most of these low-degree nodes tend to connect with high-degree nodes. Second, the number of ground-truth phishing nodes is significantly smaller than that of normal nodes, resulting in a highly imbalanced sample ratio of 1:2576.5. Finally, Ethereum transactions display community-like structures, where most nodes tend to connect with other nodes within the same community.


Figure 1: (A) Partial Ethereum transaction subgraph. (B) Cumulative degree distribution of Ethereum accounts. Most nodes have low degrees, while a few serve as hubs.

Formal definition

Building on the directed transaction network and the distribution characteristics of Ethereum nodes obtained in ‘Data Acquisition’, we formally define the problem. The identification of phishing accounts can be viewed as a fully supervised node classification task. We represent the transaction network as $G(V, E)$, where $V$ is the set of nodes in the transaction network, i.e., the set of account addresses within the network, and $E$ is the set of directed edges representing transactions. Each directed edge signifies a transaction from the source node, the initiator of the transaction, to the target node, the recipient of the transaction. In our study, each node $v \in V$ is defined as an ordered pair $v = (n, y)$, where $n$ is the node’s address and $y \in \{0, 1\}$ is its label. The label 0 indicates a normal node, while the label 1 indicates a phishing node, which is the positive example we aim to identify. Each edge $e \in E$ is defined as a quadruple $e_{n_s, n_d} = (n_s, n_d, a, t)$, where $n_s$ represents the initiating node of the transaction, $n_d$ represents the destination node, $a$ denotes the amount of the transaction, and $t$ denotes the timestamp of the transaction.

In the context of Ethereum phishing detection, anomalies are defined as accounts whose transactional behaviors and network structures significantly deviate from those of legitimate users. These anomalies typically exhibit suspicious patterns such as abnormally high transaction frequency, disproportionate incoming and outgoing fund flows, or frequent interactions with newly created or blacklisted addresses. In this work, we focus on point anomalies, where individual accounts demonstrate behavior that is statistically irregular in comparison to the majority of accounts. This includes accounts that act as hubs in short timeframes, or those forming peculiar structures in the transaction graph. Unlike collective or contextual anomalies, which require modeling group behavior or external conditions, point anomalies are more prevalent and practical in Ethereum fraud detection scenarios.

Our proposed method primarily focuses on feature engineering to capture more comprehensive information. We divide the feature construction process into three components: Node Behavior Features $F_{NB}$, Graph-based Transaction Features $F_{TN}$, and Network Features $F_G$. These three feature sets are normalized and concatenated into a feature matrix $X \in \mathbb{R}^{|V| \times d}$, where $d$ represents the total dimensionality, i.e., the sum of the dimensions of the $F_{NB}$, $F_{TN}$, and $F_G$ features.
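The normalization-and-concatenation step described here can be sketched as follows. This is a minimal pure-Python illustration; the helper names (`min_max_normalize`, `build_feature_matrix`) and the convention of mapping constant columns to 0 are our own assumptions, not details from the paper.

```python
def min_max_normalize(matrix, eps=1e-12):
    """Column-wise min-max scaling to [0, 1]; constant columns map to 0."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        lo, hi = min(col), max(col)
        span = max(hi - lo, eps)  # eps guards against division by zero
        for i in range(n_rows):
            out[i][j] = (matrix[i][j] - lo) / span
    return out

def build_feature_matrix(F_nb, F_tn, F_g):
    """Normalize each feature block, then concatenate rows into X (|V| x d)."""
    blocks = [min_max_normalize(F) for F in (F_nb, F_tn, F_g)]
    return [sum((b[i] for b in blocks), []) for i in range(len(F_nb))]
```

Each block is scaled independently before concatenation, so no single feature family dominates the downstream classifier purely by magnitude.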

Proposed framework

In this section, we propose R-NTN, a robust framework for detecting Ethereum phishing accounts, which leverages multi-dimensional features. By integrating downstream classifiers, we achieve effective detection of phishing nodes while addressing the challenge of insufficient robustness in detection models when handling imbalanced datasets. The architecture of the proposed framework is illustrated in Fig. 2. First, we obtain on-chain real data from the Ethereum network and construct a dataset based on this data. Features are then extracted from three dimensions and subsequently normalized. Finally, the normalized features from all three dimensions are input into a classifier for classification.


Figure 2: Overview of the R-NTN framework.

The dataset is constructed from the Ethereum transaction network. Features are extracted from three perspectives—behavioral, structural, and temporal—and then fused and fed into the detector for phishing account identification.

Node behavioral features

This section focuses on features that capture the transactional behavior of nodes. We define nine features: out-degree, in-degree, total degree, total amount received, total amount sent, overall transaction volume, transaction frequency, frequency variability, and anomaly matching score.

Node degree. The degree of a node is a basic property in graph theory. The out-degree Dout denotes how many times the node sends ether to others, while the in-degree Din indicates how many times it receives ether. The total degree Dall represents the total number of transactions involving the node. These are formally defined as:

$$D_{out/in} = \left|\{\, e \in E \mid e = (n_s, n_d, a, t),\ n_{s/d} = n \,\}\right|$$

$$D_{all} = D_{in} + D_{out}.$$

Node transaction amounts. Transaction amounts are vital features that reflect fund flows and activity levels on the Ethereum network. The total transaction amount Aall refers to the sum of all transactions associated with a node. The incoming transaction amount Ain captures the total value of ether received by the node, while the outgoing transaction amount Aout denotes the total value sent. These metrics help reveal economic behavior, detect potential phishing activity, and analyze fund movement patterns. Their formal definitions are as follows:

$$A_{in/out}(v) = \sum_{e \in E} a_e, \quad \text{where } e = (n_s, n_d, a, t) \text{ and } n_{d/s} = n$$

$$A_{all} = A_{in} + A_{out}.$$

In the formula, $a_e$ represents the amount associated with the edge $e_{n_s, n_d}$, where $e$ denotes the directed transaction between nodes $n_s$ and $n_d$. The condition $n_{d/s} = n$ indicates that $n$ is the target node $n_d$ (for $A_{in}$) or the source node $n_s$ (for $A_{out}$).

Node transaction features. We extract three transaction-based features for each node: transaction frequency, frequency variability, and anomaly matching degree. Transaction frequency measures the ratio between the number of transactions and the active duration of the node. It serves as a key indicator of abnormal behavior. For example, phishing nodes may initiate numerous transactions within a short period to increase their visibility and deceive other users. Transaction frequency variability captures fluctuations in transaction activity over time. A high variability suggests inconsistent behavior, such as sudden spikes in transaction volume or connections to many new nodes, which may indicate malicious intent. While high transaction volume and large amounts often suggest normal activity, phishing nodes may instead issue many small-value transactions to avoid detection. To identify such anomalies, we define the anomaly matching degree as a value between 0 and 1, representing the consistency between transaction amount and count. This feature helps distinguish phishing nodes that exhibit unusual transaction patterns. The formal definitions of transaction frequency f(v), frequency variability Diff(v), and anomaly matching score AnomalyMatchScore(v) are as follows:

$$f(v) = \frac{D_{all}(v)}{t_{last}(v) - t_{first}(v)}.$$

In this formula, $t_{last}(v)$ and $t_{first}(v)$ represent the timestamps of the last and first transactions of node $v$ in the dataset, respectively. $D_{all}(v)$ denotes the total degree of the node.

$$Diff(v) = \sqrt{\frac{\sum_{i=1}^{n-1} \left(\Delta t_i - \overline{\Delta t}\right)^2}{D_{all}(v) - 1}}.$$

In this formula, $Diff(v)$ quantifies the variability of transaction frequency for node $v$. Here, $\Delta t_i$ refers to the time interval between the $i$-th and $(i+1)$-th transactions, i.e., the time difference between two consecutive transactions. $\overline{\Delta t}$ indicates the average value of all transaction intervals, while $D_{all}(v) - 1$ represents the number of transaction intervals for the node.

$$AnomalyMatchScore(v) = \frac{A_{all}(v)}{D_{all}(v)}.$$

In this context, $A_{all}(v)$ represents the total transaction amount associated with a given node $v$, while $D_{all}(v)$ denotes the degree of the node, defined as the total number of transactions the node has participated in.

After completing the above steps, we construct the feature matrix $X_{NB} \in \mathbb{R}^{|N| \times d}$, which captures the transactional behavior of each node. Here, $|N|$ denotes the number of nodes, and $d = 9$ is the number of behavioral features. We then apply min-max normalization to scale all feature values to a consistent range. Finally, this feature matrix is concatenated with additional feature sets to support the analysis of more complex behavior patterns, providing a strong foundation for downstream tasks.
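As an illustration, the nine behavioral features can be computed from a node's transaction quadruples roughly as follows. This is a sketch with a hypothetical helper name (`node_behavior_features`), not the authors' released code.

```python
from statistics import mean

def node_behavior_features(edges, v):
    """Nine behavioral features of node v from (n_s, n_d, amount, timestamp) edges."""
    d_in = d_out = 0
    a_in = a_out = 0.0
    times = []
    for ns, nd, a, t in edges:
        if ns == v:                      # v sends ether
            d_out += 1
            a_out += a
            times.append(t)
        if nd == v:                      # v receives ether
            d_in += 1
            a_in += a
            times.append(t)
    d_all = d_in + d_out
    a_all = a_in + a_out
    times.sort()
    span = times[-1] - times[0] if len(times) > 1 else 0.0
    freq = d_all / span if span > 0 else 0.0            # f(v)
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    if gaps and d_all > 1:
        g_bar = mean(gaps)
        # Diff(v): dispersion of inter-transaction intervals
        diff = (sum((g - g_bar) ** 2 for g in gaps) / (d_all - 1)) ** 0.5
    else:
        diff = 0.0
    match = a_all / d_all if d_all else 0.0             # AnomalyMatchScore(v)
    return {"D_out": d_out, "D_in": d_in, "D_all": d_all,
            "A_in": a_in, "A_out": a_out, "A_all": a_all,
            "f": freq, "Diff": diff, "AnomalyMatch": match}
```

In a full pipeline, the resulting dictionaries for all nodes would be stacked into the behavioral feature matrix before min-max normalization.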

Graph-based transaction features

Graph-based transaction network features capture complex interactions and structural patterns in transaction networks. These features reveal node connectivity, transaction distribution, and potential community structures. We categorize them into six groups: subgraph features, Pearson correlation coefficients, betweenness centrality, closeness centrality, PageRank scores, and graph density.

Subgraph features. Ethereum transaction networks are often imbalanced. To handle this, we preprocess by constructing subgraphs. Since phishing nodes may be spatially distant, possibly linked to different organizations or deliberately avoiding direct connections, strongly connected graphs are difficult to form. Thus, multiple subgraphs naturally exist. To understand their importance, relationships, and transaction behavior, we analyze specific subgraph metrics. These include the average and standard deviation of in-degree, out-degree, and total degree, as well as the maximum degrees for each subgraph. Given a subgraph $S$ with node set $V_S$ and edge set $E_S$, for any node $v \in V_S$, the following definitions apply:

$$Avg_{in}(S) = \frac{1}{|V_S|} \sum_{v \in V_S} D_{in}(v), \qquad Avg_{out}(S) = \frac{1}{|V_S|} \sum_{v \in V_S} D_{out}(v), \qquad Avg_{all}(S) = \frac{1}{|V_S|} \sum_{v \in V_S} D_{all}(v),$$

where $Avg_{in}(S)$, $Avg_{out}(S)$, and $Avg_{all}(S)$ are the average in-degree, out-degree, and total degree of nodes in subgraph $S$, respectively. Here, $D_{in}(v)$ counts edges directed to node $v$, $D_{out}(v)$ counts edges originating from $v$, and $D_{all}(v) = D_{in}(v) + D_{out}(v)$.

$$SD_{in}(S) = \sqrt{\frac{\sum_{v \in V_S} \left(D_{in}(v) - \overline{D_{in}}\right)^2}{|V_S| - 1}}, \qquad SD_{out}(S) = \sqrt{\frac{\sum_{v \in V_S} \left(D_{out}(v) - \overline{D_{out}}\right)^2}{|V_S| - 1}}, \qquad SD_{all}(S) = \sqrt{\frac{\sum_{v \in V_S} \left(D_{all}(v) - \overline{D_{all}}\right)^2}{|V_S| - 1}},$$

where $SD_{in}(S)$, $SD_{out}(S)$, and $SD_{all}(S)$ measure the dispersion of node degrees in $S$. The terms $\overline{D_{in}}$, $\overline{D_{out}}$, and $\overline{D_{all}}$ denote the average in-degree, out-degree, and total degree of all nodes in $S$. The denominator $|V_S| - 1$ ensures unbiased estimation of the population standard deviation.

Let $V_S$ denote the set of nodes within the subgraph $S$, with $|V_S|$ its cardinality. The maximum degrees $Max_{in}(S)$, $Max_{out}(S)$, and $Max_{all}(S)$ are defined as the highest in-degree, out-degree, and total degree observed among nodes in $S$, respectively.
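The subgraph degree statistics above can be sketched from a directed edge list as follows (hypothetical helper `subgraph_degree_stats`; it uses the sample standard deviation, matching the |V_S| - 1 denominator in the formulas):

```python
from statistics import mean, stdev

def subgraph_degree_stats(edges, nodes):
    """Average, sample standard deviation, and max of in/out/total degree
    for a subgraph given its directed edge list (illustrative sketch)."""
    d_in = {v: 0 for v in nodes}
    d_out = {v: 0 for v in nodes}
    for u, w in edges:
        if u in d_out:
            d_out[u] += 1
        if w in d_in:
            d_in[w] += 1
    d_all = {v: d_in[v] + d_out[v] for v in nodes}
    stats = {}
    for name, deg in (("in", d_in), ("out", d_out), ("all", d_all)):
        vals = list(deg.values())
        stats[f"avg_{name}"] = mean(vals)
        # sample std dev needs at least two nodes; fall back to 0 otherwise
        stats[f"sd_{name}"] = stdev(vals) if len(vals) > 1 else 0.0
        stats[f"max_{name}"] = max(vals)
    return stats
```

Applied per connected subgraph, this yields the nine degree-statistic entries of the graph-based feature block.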

Pearson correlation coefficient. We use the Pearson correlation coefficient to measure the linear relationship between a node’s degree and the degrees of its neighbors. A coefficient near 1 indicates a strong positive correlation, meaning high-degree nodes tend to connect to other high-degree nodes, suggesting clustering. A value near −1 shows a strong negative correlation, where high-degree nodes link to low-degree nodes, reflecting specific connection preferences. A value close to 0 implies no clear linear correlation, indicating more complex or random connection patterns. Small absolute values of r may also signal nonlinear or stochastic network behaviors. The Pearson correlation coefficient r is defined as follows:

$$r = \frac{\sum_{i=1}^{n} (d_i - \bar{d})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (d_i - \bar{d})^2} \, \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}},$$

where $r$ represents the Pearson correlation coefficient, $n$ denotes the number of neighboring nodes, $d_i$ is the degree of the neighboring node $i$, and $\bar{d}$ is the average degree of all neighboring nodes. $x_i$ refers to the degree of the current node, and $\bar{x}$ is its average. Notably, $\bar{x}$ is constant because the current node’s degree is the same across all of its neighbors.

Betweenness centrality. Betweenness centrality identifies nodes that occupy key positions in transaction pathways. A high value means the node plays a critical role in controlling the flow of information and funds. A low value indicates the node has a minor role and limited influence. We calculate betweenness centrality for each node using the following formula:

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}.$$

Here, $C_B(v)$ represents the betweenness centrality of node $v$. The term $\sigma_{st}$ denotes the total number of shortest paths between nodes $s$ and $t$, while $\sigma_{st}(v)$ represents the number of such paths that pass through node $v$. The summation is performed over all node pairs $(s, t)$, where neither $s$ nor $t$ is equal to $v$.

Closeness centrality. Closeness centrality measures how central a node is by calculating the average shortest path length from the node to all others in the network. A smaller average distance means the node is more central and can spread information faster. A larger distance indicates the node is more peripheral. For a node $v$, its closeness centrality $C_C(v)$ is defined as:

$$C_C(v) = \frac{n - 1}{\sum_{i \neq v} d_{iv}}.$$

Here, $n$ represents the total number of nodes in the network, $d_{iv}$ denotes the length of the shortest path from node $v$ to node $i$, and the summation is taken over all nodes $i$ with $i \neq v$.

PageRank. PageRank evaluates the importance of nodes within a network. Originally developed by Larry Page for ranking web pages in Google’s search engine, it assigns higher importance to nodes with many or high-quality incoming links. A high PageRank means a node is influential, having more or stronger incoming connections. A low PageRank indicates less influence due to fewer or weaker links. In network analysis, PageRank captures both the quantity and quality of incoming links. In this study, we use PageRank to measure node significance in transaction networks. The PageRank PR(v) for a node v is defined as:

$$PR(v) = \frac{1 - d}{n} + d \sum_{u \in B_v} \frac{PR(u)}{L(u)},$$

where $PR(v)$ represents the PageRank value of node $v$, $n$ denotes the total number of nodes in the network, and $d$ is the damping factor. The damping factor, in this context, reflects the probability that a random surfer will continue to follow the transaction path, indicating the persistence of fund flow and the level of transaction activity within the network. $B_v$ represents the set of nodes that link to node $v$, while $L(u)$ denotes the out-degree of node $u$. The summation is performed over all nodes $u$ that point to $v$.

Graph density. In the Ethereum transaction network, density reflects the degree of connectivity within the network. A high-density network indicates numerous transaction relationships among addresses, whereas a low-density network may suggest that transaction activities are more dispersed. The density of subnetworks is considered a feature to reveal the intensity of transaction relationships between nodes within the network. High-density subnetworks indicate frequent internal transactions and close connections between nodes, which help capture the importance of different subnetworks and the relationship between transaction volumes in subgraphs and phishing nodes. The density D(V) of a transaction subgraph is defined as follows:

$$D(V) = \frac{2e}{n(n - 1)},$$

where $e$ is the number of edges in the subgraph, and $n$ is the number of nodes in the subgraph.
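To make the structural metrics concrete, here is a small pure-Python sketch of closeness centrality (BFS on an unweighted graph), PageRank by power iteration, and graph density. These are our own illustrative helpers; real pipelines would typically rely on a graph library, and this PageRank variant ignores dangling-node mass.

```python
from collections import deque

def closeness(adj, v):
    """C_C(v) = (n - 1) / sum of shortest-path lengths from v.
    Assumes the graph is connected; unweighted BFS distances."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    total = sum(dist.values())
    return (len(adj) - 1) / total if total else 0.0

def pagerank(adj, d=0.85, iters=50):
    """PR(v) = (1 - d)/n + d * sum_{u in B_v} PR(u)/L(u), by power iteration."""
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        pr = {v: (1 - d) / n
                 + d * sum(pr[u] / len(adj[u]) for u in adj if v in adj[u])
              for v in adj}
    return pr

def density(n_nodes, n_edges):
    """Graph density D = 2e / (n(n - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0
```

Betweenness centrality follows the same spirit but requires shortest-path counting (e.g., Brandes' algorithm), so it is omitted here for brevity.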

We extract the aforementioned features and apply min-max normalization to construct a feature matrix $F_{TN} \in \mathbb{R}^{|V| \times i}$, where $i$ is a constant set to 14. These features not only capture the structural information of the network but also reveal dynamic relationships and behavioral patterns between nodes, thereby offering a valuable perspective for the model to comprehensively understand the network’s characteristics.

Graph-based network features

Although feature engineering can identify a large portion of phishing nodes, some malicious accounts in Ethereum transactions deliberately mimic legitimate user behavior to evade detection. These deceptive strategies pose significant challenges for traditional detection methods. Inspired by the work of Xiang et al. (2023), we propose a feature augmentation method based on a biased Skip-Gram model. This approach enables a deeper analysis of transaction patterns in the Ethereum transaction network, helping to uncover hidden structures and detect anomalous behavior.

In our method, transactions between Ethereum accounts and their neighbors are converted into random walk probabilities. These probabilities reflect variations in transaction amounts and timestamps. Specifically, a biased random walk is initiated from a randomly selected node in graph G, generating a sequence of nodes with a fixed length. The Skip-Gram model is then employed to learn low-dimensional vector representations for each node, resulting in a feature matrix XG|V|×sn. This process uses a gradient ascent algorithm to optimize the embedding parameter θv, thereby learning representations that maximize the co-occurrence probability between a node and its surrounding context nodes:

max_{θ_v} ∑_{u ∈ C(v)} log p(u | v; θ_v)

where θ_v represents the embedding vector of node v, log p(u | v; θ_v) denotes the conditional probability of node u given v, and C(v) is the set of contextual nodes of v.

The algorithm maps sequential node information into a vector space, ensuring that adjacent nodes in the original network remain geometrically close. This facilitates the understanding and prediction of transaction behaviors between nodes. Two key hyperparameters control this process: T, the number of walks per node, and L, the length of each generated node sequence.
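The walk-generation step can be sketched as follows. This is an illustrative version with hypothetical names (`next_prob` is assumed to return a node's neighbors and their transition probabilities, as defined by either biased strategy below); the resulting sequences would then be fed to a Skip-Gram model:

```python
import random

def generate_walks(nodes, next_prob, T=50, L=40, seed=0):
    """Produce T biased random walks of length L from each start node.
    next_prob(v) -> (neighbors, probabilities) encodes the walk bias."""
    rng = random.Random(seed)
    walks = []
    for v in nodes:
        for _ in range(T):
            walk = [v]
            while len(walk) < L:
                nbrs, probs = next_prob(walk[-1])
                if not nbrs:  # dead end: terminate the walk early
                    break
                walk.append(rng.choices(nbrs, weights=probs, k=1)[0])
            walks.append(walk)
    return walks
```

In practice the sequences could be passed to an off-the-shelf Skip-Gram implementation (e.g., gensim's Word2Vec with `sg=1`) to obtain the node embeddings.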

Furthermore, we have designed two types of biased random walk strategies, one based on transaction amounts and the other based on timestamps.

Transaction-amount-based biased random walk. During the Ethereum transaction process, it is generally assumed that the larger the transaction amount, the stronger the connection between the two parties. To compute the probabilities for the random walk, we pre-calculate the transaction amounts associated with each node. The transition probability for the random walk is defined as follows:

p(u | v) = T_vu / ∑_{w ∈ N(v)} T_vw,  if ∑_{w ∈ N(v)} T_vw > 0;  1/|N(v)|, otherwise

where T_vu represents the transaction volume from node v to its neighbor u, N(v) denotes the set of neighbor nodes of v, and |N(v)| is the number of neighbors of v.

If the total transaction volume wN(v)Tvw for node v is greater than zero, the transition probability to neighbor u is proportional to Tvu relative to the total transaction volume across all neighbors of v. Conversely, if the total transaction volume of v is zero (i.e., no transactions occur), the transition probability is uniformly distributed among all neighbors. This process normalizes the transaction volumes among neighbors, ensuring that the sum of transition probabilities from node v equals 1. This allows the random walk to explore the network with bias determined by the magnitude of transaction volumes.
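A minimal sketch of this normalization (hypothetical helper; `tx_amount` maps an ordered node pair to its transaction volume) is:

```python
def amount_transition_probs(v, neighbors, tx_amount):
    """p(u|v) proportional to transaction volume T_vu, with a uniform
    fallback when node v has no recorded outgoing volume."""
    vols = [tx_amount.get((v, u), 0.0) for u in neighbors]
    total = sum(vols)
    if total > 0:
        return [t / total for t in vols]
    return [1.0 / len(neighbors)] * len(neighbors)
```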

Timestamp-based biased random walk. Similarly, it is commonly assumed that the more recent the timestamp, the stronger the connection between two nodes. Consider a current node v and its set of neighboring nodes N(v). Each neighboring node u is associated with a timestamp T_vu, representing the time of the most recent transaction between node v and node u. The transition probability P(u | v), which defines the likelihood of transitioning from node v to the next node u, is given as follows:

P(u | v) = exp(−f(T_v, T_vu)/τ) / ∑_{k ∈ N(v)} exp(−f(T_v, T_vk)/τ)

f(T_v, T_vu) = |T_v − T_vu|.

Here, Tv denotes the last transaction timestamp of node v, and Tvu represents the last transaction timestamp between nodes v and u. The decay factor τ controls how temporal differences affect random walk probabilities. N(v) refers to the set of neighbors of node v. The resulting probability reflects the recency of transactions, favoring nodes with more recent interactions. This time-aware strategy enables the random walk to capture dynamic and time-dependent patterns in the network.
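The time-aware transition probability above is a softmax over negative time gaps; a minimal sketch (illustrative helper names) is:

```python
import math

def time_transition_probs(v_time, nbr_times, tau=1.0):
    """Softmax over -|T_v - T_vu| / tau: neighbors with more recent
    transactions (smaller time gaps) receive higher probability."""
    weights = [math.exp(-abs(v_time - t) / tau) for t in nbr_times]
    z = sum(weights)
    return [w / z for w in weights]
```

A larger τ flattens the distribution toward a uniform walk, while a small τ concentrates the walk on the most recent counterparties.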

To ensure feature consistency, we normalize the extracted features FNB, FTN, and FG. This normalization improves model efficiency and accuracy during training. The normalized features are then concatenated to form a unified feature matrix X|V|×m. This integration not only preserves feature relationships but also provides the classifier with a richer set of information, which is crucial for improving overall model performance.
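The normalize-then-concatenate step can be sketched as below (an illustrative helper assuming column-wise min-max scaling, which matches the normalization described earlier; constant columns are left at zero to avoid division by zero):

```python
import numpy as np

def fuse_features(*blocks):
    """Min-max normalize each feature block column-wise, then concatenate
    horizontally into a single matrix X of shape (|V|, m)."""
    normed = []
    for F in blocks:
        F = np.asarray(F, dtype=float)
        lo, hi = F.min(axis=0), F.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
        normed.append((F - lo) / span)
    return np.hstack(normed)
```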

Experiments

In this section, we present the experimental evaluation of our methodology, followed by a comparison, analysis, and discussion of the results in relation to relevant methods.

Dataset

We utilized the Ethereum phishing transaction network dataset provided by XBlock for our experiments. This dataset originally contains 2,973,489 nodes and 13,551,303 edges. Among these, 1,154 nodes are labeled as phishing accounts based on data collected from the Etherscan labeled cloud, while the rest are regarded as normal user accounts. To ensure computational efficiency while focusing on relevant structural information, we applied a 2-hop subgraph extraction centered on the labeled phishing nodes, following the network construction strategy described in the previous section. After filtering, the resulting subgraph consists of 69,227 nodes and 231,999 edges, preserving all 1,154 phishing nodes. Detailed statistics of the processed dataset are provided in Table 1.

Table 1:
Node information within the dataset.
Account type Nodes Avg. in-deg. Avg. out-deg. Avg. input amount Avg. output amount
Phishing account 1,154 18.5363 5.48613 56.6362 39.3702
Normal account 68,073 3.0938 3.3150 49.6651 49.9578
DOI: 10.7717/peerj-cs.3445/table-1

Experimental setup

We evaluated the performance of the proposed method by comparing it with eight relevant approaches, including both classical and state-of-the-art techniques. Among these, Transaction to Vector (Trans2Vec) (Wu et al., 2020) is a classic node embedding model, while GCN (Kipf & Welling, 2017) and Graph Sample and Aggregate (GraphSAGE) (Hamilton, Ying & Leskovec, 2017) are traditional GNN-based models. GAT (Veličković et al., 2018), Graph Transformer (Graphormer) (Ying et al., 2021), Graph Attention Network Version 2 (GATv2) (Brody, Alon & Yahav, 2021), EGAT (Edge Aggregated Graph Attention Networks) (Wang, Chen & Chen, 2021), and Multi-level Graph Transformer (M-Graphormer) (Chang et al., 2024a) are state-of-the-art GNN models that incorporate attention mechanisms. The detailed descriptions of these methods are as follows:

Trans2Vec: A method for detecting Ethereum phishing nodes that simultaneously considers transaction timestamps and amounts for node embedding.

Graph Convolutional Network: Utilizes information from adjacent nodes in graph-structured data to learn node representations. GCN updates each node’s features by aggregating the features of neighboring nodes, thereby capturing complex relationships between nodes.

GraphSAGE: A graph neural network algorithm for generating node embeddings. It generates fixed-size embeddings for target nodes by sampling and aggregating the features of neighboring nodes, making it suitable for large-scale graph data. GraphSAGE can effectively embed any node in a graph, demonstrating good performance even in graphs with a large number of nodes.

GAT: Introduces an attention mechanism that dynamically assigns different weights to neighboring nodes. This enables GAT to more effectively aggregate information from neighboring nodes, improving the learning of node representations.

Graphormer: A graph neural network model based on the Transformer architecture, specifically designed for processing graph-structured data. It extends the self-attention mechanism to the graph domain, allowing the model to consider global dependencies when processing node representations. Graphormer captures complex graph patterns and long-range dependencies by effectively aggregating information from the entire graph, demonstrating excellent performance in various graph learning tasks.

GATv2: An improved version of the original GAT model, designed to enhance the flexibility and expressiveness of GNNs in node feature aggregation. It introduces a more powerful attention mechanism, allowing the model to consider richer contextual information when aggregating features from neighboring nodes, thereby improving the quality of node representations and overall model performance.

EGAT: Enhances the model’s capabilities by applying an attention mechanism to the edges of a graph, enabling dynamic adjustment of information flow based on the importance of different edges.

M-Graphormer: An enhanced version of the Graphormer model, designed for multi-channel graph transformers in node representation learning.

Hyperparameter settings

During the random walk phase, the parameters are configured as follows: walk length L = 40, number of walks per node T = 50, and embedding dimension d = 256. For the downstream classification task, an XGBoost (XGB) classifier is adopted with the following hyperparameters: learning_rate = 0.5, n_estimators = 100, and max_depth = 5.
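For reproducibility, the settings above can be collected into configuration dictionaries (the dictionary names are illustrative; `XGB_PARAMS` could be passed to `xgboost.XGBClassifier(**XGB_PARAMS)`):

```python
# Random-walk and embedding settings stated above
WALK_PARAMS = {"walk_length": 40, "num_walks": 50, "embedding_dim": 256}

# XGBoost classifier hyperparameters stated above
XGB_PARAMS = {"learning_rate": 0.5, "n_estimators": 100, "max_depth": 5}
```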

To evaluate the classification performance, four metrics are used:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 × Precision × Recall / (Precision + Recall)

where TP represents true positives (correctly predicted positive cases), FP represents false positives (incorrectly predicted positive cases), FN represents false negatives (incorrectly predicted negative cases), and TN represents true negatives (correctly predicted negative cases). The AUC (Area Under the Receiver Operating Characteristic (ROC) Curve) is approximated via numerical integration. A common implementation uses the trapezoidal rule to estimate the area under the ROC curve. Specifically, given a sorted list of true positive rates (TPR) and false positive rates (FPR), the AUC can be approximated through the following steps:

1. Plot pairs of false positive rate (FPR) and true positive rate (TPR) as points (x_i, y_i), where x_i is the FPR and y_i the TPR of the i-th point.

2. Sort all points by their x-values (FPR) in ascending order.

3. Approximate the area under the curve using the trapezoidal rule. For each pair of adjacent points (x_i, y_i) and (x_{i+1}, y_{i+1}), the trapezoid area is Area_i = (y_i + y_{i+1})/2 × (x_{i+1} − x_i), and the total AUC is the sum of all trapezoidal areas: AUC = ∑_{i=1}^{n−1} Area_i.
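These steps amount to a short routine; the sketch below (an illustrative implementation of the trapezoidal rule described above, not the authors' code) takes a list of (FPR, TPR) points:

```python
def trapezoidal_auc(points):
    """points: list of (FPR, TPR) pairs. Sorts by FPR and sums the
    trapezoid areas (y_i + y_{i+1})/2 * (x_{i+1} - x_i)."""
    pts = sorted(points)  # ascending FPR
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        auc += (y0 + y1) / 2.0 * (x1 - x0)
    return auc
```

A random classifier's diagonal ROC yields 0.5, and a perfect classifier yields 1.0, matching the usual interpretation of AUC.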

Experimental results and analysis

Detection performance. In this subsection, we conduct a comprehensive evaluation and analysis of the model we proposed. Specifically, we employ classification methods to assess our model, with the results presented in Table 2 and Fig. 3.

Table 2:
Comparison of performance of different models.
Best values are bold and second-best values are underlined.
Method Precision Recall F1-score AUC
GCN 0.6217 ± 0.0300 0.7217 ± 0.0250 0.6680 ± 0.0280 0.6462 ± 0.0350
GraphSAGE 0.7719 ± 0.0200 0.9244 ± 0.0150 0.8413 ± 0.0180 0.8885 ± 0.0200
GAT 0.6983 ± 0.0220 0.9156 ± 0.0180 0.7923 ± 0.0210 0.8220 ± 0.0250
Trans2Vec 0.7459 ± 0.0280 0.6926 ± 0.0300 0.7007 ± 0.0250 0.6805 ± 0.0350
Graphormer 0.7794 ± 0.0170 0.9258 ± 0.0130 0.8463 ± 0.0150 0.9199 ± 0.0180
GATv2 0.7674 ± 0.0190 0.9296 ± 0.0120 0.8408 ± 0.0160 0.8823 ± 0.0220
EGAT 0.6977 ± 0.0240 0.9233 ± 0.0200 0.7985 ± 0.0230 0.8440 ± 0.0270
M-Graphormer 0.7805 ± 0.0150 0.9161 ± 0.0170 0.8533 ± 0.0140 0.9159 ± 0.0190
Ours 0.9152 ± 0.0050 0.9412 ± 0.0040 0.9237 ± 0.0050 0.9618 ± 0.0050
DOI: 10.7717/peerj-cs.3445/table-2
Comparison of precision, recall, F1-score, and AUC metrics across different models.

Figure 3: Comparison of precision, recall, F1-score, and AUC metrics across different models.

The proposed model consistently outperforms baselines, particularly in terms of F1 and AUC, indicating superior overall detection capability.

Our method demonstrates significant advantages in terms of precision, recall, F1-score, and AUC, outperforming the second-best method by 14.33%, 1.16%, 7.04%, and 4.19%, respectively. This is primarily attributed to our approach’s ability to capture comprehensive features, enabling not only effective identification of positive samples but also accurate detection of negative samples. Among all the comparison methods, GCN and Trans2Vec performed poorly. Specifically, GCN aggregates neighbor information by smoothing the adjacency matrix and node features. However, due to the complex interaction patterns between phishing accounts, GCN struggles to capture these intricate dependencies, thereby limiting its performance. On the other hand, Trans2Vec is a low-dimensional representation learning method that trains embedding vectors based on the similarity information of adjacent nodes. Although it considers transaction timestamps and amounts during the random walk process, it neglects behavioral information of nodes and graph-based trading features, extracting only local information. While GraphSAGE effectively handles large-scale graph data and supports inductive learning, it randomly samples neighbor nodes. This sampling strategy is effective for identifying phishing nodes but fails to capture sufficient information for normal nodes, as edge nodes often lack enough informative neighbors. Both GAT and GATv2 leverage attention mechanisms, but their performance is constrained due to the sparse distribution of phishing nodes and their significantly smaller proportion compared to normal nodes. Notably, GATv2 outperforms GAT, likely because its improved dynamic attention mechanism is more adept at capturing the complex information inherent in Ethereum transaction networks.

To further validate the robustness of our model, we conducted experiments using 60% and 30% of the total dataset to evaluate performance on smaller-scale datasets. The detailed numerical results are presented in Table 3, which summarizes precision, recall, F1-score, and AUC for all baseline models and our method (Ours). As shown in the table, reducing the dataset size generally results in performance degradation for most models, indicating sensitivity to data volume. Nevertheless, our model consistently achieves the highest values across all metrics, demonstrating superior generalization even with limited data.

Table 3:
Comparison of performance of different models on smaller-scale datasets.
Best values are bold, and second-best values are underlined.
Method 60% data 30% data
Precision Recall F1-score AUC Precision Recall F1-score AUC
GCN 0.6095 ± 0.0300 0.6195 ± 0.0280 0.6572 ± 0.0270 0.6377 ± 0.0330 0.5902 ± 0.0350 0.5333 ± 0.0400 0.5604 ± 0.0380 0.5582 ± 0.0420
GraphSAGE 0.7317 ± 0.0220 0.8117 ± 0.0200 0.8014 ± 0.0210 0.8114 ± 0.0190 0.6667 ± 0.0240 0.7452 ± 0.0220 0.7040 ± 0.0230 0.7865 ± 0.0210
GAT 0.6411 ± 0.0250 0.8066 ± 0.0220 0.8126 ± 0.0200 0.7611 ± 0.0280 0.6001 ± 0.0320 0.5577 ± 0.0330 0.5782 ± 0.0310 0.7483 ± 0.0260
Trans2Vec 0.7497 ± 0.0280 0.6697 ± 0.0300 0.7072 ± 0.0290 0.6705 ± 0.0340 0.5699 ± 0.0370 0.6412 ± 0.0360 0.6032 ± 0.0340 0.6061 ± 0.0350
Graphormer 0.7662 ± 0.0190 0.8522 ± 0.0160 0.8176 ± 0.0170 0.9014 ± 0.0180 0.6921 ± 0.0200 0.8049 ± 0.0220 0.7446 ± 0.0210 0.8930 ± 0.0200
GATv2 0.7148 ± 0.0210 0.8848 ± 0.0180 0.8482 ± 0.0190 0.8548 ± 0.0220 0.7057 ± 0.0230 0.8365 ± 0.0210 0.7660 ± 0.0220 0.8111 ± 0.0230
EGAT 0.6910 ± 0.0250 0.8631 ± 0.0220 0.7672 ± 0.0230 0.8274 ± 0.0240 0.6212 ± 0.0280 0.7275 ± 0.0260 0.6706 ± 0.0270 0.7668 ± 0.0250
M-Graphormer 0.7303 ± 0.0200 0.8503 ± 0.0180 0.8398 ± 0.0170 0.8638 ± 0.0190 0.7016 ± 0.0220 0.7977 ± 0.0200 0.7466 ± 0.0210 0.8363 ± 0.0190
Ours 0.9202 ± 0.0050 0.9400 ± 0.0040 0.9256 ± 0.0050 0.9608 ± 0.0050 0.9308 ± 0.0060 0.9515 ± 0.0050 0.9434 ± 0.0050 0.9698 ± 0.0040
DOI: 10.7717/peerj-cs.3445/table-3

To better illustrate these results, we also visualized the performance of each model using bar charts for each metric, as shown in Figs. 4 and 5. Each bar represents the performance of a specific model on the 60% or 30% dataset, and error bars indicate the standard deviation across repeated experiments, reflecting performance variability. The visualizations highlight that our model not only achieves the highest mean performance but also exhibits smaller error bars compared to other models, indicating strong stability and robustness under reduced dataset sizes.

Performance of different models on the 60% dataset.

Figure 4: Performance of different models on the 60% dataset.

Each subfigure represents one evaluation metric: (A) Precision, (B) Recall, (C) F1-score, and (D) AUC. Our model achieves consistently superior results across all metrics on this subset.
Performance of different models on the 30% dataset.

Figure 5: Performance of different models on the 30% dataset.

Each subfigure represents one evaluation metric: (A) Precision, (B) Recall, (C) F1-score, and (D) AUC. The results show that even with limited training data, our model maintains robust and accurate performance across all metrics.

Specifically, in the F1-score and AUC charts, Ours maintains a clear advantage over all baselines for both 60% and 30% datasets, while some models such as GAT show larger fluctuations due to sensitivity to reduced data. Other baseline models, including GCN, GraphSAGE, and EGAT, exhibit moderate variability, whereas Graphormer, GATv2, and M-Graphormer remain relatively stable. Overall, the combination of tabular results in Table 3 and the visual bar charts in Figs. 4, 5 provides a comprehensive evaluation of model performance and confirms the effectiveness and robustness of our approach across different dataset scales.

Ablation study

To explore the contributions of the three feature modules, we conducted ablation experiments on our model. The performance of each single feature module is presented in Table 4. Specifically, the node behavior feature module FNB achieves an F1-score of 0.8566, the graph-based transaction network structure feature FTN achieves 0.9024, and the global feature module FG achieves 0.8699, while the full model incorporating all three modules achieves 0.9237. Among the single-feature modules, FTN shows the greatest contribution, followed by FG, and FNB contributes the least.

Table 4:
Performance comparison of single feature modules ( FNB, FTN, FG) and full model.
Precision Recall F1-score AUC
FNB 0.8571 ± 0.0200 0.8569 ± 0.0210 0.8566 ± 0.0190 0.9225 ± 0.0220
FTN 0.9034 ± 0.0180 0.9024 ± 0.0200 0.9024 ± 0.0170 0.9541 ± 0.0190
FG 0.8701 ± 0.0190 0.8701 ± 0.0200 0.8699 ± 0.0180 0.9067 ± 0.0210
Full 0.9152 ± 0.0050 0.9412 ± 0.0040 0.9237 ± 0.0050 0.9618 ± 0.0050
DOI: 10.7717/peerj-cs.3445/table-4

Based on these results, we analyzed the characteristics of each module. The graph-based transaction network structure feature FTN contributes the most due to its ability to capture the unique network structure of the Ethereum transaction network and the behavior patterns of phishing nodes. Nodes in the Ethereum transaction network exhibit varying importance, and phishing nodes often perform a large number of transactions to increase visibility for fraudulent purposes. Consequently, these nodes are easily identified by graph-based structural features. The global feature module FG contributes secondarily, which can be attributed to the construction of 2nd-order subgraphs during Ethereum transaction data organization. The distribution of labeled phishing nodes is scattered, leading to multiple subgraphs, and nodes cannot always obtain high-quality embeddings during random walks. Nonetheless, embeddings of phishing nodes within subgraphs still retain significance. The node behavior feature FNB contributes the least, as it is unable to fully capture complex transaction patterns. While FNB alone achieves a moderate F1-score, it is less effective than the other modules in recognizing phishing nodes with complex or atypical transaction behaviors.

To provide a more rigorous evaluation, we further conducted combination ablation experiments, where one feature module is removed at a time. The results are visualized in Fig. 6, showing the performance of module combinations FNB+FTN, FNB+FG, and FTN+FG in comparison to the full model. The F1-scores, Precision, Recall, and AUC all decrease when any single module is removed, confirming that each module contributes positively to the overall performance. Notably, removing the graph-based transaction network feature FTN results in the largest performance drop (F1 = 0.8958), highlighting its critical role in the model. Removing FNB or FG also leads to performance degradation, though to a lesser extent, indicating that these modules provide complementary information. Overall, these results demonstrate that all three feature modules are necessary for achieving the best performance, and the combination ablation study provides a more reliable assessment of their contributions than evaluating single modules alone.

Performance comparison of different module combinations in the proposed framework.

Figure 6: Performance comparison of different module combinations in the proposed framework.

Each bar represents a module or combination of modules ( FNB+FTN, FNB+FG, FTN+FG, Full). The results demonstrate that all components contribute positively, and their integration leads to the best overall performance.

Parameter sensitivity analysis

We conduct a sensitivity analysis to investigate the influence of key hyperparameters involved in the biased random walk and representation learning process. Specifically, we focus on three parameters: walk length L, number of walks per node T, and embedding dimension d. The evaluation considers both classification performance (measured by F1-score) and memory usage.

As shown in Fig. 7, increasing the walk length L (Fig. 7A) from 10 to 40 yields a consistent improvement in F1-score, indicating that longer walks capture more informative structural context. However, further increasing L to 60 leads to diminishing returns and a slight performance drop, likely due to noise accumulation. Additionally, memory usage grows linearly with L, highlighting a trade-off between effectiveness and resource cost. Based on this, we set L=40 as the optimal choice.

(A) Impact of walk length. (B) Impact of walks per node. (C) Impact of embedding dimension.

Figure 7: (A) Impact of walk length. (B) Impact of walks per node. (C) Impact of embedding dimension.

Similarly, Fig. 7B shows that increasing the number of walks per node T improves model performance up to T=50, beyond which the gain becomes marginal while memory usage continues to grow. Thus, T=50 achieves a good balance between performance and efficiency.

For embedding dimension d (Fig. 7C), the results indicate that performance benefits from increasing dimensionality up to d=256. Beyond this point, the improvement plateaus while memory consumption increases sharply. Therefore, we fix d=256 in our experiments to ensure expressive node representations without excessive resource consumption.

These observations support the choice of L=40, T=50, and d=256 as default settings, balancing detection performance with computational efficiency.

Conclusion and future work

Conclusion

In this work, we propose R-NTN, a novel phishing detection framework for Ethereum that models Ethereum accounts by extracting features from three distinct dimensions. This approach enables global and comprehensive detection, demonstrating robust performance under conditions of highly imbalanced data. Experimental results show that R-NTN outperforms both baseline and state-of-the-art methods in terms of detection accuracy. Furthermore, experiments on smaller-scale datasets confirm the superior robustness of our method compared to existing approaches.

Future work

Although the multi-dimensional feature extraction employed by R-NTN alleviates the issue of extreme data imbalance in Ethereum data samples to some extent, this challenge remains and continues to present difficulties. As a result, our future work will focus on leveraging graph self-supervised learning techniques, specifically through graph contrastive learning, to construct positive and negative samples. This will aim to improve the precision of Ethereum phishing node detection.