A hybrid anomaly detection method for high dimensional data
- Academic Editor
- Carlos Fernandez-Lozano
- Subject Areas
- Algorithms and Analysis of Algorithms, Data Mining and Machine Learning, Data Science
- Keywords
- Anomaly detection, Autoencoder, High-dimensional data, Support vector machine
- Copyright
- © 2022 Zhang et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
- Cite this article
- 2023. A hybrid anomaly detection method for high dimensional data. PeerJ Computer Science 9:e1199 https://doi.org/10.7717/peerj-cs.1199
Abstract
Anomaly detection in high-dimensional data is challenging because the sparsity of the data distribution caused by high dimensionality provides little information for distinguishing anomalous instances from normal ones. To address this, this article proposes an anomaly detection method that combines an autoencoder and a sparse weighted least squares-support vector machine. First, the autoencoder extracts low-dimensional features of the high-dimensional data, thus reducing the dimension and the complexity of the search space. Then, in the low-dimensional feature space obtained by the autoencoder, the sparse weighted least squares-support vector machine separates anomalous and normal features. Finally, the learned class labels used to distinguish normal instances from abnormal instances are output, thus achieving anomaly detection of high-dimensional data. Experimental results on real high-dimensional datasets show that the proposed method outperforms competing methods in terms of anomaly detection ability. For high-dimensional data, deep methods can reconstruct a layered feature space, which is beneficial for obtaining better anomaly detection results.
Introduction
Anomaly detection is an important part of data mining. Anomalies appear in many contexts, such as system failures, abnormal behavior, and data outliers. Anomalies (aka outliers) have two notable properties: (i) rarity, i.e., it is difficult to label anomalies because anomalous instances are sparse; and (ii) type diversity, i.e., there are many types of anomalies, such as point anomalies, group anomalies, and conditional anomalies.
Anomalous features are more likely to be manifested in a low-dimensional space, but they are more hidden in a high-dimensional space. Due to high dimensionality, distance contrasts between data points become similar (Yu & Chen, 2019; Menon & Kalyani, 2019), whereas most anomaly detection methods explicitly or implicitly rely on distance contrast (Li, Lv & Yi, 2020). High dimensionality therefore easily causes such distance-based methods to fail. Furthermore, data in a high-dimensional space follow a sparse distribution, which can hardly afford rich information for distinguishing abnormal and normal instances (Soleimani & Miller, 2016). In this case, anomaly detection for data in a high-dimensional space is a challenge.
Currently, anomaly detection methods can be divided into the following categories. (i) Distance metric-based methods do not require the data distribution, e.g., K-nearest neighbor (K-NN) (Chehreghani, 2016) and Random Distances (Wang et al., 2020b). Distance metrics become more and more similar as data dimensionality increases, so such methods easily suffer the negative effects of high dimensionality. (ii) Deep learning-based methods can not only learn deep features of the data but also interpret the learned features (Yuan et al., 2018), e.g., the Bayesian Variational Autoencoder (BVAE) (Daxberger & Hernández-Lobato, 2019) and the methods in Grosnit et al. (2022), Bourached et al. (2022), Grathwohl et al. (2019), Goodfellow, Bengio & Courville (2019) and Ian et al. (2014). These methods include unsupervised and supervised variants. Unsupervised detection methods do not rely on data labels but are very sensitive to noise and missing data (Parchami et al., 2017), e.g., Deep One-class Classification (DOC) (Ruff et al., 2018) and the Generative Adversarial Network (GAN) (Li et al., 2019). The objective function of unsupervised detection methods is mostly designed for dimensionality reduction or data compression, so their anomaly detection accuracy is usually lower than that of supervised detection methods. Supervised detection methods, in contrast, show better detection performance because they rely on data labels, such as the methods in Metzen et al. (2017) and Grosse et al. (2017). Unfortunately, labeling the data is a challenge when the amount of data is large or the data is high-dimensional; therefore, supervised anomaly detection methods are difficult to adapt to anomaly detection of high-dimensional and large-scale data. (iii) Deep hybrid-based methods, such as Deep Neural Networks-Support Vector Machine (DNN-SVM) (Inoue et al., 2017) and the deep autoencoder with ensemble k-nearest neighbor (DAE-KNN) (Song et al., 2017), combine deep methods with traditional detection methods; therefore, they inherit the characteristics of both. They have natural advantages in anomaly detection, but computational cost and accuracy must be traded off. (iv) A typical representative of traditional detection methods is the family of support vector machine (SVM)-based methods, such as OC-SVM (Ergen, Hassan Mirza & Serdar Kozat, 2017) and SVM (Erfani et al., 2016; Shi & Zhang, 2021). SVM-based methods are susceptible to the linear inseparability of high-dimensional data (Jerónimo Pasadas et al., 2020); therefore, Wang et al. (2020a) proposed an improved SVM-based method.
The motivation of this article is to detect anomalies in high-dimensional data. This article proposes a hybrid approach combining an autoencoder and a sparse weighted least squares-support vector machine, namely AE-SWLS-SVM. First, the autoencoder extracts low-dimensional features of the high-dimensional data, thereby reducing the data dimensionality and the complexity of the search space. Then, in the low-dimensional feature space obtained by the autoencoder, the sparse weighted least squares-support vector machine separates normal and abnormal features. Finally, the class labels used to distinguish normal instances from abnormal instances are output, thereby realizing anomaly detection of high-dimensional data.
We summarize the main contributions of this work as follows.
- The proposed AE-SWLS-SVM adapts well to high-dimensional environments during anomaly detection, since the autoencoder captures layered features from high-dimensional data, which provides a favorable environment for the sparse weighted least squares-SVM to distinguish normal features from abnormal features.
- For high-dimensional data, the layered feature space reconstructed by deep methods is beneficial for obtaining better anomaly detection results. The contrast of distances between data points weakens as data dimensionality increases in high-dimensional spaces; however, in the layered feature space reconstructed by deep methods, the contrast of distances between data points becomes significant again.
Materials and methods
Overall scheme
Figure 1 displays the overall scheme of the proposed method, which includes feature extraction, feature separation and instance reconstruction. In the first stage, feature extraction, the encoder captures low-dimensional features from the input data, which provides a good environment for feature separation in the next stage. In the second stage, feature separation, the sparse weighted least squares-support vector machine separates abnormal and normal features in the low-dimensional feature space. In the third stage, instance reconstruction, the decoder reconstructs normal and abnormal instances from the separated features. Finally, the learned class labels are output.
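To make this data flow concrete, the Python sketch below traces one pass through the three stages; the callables `encoder`, `swls_svm` and `decoder` are placeholders for the components defined in the following subsections, not the authors' implementation.

```python
import numpy as np

def detect(x_high_dim, encoder, swls_svm, decoder):
    """Sketch of the three-stage AE-SWLS-SVM flow (names are illustrative)."""
    # Stage 1: feature extraction -- map high-dimensional instances
    # onto the low-dimensional feature space learned by the encoder.
    z = encoder(x_high_dim)              # shape: (n_instances, l), with l << h

    # Stage 2: feature separation -- SWLS-SVM assigns +1 (normal)
    # or -1 (anomalous) to each low-dimensional feature vector.
    labels = swls_svm(z)                 # shape: (n_instances,)

    # Stage 3: instance reconstruction -- the decoder maps the separated
    # features back to the input space; labels are returned alongside.
    x_reconstructed = decoder(z)
    return labels, x_reconstructed
```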
Feature extraction
Autoencoders show excellent ability in capturing low-dimensional features of high-dimensional data. For simplicity, denote by Ω: ℜ^h → ℜ^l the autoencoder that extracts low-dimensional features from high-dimensional data, where ℜ^h is the h-dimensional input space (the data dimension is h) and ℜ^l is the corresponding low-dimensional feature space of dimensionality l, with l < h. Ω consists of an input layer, an output layer and multiple hidden layers. The formal description of Ω is as follows.
Input layer. In_Ω achieves the mapping of the input data, i.e., the data in ℜ^h is mapped onto In_Ω.
Multiple hidden layers. The encoder and the decoder each contain hidden layers, so we describe them separately.
(i) Hidden layers in the encoder. The input and the output of the n-th hidden layer in the i-th iteration are denoted as $u_n^{(i)}$ and $h_n^{(i)}$, respectively; they are calculated by Eqs. (1) and (2):
$$u_n^{(i)} = w_n\, h_{n-1}^{(i)} + b_n \qquad (1)$$
$$h_n^{(i)} = f_{en}\left(u_n^{(i)}\right) \qquad (2)$$
(ii) Hidden layers in the decoder. Correspondingly, the input and the output of the m-th hidden layer in the i-th iteration are denoted as $\tilde{u}_m^{(i)}$ and $\tilde{h}_m^{(i)}$, respectively:
$$\tilde{u}_m^{(i)} = w_m\, \tilde{h}_{m-1}^{(i)} + b_m \qquad (3)$$
$$\tilde{h}_m^{(i)} = f_{de}\left(\tilde{u}_m^{(i)}\right) \qquad (4)$$
where $f_{en}(\cdot)$ and $f_{de}(\cdot)$ are the activation functions in the encoding and decoding hidden layers, respectively, and w and b are the weights and biases of the hidden layers.
Output layer. Out_Ω sends out the reconstructed instances. Ω is formally defined as follows:
$$z = \Omega(x) = f_{de}\big(f_{en}(x)\big) \qquad (5)$$
where x and z are the input and the reconstructed input, respectively. The loss function $L_{\Omega}$ of Ω is given in Eq. (6):
$$L_{\Omega} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert x_i - z_i \right\rVert^{2} \qquad (6)$$
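As an illustration of Ω and the reconstruction loss in Eq. (6), a minimal tf.keras sketch is given below; the layer widths, activations and input dimension are assumptions for illustration, not the configuration used in this article.

```python
import tensorflow as tf

h, l = 10000, 64   # input dimension and feature dimension (illustrative values)

# Encoder: maps the h-dimensional input space onto the l-dimensional feature space.
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="sigmoid", input_shape=(h,)),
    tf.keras.layers.Dense(l, activation="sigmoid"),
])

# Decoder: maps the l-dimensional features back to the h-dimensional input space.
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="sigmoid", input_shape=(l,)),
    tf.keras.layers.Dense(h, activation="sigmoid"),
])

# Omega(x) = decoder(encoder(x)); "mse" corresponds to the mean squared
# reconstruction error of Eq. (6).
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
```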
Indeed, proper regularization can improve the ability of autoencoders to capture features (Lu et al., 2017); therefore, $L_{\Omega}$ is regularized by introducing an $L_2$ regularization term and the Jensen-Shannon (J-S) divergence $JS_{sparse}$:
$$\tilde{L}_{\Omega} = L_{\Omega} + \alpha\, L_{2} + \eta\, JS_{sparse} \qquad (7)$$
where α and η are trade-off coefficients, and $L_2$ (Olshausen & Field, 1997) is the regularization term that constrains the weights of Ω so that the components of w are as balanced as possible. The $JS_{sparse}$ divergence is a variant of the K-L divergence; because it is symmetric, it avoids the asymmetry problem of the K-L divergence (Cattai et al., 2021; Li et al., 2021). The $JS_{sparse}$ divergence is calculated as follows:
$$JS_{sparse}(P_1 \parallel P_2) = \frac{1}{2}\, KL\!\left(P_1 \,\Big\|\, \frac{P_1 + P_2}{2}\right) + \frac{1}{2}\, KL\!\left(P_2 \,\Big\|\, \frac{P_1 + P_2}{2}\right) \qquad (8)$$
where P1 represents the true distribution of the data. P2 is the theoretical distribution of the data or an approximate distribution of P1.
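The NumPy sketch below shows how the J-S divergence of Eq. (8) can be computed and used as a sparsity-style penalty; the target activation level rho and the activation values are illustrative assumptions, since the article does not specify them here.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence between discrete distributions p and q."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return np.sum(p * np.log(p / q))

def js_divergence(p1, p2):
    """Symmetric Jensen-Shannon divergence of Eq. (8)."""
    m = 0.5 * (np.asarray(p1, float) + np.asarray(p2, float))
    return 0.5 * kl(p1, m) + 0.5 * kl(p2, m)

# Example: penalty between a target activation level (P1) and the mean hidden
# activations (P2); rho and the activation values are made-up for illustration.
rho = 0.05
activations = np.array([0.04, 0.10, 0.02, 0.07])
penalty = sum(js_divergence([rho, 1 - rho], [a, 1 - a]) for a in activations)
```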
Feature separation
The support vector machine (SVM) is often used for classification tasks because of its excellent classification ability. Anomaly detection can be regarded as a binary classification of normal and abnormal classes. Based on this, we improve the structure of the SVM by introducing sparse weighted least squares, yielding SWLS-SVM.
Given input samples $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{+1, -1\}$ is the label of $x_i$, SWLS-SVM is defined as follows:
$$\min_{w,\,b,\,\xi}\; \frac{1}{2}\left\lVert w \right\rVert^{2} + \lambda \sum_{i=1}^{N} \beta_i\, L_i(\xi_i)
\quad \text{s.t.} \quad y_i\left(w^{\top}\phi(x_i) + b\right) = 1 - \xi_i,\; i = 1,\dots,N \qquad (9)$$
where w and b are the weight and bias, λ is a regularization parameter, $\beta_i$ is a weight coefficient, $\xi_i$ and $L_i$ are an error term and the error function, respectively (in the least-squares setting, $L_i(\xi_i) = \frac{1}{2}\xi_i^{2}$), and $\phi(\cdot)$ is a non-linear mapping function. The goal of SWLS-SVM is to minimize the objective in Eq. (9).
SWLS-SVM employs a structural risk model comprising the empirical risk $\sum_{i}\beta_i L_i(\xi_i)$ and the regularization term $\frac{1}{2}\lVert w\rVert^{2}$. If the regularization parameter λ is larger, the empirical risk becomes more important, which easily leads to over-fitting. If λ is small, the empirical risk becomes less important, so that the effect of anomalies on the error function $L_i$ can be ignored.
Equation (9) is a convex quadratic programming problem, and its dual can be obtained using Lagrange multipliers. The constrained objective function in Eq. (9) is transformed into the unconstrained Lagrangian function
$$F(w, b, \xi, \alpha) = \frac{1}{2}\left\lVert w \right\rVert^{2} + \lambda \sum_{i=1}^{N}\beta_i\, L_i(\xi_i) - \sum_{i=1}^{N}\alpha_i\left[y_i\left(w^{\top}\phi(x_i)+b\right) - 1 + \xi_i\right] \qquad (10)$$
where $\alpha_i > 0$ is the Lagrange multiplier. To minimize $F(w, b, \xi, \alpha)$, we set the partial derivatives with respect to w, b, $\xi_i$ and $\alpha_i$ to zero:
$$\frac{\partial F}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N}\alpha_i y_i \phi(x_i), \quad
\frac{\partial F}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N}\alpha_i y_i = 0, \quad
\frac{\partial F}{\partial \xi_i} = 0 \Rightarrow \alpha_i = \lambda \beta_i \xi_i, \quad
\frac{\partial F}{\partial \alpha_i} = 0 \Rightarrow y_i\left(w^{\top}\phi(x_i)+b\right) - 1 + \xi_i = 0 \qquad (11)$$
The mapping function $\phi(\cdot)$ in Eq. (11) can be replaced by a kernel function $\kappa(\cdot,\cdot)$ according to the KKT (Karush-Kuhn-Tucker) condition (Peng & Xu, 2013):
$$\kappa(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j) \qquad (12)$$
Substituting Eq. (12) into Eq. (11) and eliminating the variables w and $\xi_i$ yields the linear system in Eq. (13):
$$\begin{bmatrix} 0 & y^{\top} \\ y & \mathbf{K}_y + V_{\lambda} \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix} \qquad (13)$$
where $(\mathbf{K}_y)_{ij} = y_i y_j\,\kappa(x_i, x_j)$, $V_{\lambda} = \mathrm{diag}\left(\tfrac{1}{\lambda\beta_1},\dots,\tfrac{1}{\lambda\beta_N}\right)$, $y = (y_1,\dots,y_N)^{\top}$, and $\mathbf{1}$ is an all-ones vector.
By solving Eq. (13), the bias b and the Lagrange multipliers $\alpha_i$ are obtained, which gives the SWLS-SVM decision function
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{N}\alpha_i y_i\,\kappa(x, x_i) + b\right) \qquad (14)$$
Any semi-positive definite symmetric function can be used as a kernel function (Jayasumana et al., 2014); therefore, we obtain a positive definite kernel by calculating a cumulative distribution function (Jayasumana et al., 2013), giving Eq. (15)
where A, B are the non-negative kernel parameters.
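To illustrate how Eqs. (13) and (14) are used in practice, the NumPy sketch below solves the weighted least squares-SVM linear system and evaluates the decision function. An RBF kernel stands in for the CDF-based kernel of Eq. (15), which is not reproduced here, and the parameter values (gamma, lam, beta) are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """Placeholder RBF kernel; the article's CDF-based kernel of Eq. (15) differs."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

def fit_wls_svm(X, y, beta, lam=1.0, gamma=0.5):
    """Solve the weighted LS-SVM linear system (cf. Eq. (13)) for b and alpha."""
    n = len(y)
    K = rbf_kernel(X, X, gamma)
    Ky = (y[:, None] * y[None, :]) * K              # y_i * y_j * k(x_i, x_j)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Ky + np.diag(1.0 / (lam * beta))    # add V_lambda on the diagonal
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                          # bias b, multipliers alpha

def decision(X_test, X, y, alpha, b, gamma=0.5):
    """Decision function of Eq. (14): sign(sum_i alpha_i y_i k(x, x_i) + b)."""
    return np.sign(rbf_kernel(X_test, X, gamma) @ (alpha * y) + b)
```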
Model
Model structure and training
The proposed AE-SWLS-SVM consists of an encoding module, a feature separation module and a decoding module, as shown in Fig. 2. The role of each module is described below.
Module 1, i.e., encoding module. In the module, hidden layers capture the low-dimensional features of the input data. The error of capturing low-dimensional features can be minimized by Eq. (7).
Module 2, i.e., feature separation module. The module separates normal and abnormal features. In the low-dimensional feature space, the kernel function in Eq. (15) completes the separation of normal features and abnormal features.
Module 3, i.e., the decoding module. From the separated features, the hidden layers in this module reconstruct normal and abnormal instances. Finally, the output layer outputs the learned normal and abnormal class labels.
The objective function of AE-SWLS-SVM combines the autoencoder loss $\tilde{L}_{\Omega}$ in Eq. (7) with the SWLS-SVM error derived from Eq. (14):
$$T_{target} = \tilde{L}_{\Omega} + L_{SWLS} \qquad (16)$$
where $L_{SWLS}$ denotes the classification error of SWLS-SVM on the extracted features.
The performance of AE-SWLS-SVM depends on its parameters, so the parameters of the model need to be cross-validated in advance, as shown in Algorithm 1. First, the training set is used to train the model, and then the testing set is used to validate the trained model. By analyzing the testing results, the optimal kernel parameter value and the optimal number of neurons are selected. Steps 2 to 18 mainly perform cross-validation of the number of neurons to obtain the optimal number of neurons opt(δ), where Steps 3 to 14 perform cross-validation of the kernel parameter to obtain the optimal kernel parameter value opt(γ).
After obtaining the optimal parameters, we use the training set to train AE-SWLS-SVM, as shown in Algorithm 2. Steps 1 to 9 realize the training of the model. During the training, the objective function is iteratively calculated; when the model converges, we stop the training and save the current training accuracy. From Step 10 to Step 14, we select the maximum training accuracy, obtained in the tmax-th training, as the final output accuracy of the model, and then save the model of the tmax-th training.
Algorithm 1. Cross-validation of parameters.
Input: iteration epoch T, numbers of neurons δ1, δ2, constant Δδ, kernel parameter set γk, training set Train_set, testing set Test_set.
Output: optimal kernel parameter value opt(γ), optimal number of neurons opt(δ).
Begin
1.  for t = 1 to T do:
2.    for δ = δ1 to δ2 do:
3.      for each γ in γk:
4.        Train model SWLS-SVM(Train_set, δ, γ);
5.        for i = 1 to I do:
6.          Calculate the weight coefficient βi;
7.          Learn objective function Ttarget;
8.          Calculate training accuracy Train_Acc = SWLS-SVM(Train_set, δ, γ);
9.        end for
10.       Test model SWLS-SVM(Test_set, δ, γ);
11.       Calculate testing accuracy Test_Acc = SWLS-SVM(Test_set, δ, γ);
12.     end for each
13.     Select the γ that maximizes testing accuracy: γ(max) = argmax_γ(Test_Acc(γ));
14.     Obtain the optimal kernel parameter opt(γ) = γ(max);
15.     δ = δ + Δδ;
16.   end for
17.   Select the δ that maximizes testing accuracy: δ(max) = argmax_δ(Test_Acc(δ, γ(max)));
18.   Obtain the optimal number of neurons opt(δ) = δ(max);
19. end for
End
Algorithm 2. Model training.
Input: iteration epoch T, opt(γ), opt(δ), training set Training_set.
Output: training accuracy Training_acc.
Begin
1.  for t = 1 to T do:
2.    Train model SWLS-SVM(Training_set; opt(γ); opt(δ));
3.    for i = 1 to I do:
4.      Calculate the weight coefficient βi;
5.      Learn objective function Ttarget;
6.      Calculate training accuracy Training_acc(t) = SWLS-SVM(Training_set; opt(γ));
7.      Save the t-th training accuracy T_acc(save(t)) = Training_acc(t);
8.    end for
9.  end for
10. Traverse the saved training accuracies T_acc(save(t));
11. Select the maximum training accuracy in the tmax-th training:
12. tmax = argmax_t(T_acc(save(t)));
13. Save the model of the tmax-th training: Save = SWLS-SVM(Training_set; opt(γ); opt(δ); tmax);
14. Output the training accuracy of the tmax-th training: Training_acc = Training_acc(tmax);
End
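As a rough Python rendering of Algorithm 2, the best-epoch selection can be sketched as follows; the model object and its fit_epoch/save methods are placeholders for illustration, not an API defined in this article.

```python
import numpy as np

def train_best_epoch(model, training_set, opt_gamma, opt_delta, epochs):
    """Rough rendering of Algorithm 2: keep the model/accuracy of the best epoch.
    `model.fit_epoch` and `model.save` are placeholder methods."""
    saved_acc = []
    for t in range(epochs):
        acc = model.fit_epoch(training_set, gamma=opt_gamma, delta=opt_delta)
        saved_acc.append(acc)                 # save the t-th training accuracy
    t_max = int(np.argmax(saved_acc))         # epoch with the maximum accuracy
    model.save("ae_swls_svm_epoch_%d" % t_max)
    return saved_acc[t_max]
```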
Model parameters
In order to better train AE-SWLS-SVM, the parameters that have a significant impact on the training results are investigated.
(1) Kernel parameter. To induce the best performance of AE-SWLS-SVM, the optimal value of kernel parameter is searched within a certain range.
(2) Weight coefficient. The weight coefficient $\beta_i$ is calculated as
$$\beta_i = \frac{\hat{s}}{\lvert \xi_i \rvert} \qquad (17)$$
where $\hat{s}$ is the standard deviation of the errors $\xi$ and $\lvert \xi_i \rvert$ indicates the absolute value of the error of the i-th instance.
(3) Activation function. Activation functions commonly used in machine learning include Sigmoid, tanh, ELU and ReLU. The output of Sigmoid lies between 0 and 1, which suits the judgment of normal versus abnormal instances. Therefore, Sigmoid is used as the activation function of the proposed AE-SWLS-SVM (a configuration sketch is given after this list).
(4) Optimizer and learning rate. Adam is used as the optimizer for AE-SWLS-SVM. Adam not only possesses AdaGrad's ability to handle sparse gradients, but also has RMSProp's ability to handle non-stationary targets (Kingma & Lei Ba, 2015). Moreover, Adam can provide different adaptive learning rates for different hyperparameters.
(5) Number of neurons. The number of neurons is determined by cross-validation. Given the dimension and volume of the input data, configuring the number of neurons within a certain range improves the ability of the model to resist over-fitting.
(6) Iteration epoch. During the training of AE-SWLS-SVM, we observe the change of training accuracy and dynamically adjust the number of iteration epochs until the model converges, and then stop training.
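A minimal configuration sketch in tf.keras, combining the Sigmoid activation of item (3) with the Adam optimizer of item (4), is shown below; the layer sizes, learning rate, loss and output head are illustrative assumptions rather than the exact settings of this article.

```python
import tensorflow as tf

# Sigmoid activations (item 3) and the Adam optimizer (item 4); the layer
# sizes, learning rate and loss are illustrative, not the paper's values.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="sigmoid", input_shape=(10000,)),
    tf.keras.layers.Dense(50, activation="sigmoid"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # normal vs. abnormal score
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```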
Experiments
Datasets
In practical applications, real high-dimensional anomaly datasets are difficult to obtain. Therefore, we selected real high-dimensional datasets that are often used for classification and converted them into anomaly detection datasets using the method of Campos et al. (2016). We considered seven high-dimensional classification datasets (U1–U7, with dimensions greater than 165) to test the anomaly detection ability of the model. We then randomly selected five of the high-dimensional datasets U1–U7 as the training set to train our model, and this selection was repeated five times independently. In addition, three benchmark datasets (B1, B2, B3) were selected; after our model is trained, the benchmark datasets B1, B2 and B3 are used for parameter cross-validation and model structure validation. Table 1 gives a detailed description of these 10 datasets (seven high-dimensional datasets and three benchmark datasets).
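A hedged sketch of one common way to perform such a conversion is shown below: all instances of the normal classes are kept and one class is downsampled and relabelled as anomalies. The exact procedure of Campos et al. (2016) may differ in its sampling details, and the anomaly_ratio value is illustrative.

```python
import numpy as np

def to_anomaly_dataset(X, y, anomaly_class, anomaly_ratio=0.02, seed=0):
    """Convert a classification dataset into an anomaly detection dataset by
    downsampling one class to a small fraction and relabelling it as anomalous.
    This is one common variant of the conversion, not necessarily the exact
    procedure of Campos et al. (2016)."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y != anomaly_class)
    anom_idx = np.flatnonzero(y == anomaly_class)
    n_anom = max(1, int(anomaly_ratio * len(normal_idx)))
    anom_idx = rng.choice(anom_idx, size=min(n_anom, len(anom_idx)), replace=False)
    idx = np.concatenate([normal_idx, anom_idx])
    labels = np.where(np.isin(idx, anom_idx), -1, +1)   # -1 = anomaly, +1 = normal
    return X[idx], labels
```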
Assessment metrics
Accuracy and F1-score are used as evaluation indicators and are calculated as follows:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (18)$$
$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} = \frac{2\,TP}{2\,TP + FP + FN} \qquad (19)$$
where TP represents the number of correctly predicted anomalous instances. TN represents the number of correctly predicted normal instances. FP represents the number of normal instances that are predicted to be anomalous instances. FN represents the number of anomalous instances that are predicted to be normal instances.
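The two metrics can be computed directly from the four counts defined above; the counts in the usage example below are made-up illustrative numbers.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy of Eq. (18): fraction of correctly predicted instances."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """F1-score of Eq. (19): harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example with illustrative counts: 90 anomalies detected, 5 missed,
# 10 false alarms, 895 correctly predicted normal instances.
print(accuracy(tp=90, tn=895, fp=10, fn=5))   # 0.985
print(f1_score(tp=90, fp=10, fn=5))           # ~0.923
```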
| # | Benchmark dataset | Description (normal vs. anomaly) | Data volume (normal) | Data volume (anomaly) | Anomaly ratio | Data dimension |
|---|---|---|---|---|---|---|
| B1 | Shuttle | Class '1' vs. others | 1,000 | 13 | 1.28% | 9 |
| B2 | PenDigits | Others vs. class '4' | 9,868 | 20 | 0.20% | 16 |
| B3 | Waveform | Others vs. class '0' | 3,343 | 100 | 2.9% | 21 |
| # | High-dimensional dataset | Description (normal vs. outliers) | Data volume (normal) | Data volume (anomaly) | Anomaly ratio | Data dimension |
|---|---|---|---|---|---|---|
| U1 | Arcene | Normal patterns vs. cancer | 8,459,427 | 540,573 | 6.01% | 10,000 |
| U2 | Gisette | Zero vs. non-zero values | 28,278,760 | 4,221,240 | 12.99% | 5,000 |
| U3 | Micro Mass | Zero vs. non-zero values | 464,819 | 3,181 | 0.68% | 1,300 |
| U4 | Malware | Zero vs. non-zero values | 2,894,954 | 37,772 | 1.29% | 1,087 |
| U5 | CNAE | Zero vs. non-zero values | 918,537 | 7,023 | 0.76% | 857 |
| U6 | Epileptic Seizure | Seizure vs. non-seizure | 11,321 | 179 | 1.56% | 179 |
| U7 | Musk | Musk vs. non-musk | 79,699 | 269 | 0.34% | 168 |
Comparison methods
The proposed method is a hybrid method based on deep network architectures, so deep hybrid-based methods, i.e., DNN-SVM (Inoue et al., 2017), DAE-KNN (Song et al., 2017), and deep networks-based methods, i.e., GAN (Li et al., 2019), are used for comparisons. In addition, distance metric-based methods, i.e., K-NN (Chehreghani, 2016) are also considered.
For the proposed method, the adjustment range of the kernel parameter is defined as γk = {0.1, 0.3, 0.5, 1, 2, 3, 5}, the number of neurons δ = [δ1, δ2] with δ1 = 10, δ2 = 110, and step Δδ = 20. For a fair comparison, the comparison methods adopt the parameters reported in the corresponding literature.
We implemented the corresponding algorithms of these methods in Python 3.8 with TensorFlow 2.0 on a Linux system. The environment comprises an Intel i7 3.0 GHz CPU and 32 GB of memory. These algorithms run on the same GPU and adopt the same configuration.
Results and Discussion
Parameter testing of the model
AE-SWLS-SVM contains multiple hidden layers, and the number of hidden layers is tested in the range {1, 2, 3, 5, 7, 10, 20, 30}. The seven high-dimensional datasets U1–U7 are used as experiment datasets: we randomly selected five of them as the training set to train AE-SWLS-SVM, and the selection was repeated five times independently. After AE-SWLS-SVM is trained, the benchmark dataset B1 is used as the testing set to test the number of hidden layers of AE-SWLS-SVM.
Testing results show that when the number of hidden layers reaches 3, the detection performance (including Accuracy and F1-score) of AE-SWLS-SVM tends to be stable, as shown in Fig. 3A. This shows that the proposed model is stable on the cases considered.
With the number of hidden layers fixed at 3, the kernel parameter γ and the number of neurons δ are then tested on the benchmark datasets B2 and B3, respectively, as shown in Figs. 3B and 3C. The results show that the detection performance of AE-SWLS-SVM is best when the kernel parameter γ is 0.5 and the number of neurons δ is 50. Therefore, in the subsequent experiments, AE-SWLS-SVM is configured with 3 hidden layers, γ = 0.5 and δ = 50.
Performance comparison
Results on datasets U1–U7 show that AE-SWLS-SVM outperforms the competitors DNN-SVM, DAE-KNN, GAN and K-NN in terms of anomaly detection performance on most datasets, as shown in Fig. 4. On the higher dimensional datasets, such as U1 (dimension = 10,000) and U2 (dimension = 5,000), the anomaly detection advantages of AE-SWLS-SVM are very significant, whereas the competitor K-NN almost fails on U1 and U2. This is because the contrast of distances between data points weakens as data dimensionality increases, so distance-based measurements can hardly capture the attribute similarity of high-dimensional data, and K-NN relies precisely on such distance measurements. Although the four competitors obtain better detection results than AE-SWLS-SVM on datasets U4 and U5, their advantages are not significant. Overall, AE-SWLS-SVM shows clear advantages for anomaly detection on high-dimensional data.
Discussion
Insights
Compared with the above competitors, our model has outstanding advantages in terms of anomaly detection for the following reasons.
The autoencoder can capture low-dimensional layered features from the input data, which is crucial because they provide a sufficient condition for separating anomalous features from normal features. The loss function of the autoencoder in Eq. (7) minimizes the error of the extracted low-dimensional layered features, which provides a good environment for the kernel in Eq. (15) to separate anomalous features from normal features. By iteratively learning the objective function in Eq. (16), our model achieves good detection accuracy for anomalies.
Regarding the other anomaly detection methods: (i) distance metric-based detection methods, such as K-NN (Chehreghani, 2016), work well in low-dimensional spaces, but their detection capabilities are restricted because distance contrasts between data points become similar in high-dimensional spaces; (ii) deep learning-based detection methods, e.g., GAN (Li et al., 2019), are suitable for complex high-dimensional spaces because their nonlinear layers extract important features or learn useful representations, although Generative Adversarial Networks (GANs) suffer from unavoidable mode collapse, which makes training difficult; (iii) deep hybrid-based detection methods, such as DNN-SVM (Inoue et al., 2017) and DAE-KNN (Song et al., 2017), are increasingly popular since they inherit the advantages of deep network architectures and traditional detection methods. However, deep hybrid detection methods also show poor detection capabilities when the embedded traditional detection method depends on the data distribution or easily over-fits; for example, DBN-Random Forest (Kam Ho, 1995) shows poor noise resistance and a high risk of over-fitting, because random forests easily over-fit on samples with relatively large noise (Zheng & Zhao, 2020; Popolin Neto & Paulovich, 2021).
Limitations
The detection performance of the proposed method relies on the extracted features, which means that the quality of the extracted features has an important effect on the ability of the method. Additionally, due to the lack of real anomaly datasets, the detection accuracies of most anomaly detection methods are restricted, so it is difficult to truly reflect their detection capabilities.
Conclusion
This article proposes a hybrid method combining an autoencoder and a sparse weighted least squares-support vector machine for anomaly detection on high-dimensional data. The key idea is that the autoencoder extracts low-dimensional layered features from high-dimensional data, in order to reduce the dimension of the data and the complexity of the search space. In the low-dimensional feature space, the sparse weighted least squares-support vector machine separates anomalous features from normal features. Finally, the class labels used to distinguish normal instances from abnormal instances are output, thereby completing anomaly detection of high-dimensional data. Results show that the proposed method is superior to competitors in terms of anomaly detection ability on high-dimensional data. In future work, we will address anomaly detection under noise interference, since noise can mask rare anomalies so that anomalies are likely to be mistaken for noise.