A hybrid anomaly detection method for high dimensional data
 Academic Editor
 Carlos Fernandez-Lozano
 Subject Areas
 Algorithms and Analysis of Algorithms, Data Mining and Machine Learning, Data Science
 Keywords
 Anomaly detection, Autoencoder, High-dimensional data, Support vector machine
 Copyright
 © 2022 Zhang et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article
 2023. A hybrid anomaly detection method for high dimensional data. PeerJ Computer Science 9:e1199 https://doi.org/10.7717/peerj-cs.1199
Abstract
Anomaly detection in high-dimensional data is challenging because the sparse data distribution caused by high dimensionality hardly provides rich information for distinguishing anomalous instances from normal ones. To address this, this article proposes an anomaly detection method combining an autoencoder and a sparse weighted least squares support vector machine. First, the autoencoder extracts low-dimensional features from the high-dimensional data, reducing the dimensionality and the complexity of the search space. Then, in the low-dimensional feature space produced by the autoencoder, the sparse weighted least squares support vector machine separates anomalous and normal features. Finally, the learned class labels used to distinguish normal and abnormal instances are output, thus achieving anomaly detection for high-dimensional data. Experimental results on real high-dimensional datasets show that the proposed method outperforms competing methods in anomaly detection ability. For high-dimensional data, deep methods can reconstruct a layered feature space, which is beneficial for obtaining advanced anomaly detection results.
Introduction
Anomaly detection is an important part of data mining. Anomalies appear in many forms, such as system failures, abnormal behavior, and data outliers. Anomalies (aka outliers) have two notable properties: (i) rarity, i.e., anomalies are difficult to label because anomalous instances are sparse; and (ii) type diversity, i.e., there are many types of anomalies, such as point anomalies, group anomalies, and conditional anomalies.
Anomalous features are more likely to be manifested in a low-dimensional space, whereas they are more hidden in a high-dimensional space. Due to high dimensionality, distance contrasts between data points become similar (Yu & Chen, 2019; Menon & Kalyani, 2019), whereas most anomaly detection methods explicitly or implicitly rely on distance contrast (Li, Lv & Yi, 2020). High dimensionality therefore easily causes methods relying on distance contrast to fail. Furthermore, data in a high-dimensional space follows a sparse distribution, which hardly affords rich information for distinguishing abnormal and normal instances (Soleimani & Miller, 2016). In this case, anomaly detection for data in a high-dimensional space is a challenge.
Current anomaly detection methods can be divided into the following categories. (i) Distance metric-based methods do not require knowledge of the data distribution, e.g., K-nearest neighbor (KNN) (Chehreghani, 2016) and Random Distances (Wang et al., 2020b). Distance metrics become more and more similar as data dimensionality increases, so such methods easily suffer the negative effects of high dimensionality. (ii) Deep learning-based methods can not only learn deep features of the data, but also interpret the learned features (Yuan et al., 2018), e.g., the Bayesian Variational Autoencoder (BVAE) (Daxberger & Hernández-Lobato, 2019), and the methods in Grosnit et al. (2022), Bourached et al. (2022), Grathwohl et al. (2019), Goodfellow, Bengio & Courville (2019) and Ian et al. (2014). These detection methods can be unsupervised or supervised. Unsupervised detection methods do not rely on data labels but are very sensitive to noise and missing data (Parchami et al., 2017), e.g., Deep One-class Classification (DOC) (Ruff et al., 2018), Generative Adversarial Networks (GAN) (Li et al., 2019), etc. The objective function of unsupervised detection methods is geared more toward data dimensionality reduction or data compression, so their anomaly detection accuracy is usually lower than that of supervised detection methods. Unlike unsupervised methods, supervised detection methods show better detection performance because they rely on data labels, such as the methods in Metzen et al. (2017) and Grosse et al. (2017). Unfortunately, labeling the data is a challenge if the amount of data is large or the data is multidimensional. Therefore, supervised anomaly detection methods are difficult to adapt to anomaly detection on high-dimensional and large-scale data.
(iii) Deep hybrid-based methods, such as the Deep Neural Network-Support Vector Machine (DNN-SVM) (Inoue et al., 2017) and the deep autoencoder with ensemble k-nearest neighbor (DAE-KNN) (Song et al., 2017), combine deep methods with traditional detection methods; therefore, they inherit the characteristics of both. They have natural advantages in anomaly detection, but computational cost and detection accuracy must be traded off. (iv) A typical representative of traditional detection methods is the support vector machine (SVM) family, such as OC-SVM (Ergen, Hassan Mirza & Serdar Kozat, 2017) and SVM (Erfani et al., 2016; Shi & Zhang, 2021). SVM-based methods are susceptible to the linear inseparability of high-dimensional data (Jerónimo Pasadas et al., 2020); therefore, Wang et al. (2020a) proposed an improved SVM-based method.
The motivation of this article is to detect anomalies in high-dimensional data. To this end, this article proposes a hybrid approach combining an autoencoder and a sparse weighted least squares support vector machine, namely AE-SWLS-SVM. First, the autoencoder extracts low-dimensional features from high-dimensional data, thereby reducing data dimensionality and the complexity of the search space. Then, in the low-dimensional feature space obtained by the autoencoder, the sparse weighted least squares support vector machine separates normal and abnormal features. Finally, the class labels used to distinguish normal and abnormal instances are output, thereby realizing anomaly detection on high-dimensional data.
We summarize the main contributions of this work as follows.

The proposed AE-SWLS-SVM adapts well to high-dimensional environments during anomaly detection. The autoencoder captures layered features from high-dimensional data, which provides a beneficial environment for the sparse weighted least squares SVM to distinguish normal features from abnormal features.

For high-dimensional data, the layered feature space reconstructed by deep methods is beneficial for obtaining advanced anomaly detection results. The contrast of distances between data points weakens as data dimensionality increases in high-dimensional spaces; however, in the layered feature space reconstructed by deep methods, the contrast of distances becomes significant.
Materials and methods
Overall scheme
Figure 1 displays the overall scheme of the proposed method, which includes feature extraction, feature separation and instance reconstruction. In the first stage, namely feature extraction, the encoder captures low-dimensional features from the input data, which provides a good environment for feature separation in the next stage. In the second stage, i.e., feature separation, the sparse weighted least squares support vector machine separates abnormal and normal features in the low-dimensional feature space. In the third stage, namely instance reconstruction, the decoder reconstructs normal and abnormal instances from the separated normal and abnormal features. Finally, the learned class labels are output.
Feature extraction
Autoencoders show excellent ability in capturing low-dimensional features of high-dimensional data. For simplicity, ${\Re}^{h}\stackrel{\Omega}{\longrightarrow}{\Re}^{l}$ denotes that the autoencoder extracts low-dimensional features from high-dimensional data, where ℜ^{h} represents an h-dimensional space, so the data dimensionality in this space is h. Similarly, ℜ^{l} represents the corresponding low-dimensional feature space, whose dimensionality is l, with l < h. Ω represents our autoencoder, which consists of an input layer, an output layer and multiple hidden layers. The formal description of Ω is as follows.
The input layer In_{Ω} performs the mapping of the input data, i.e., the data in ℜ^{h} is mapped onto In_{Ω}.
Multiple hidden layers. The encoder and the decoder each contain hidden layers, so we describe them separately.
(i) Hidden layers in the encoder. The input and the output of the nth hidden layer in the ith iteration are denoted as ${E}_{n,i}^{\mathrm{in}}$ and ${E}_{n,i}^{\mathrm{out}}$, respectively, and are calculated by Eqs. (1) and (2):
(1) ${E}_{n,i}^{\mathrm{out}}={\nabla}_{n}^{e}\left({w}_{n}^{i}{E}_{n,i}^{\mathrm{in}}+{b}_{n}^{i}\right)$
(2) ${E}_{n,i}^{\mathrm{in}}={E}_{n-1,i}^{\mathrm{out}}$
(ii) Hidden layers in the decoder. Correspondingly, the input and the output of the mth hidden layer in the ith iteration are denoted as ${D}_{m,i}^{\mathrm{in}}$ and ${D}_{m,i}^{\mathrm{out}}$, respectively:
(3) ${D}_{m,i}^{\mathrm{out}}={\Delta}_{m}^{d}\left({w}_{m}^{i}{D}_{m,i}^{\mathrm{in}}+{b}_{m}^{i}\right)$
(4) ${D}_{m,i}^{\mathrm{in}}={D}_{m-1,i}^{\mathrm{out}}$
where ${\nabla}_{n}^{e}$ and ${\Delta}_{m}^{d}$ are the activation functions in the encoding and decoding hidden layers, respectively, and w and b are the weights and biases of the hidden layers.
The output layer out_{Ω} sends out the reconstructed instances. Ω is formally defined as follows:
(5) $x\underset{\mathrm{input}}{\to}{In}_{\Omega}\stackrel{{E}_{n,i}^{\mathrm{in}}\ldots {E}_{n,i}^{\mathrm{out}},\;{D}_{m,i}^{\mathrm{in}}\ldots {D}_{m,i}^{\mathrm{out}}}{\longrightarrow}{\mathrm{out}}_{\Omega}\underset{\mathrm{output}}{\to}z$
where x and z are the input and the reconstructed input, respectively. The loss function L_{Ω} of Ω is given in Eq. (6):
(6) ${L}_{\Omega}={\left\Vert x-z\right\Vert}^{2}$
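As a minimal NumPy sketch of Eqs. (1), (3) and (6), the following assumes a single hidden layer on each side, sigmoid activations and randomly initialised weights; all sizes and weights here are illustrative, not the trained model:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
h, l = 20, 4                       # high and low dimensionality (l < h)
x = rng.random(h)                  # one input instance in R^h

# Randomly initialised encoder/decoder weights (training would update these).
W_enc, b_enc = rng.normal(size=(l, h)) * 0.1, np.zeros(l)
W_dec, b_dec = rng.normal(size=(h, l)) * 0.1, np.zeros(h)

e_out = sigmoid(W_enc @ x + b_enc)   # Eq. (1): low-dimensional feature
z = sigmoid(W_dec @ e_out + b_dec)   # Eq. (3): reconstructed instance
loss = np.sum((x - z) ** 2)          # Eq. (6): squared reconstruction error
```

Training would minimize `loss` over the weights; here the point is only the shape of the mapping ℜ^h → ℜ^l → ℜ^h.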
Indeed, proper regularization can improve the ability of autoencoders to capture features (Lu et al., 2017); therefore, L_{Ω} is regularized by introducing a regularization term and the Jensen–Shannon (J–S) divergence JS_{sparse}:
(7) ${L}_{\Omega}={\left\Vert x-z\right\Vert}^{2}+{L}_{2}+J{S}_{\mathrm{sparse}}$
where L_{2} (Olshausen & Field, 1997) is the regularization term, which constrains the weights of Ω so that the components of w remain as balanced as possible. The JS_{sparse} divergence is a variant based on K–L divergence. Because the J–S divergence is symmetric, it avoids the asymmetry problem of K–L divergence (Cattai et al., 2021; Li et al., 2021). It is calculated as follows:
(8) $J{S}_{\mathrm{sparse}}\left({P}_{1}\Vert {P}_{2}\right)=\frac{1}{2}KL\left({P}_{1}\left\Vert \frac{{P}_{1}+{P}_{2}}{2}\right.\right)+\frac{1}{2}KL\left({P}_{2}\left\Vert \frac{{P}_{1}+{P}_{2}}{2}\right.\right),\quad KL\left(P\Vert Q\right)=\sum _{x\in X}P\left(x\right)\mathrm{log}\frac{P\left(x\right)}{Q\left(x\right)}$
where P_{1} represents the true distribution of the data. P_{2} is the theoretical distribution of the data or an approximate distribution of P_{1}.
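The symmetry property that motivates J–S over K–L can be checked numerically; the distributions below are arbitrary examples:

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum p(x) * log(p(x) / q(x)); assumes strictly positive entries
    return float(np.sum(p * np.log(p / q)))

def js(p1, p2):
    # J-S divergence: average of the two KL divergences against the mixture M
    m = 0.5 * (p1 + p2)
    return 0.5 * (kl(p1, m) + kl(p2, m))

p1 = np.array([0.1, 0.4, 0.5])
p2 = np.array([0.3, 0.3, 0.4])
```

Evaluating `js(p1, p2)` and `js(p2, p1)` gives identical values, while `kl(p1, p2)` and `kl(p2, p1)` differ, which is exactly the asymmetry problem the article mentions.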
Feature separation
The Support Vector Machine (SVM) is often used for classification tasks because of its excellent classification ability. Anomaly detection can be regarded as a binary classification into normal and abnormal classes. Based on this, we improved the structure of the SVM by incorporating a sparse weighted least squares scheme, denoted SWLS-SVM.
Given input samples $\left\{\left({\hat{x}}_{i},{\hat{y}}_{i}\right)\mid i=1,2,\ldots,n\right\}$, where ${\hat{y}}_{i}$ is the label of ${\hat{x}}_{i}$, SWLS-SVM is defined as follows:
(9) $\mathrm{min}\;\frac{1}{2}{\left\Vert w\right\Vert}^{2}+\frac{1}{2}\lambda \sum _{i=1}^{n}{\beta}_{i}{\xi}_{i}^{2},\quad \text{s.t.}\;{L}_{i}={w}^{T}\varphi \left({\hat{x}}_{i},{\hat{y}}_{i}\right)+b+{\xi}_{i}$
where w and b are the weight and bias, λ is a regularization parameter, β_{i} is a weight coefficient, ξ_{i} and L_{i} are an error term and the error function, respectively, and ϕ(•) is a nonlinear mapping function. The goal of SWLS-SVM is to minimize $\frac{1}{2}{\left\Vert w\right\Vert}^{2}+\frac{1}{2}\lambda \sum_{i=1}^{n}{\beta}_{i}{\xi}_{i}^{2}$.
SWLS-SVM employs a structural risk model, comprising the empirical risk $\frac{1}{2}\lambda \sum_{i=1}^{n}{\beta}_{i}{\xi}_{i}^{2}$ and the regularization term $\frac{1}{2}{\left\Vert w\right\Vert}^{2}$. If the regularization parameter λ is large, the empirical risk becomes more important, which easily leads to overfitting. If λ is small, the empirical risk becomes less important, so the effect of anomalies on the function L_{i} can be ignored.
Equation (9) is a convex quadratic programming problem, and its dual problem can be obtained using Lagrange multipliers. The constrained objective function in Eq. (9) can be transformed into an unconstrained Lagrangian function:
(10) $F\left(w,b,{\xi}_{i}\right)=\frac{1}{2}{\left\Vert w\right\Vert}^{2}+\frac{1}{2}\lambda \sum _{i=1}^{n}{\beta}_{i}{\xi}_{i}^{2}-\sum _{i=1}^{n}{\alpha}_{i}\left({w}^{T}\varphi \left({\hat{x}}_{i},{\hat{y}}_{i}\right)+b+{\xi}_{i}-{L}_{i}\right)$
where α_{i} > 0 is the Lagrange multiplier. To minimize F(w, b, ξ_{i}), we set the partial derivatives with respect to w, b, ξ_{i} and α_{i} to zero:
(11) $\frac{\partial F}{\partial w}=0\Rightarrow w=\sum _{i=1}^{n}{\alpha}_{i}\varphi \left({\hat{x}}_{i},{\hat{y}}_{i}\right),\quad \frac{\partial F}{\partial b}=0\Rightarrow \sum _{i=1}^{n}{\alpha}_{i}=0,\quad \frac{\partial F}{\partial {\xi}_{i}}=0\Rightarrow {\alpha}_{i}=\lambda {\beta}_{i}{\xi}_{i},\quad \frac{\partial F}{\partial {\alpha}_{i}}=0\Rightarrow {w}^{T}\varphi \left({\hat{x}}_{i},{\hat{y}}_{i}\right)+b+{\xi}_{i}-{L}_{i}=0$
The mapping function ϕ(•) in Eq. (11) can be converted into a kernel function κ(•) according to the KKT (Karush–Kuhn–Tucker) condition (Peng & Xu, 2013):
(12) $\kappa \left(\left({\hat{x}}_{i},{\hat{y}}_{i}\right),\left({\hat{x}}_{j},{\hat{y}}_{j}\right)\right)=\varphi {\left({\hat{x}}_{i},{\hat{y}}_{i}\right)}^{T}\varphi \left({\hat{x}}_{j},{\hat{y}}_{j}\right)$
Substituting Eq. (12) into Eq. (11) and eliminating the variables w and ξ_{i} yields the linear system in Eq. (13):
(13) $\left[\begin{array}{cccc}0 & 1 & \cdots & 1\\ 1 & \kappa \left(\left({\hat{x}}_{1},{\hat{y}}_{1}\right),\left({\hat{x}}_{1},{\hat{y}}_{1}\right)\right)+1/\left(\lambda {\beta}_{1}\right) & \cdots & \kappa \left(\left({\hat{x}}_{1},{\hat{y}}_{1}\right),\left({\hat{x}}_{n},{\hat{y}}_{n}\right)\right)\\ \vdots & \vdots & \ddots & \vdots \\ 1 & \kappa \left(\left({\hat{x}}_{n},{\hat{y}}_{n}\right),\left({\hat{x}}_{1},{\hat{y}}_{1}\right)\right) & \cdots & \kappa \left(\left({\hat{x}}_{n},{\hat{y}}_{n}\right),\left({\hat{x}}_{n},{\hat{y}}_{n}\right)\right)+1/\left(\lambda {\beta}_{n}\right)\end{array}\right]\left[\begin{array}{c}b\\ {\alpha}_{1}\\ \vdots \\ {\alpha}_{n}\end{array}\right]=\left[\begin{array}{c}0\\ {L}_{1}\\ \vdots \\ {L}_{n}\end{array}\right]$
By solving Eq. (13), the bias b and the Lagrange multipliers α_{i} can be obtained, giving the decision function
(14) $f\left(\hat{x},\hat{y}\right)=\sum _{i=1}^{n}{\alpha}_{i}\kappa \left(\left(\hat{x},\hat{y}\right),\left({\hat{x}}_{i},{\hat{y}}_{i}\right)\right)+b$
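The block system of Eq. (13) can be assembled and solved directly. The sketch below uses random toy data and a hypothetical RBF kernel as a stand-in (not the article's kernel or datasets), simply to show the structure of the system:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
X = rng.random((n, 2))                      # toy low-dimensional features
L = rng.choice([-1.0, 1.0], size=n)         # target values L_i
lam = 10.0                                  # regularization parameter lambda
beta = np.ones(n)                           # weight coefficients beta_i

# Hypothetical RBF kernel stand-in for illustration only.
def kernel(a, b):
    return np.exp(-np.sum((a - b) ** 2))

K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# Assemble Eq. (13): the first row/column handle the bias b, and the
# diagonal of the kernel block is shifted by 1 / (lambda * beta_i).
A = np.zeros((n + 1, n + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = K + np.diag(1.0 / (lam * beta))
rhs = np.concatenate(([0.0], L))

sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]
```

The first equation of the system enforces the constraint $\sum_i \alpha_i = 0$ from Eq. (11), which can be verified on the solution.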
Any semi-positive definite symmetric function can be used as a kernel function (Jayasumana et al., 2014); therefore, we obtain a positive definite function by calculating the cumulative distribution function (Jayasumana et al., 2013):
(15) $\kappa \left({x}_{1},{x}_{2}\right)=\left(1-{\left(1-{x}_{1}^{A}\right)}^{B}\right)\left(1-{\left(1-{x}_{2}^{A}\right)}^{B}\right)$
where A, B are the nonnegative kernel parameters.
Model
Model structure and training
The proposed AE-SWLS-SVM consists of an encoding module, a feature separation module and a decoding module, as shown in Fig. 2. The role of each module is described below.
Module 1, i.e., the encoding module. Its hidden layers capture the low-dimensional features of the input data; the error of capturing low-dimensional features is minimized via Eq. (7).
Module 2, i.e., the feature separation module, which separates normal and abnormal features. In the low-dimensional feature space, the kernel function in Eq. (15) completes the separation of normal and abnormal features.
Module 3, i.e., the decoding module. From the separated features, its hidden layers reconstruct normal and abnormal instances. Finally, the output layer emits the learned normal and abnormal class labels.
The objective function of AE-SWLS-SVM consists of the loss function L_{Ω} of the autoencoder in Eq. (7) and the error function $f\left(\hat{x},\hat{y}\right)$ of SWLS-SVM in Eq. (14):
(16) ${T}_{\mathrm{target}}=\mathrm{min}\left\{{L}_{\Omega}+f\left(\hat{x},\hat{y}\right)\right\}$
The performance of AE-SWLS-SVM depends on its parameters, so the model parameters need to be cross-validated in advance, as shown in Algorithm 1. First, the training set is used to train the model, and then the testing set is used to validate the trained model. By analyzing the testing results, the optimal kernel parameter value and the number of neurons are selected. Steps 2 to 18 perform cross-validation of the number of neurons to obtain the optimal number of neurons opt(δ), while Steps 3 to 14 perform cross-validation of the kernel parameter to obtain the optimal kernel parameter value opt(γ).
After obtaining the optimal parameters, we use the training set to train AE-SWLS-SVM, as shown in Algorithm 2. Steps 1 to 9 realize the training of the model. During training, the objective function is iteratively evaluated; when the model converges, we stop training and save the current training accuracy. In Steps 10 to 14, we select the maximum training accuracy, reached in the t_max-th training, as the final output accuracy of the model, and then save the model from the t_max-th training.
Algorithm 1. Cross-validation of parameters.
Input: iteration epoch T, numbers of neurons δ_1, δ_2, step constant Δδ, kernel parameter set γ_k, training set Train_set, testing set Test_set.
Output: optimal kernel parameter value opt(γ), optimal number of neurons opt(δ).
Begin
1   for t = 1 to T do:
2     for δ = δ_1 to δ_2 do:
3       for each γ in γ_k:
4         Train model SWLS-SVM(Train_set, δ, γ);
5         for i = 1 to I do:
6           Calculate the weight coefficient β_i;
7           Learn objective function T_target;
8           Calculate training accuracy Train_Acc = SWLS-SVM(Train_set, δ, γ);
9         end for
10        Test model SWLS-SVM(Test_set, δ, γ);
11        Calculate testing accuracy Test_Acc = SWLS-SVM(Test_set, δ, γ);
12      end for each
13      Select the γ that maximizes testing accuracy: γ(max) = argmax(Test_Acc(γ));
14      Obtain optimal kernel parameter opt(γ) = γ(max);
15      δ = δ + Δδ;
16    end for
17    Select the δ that maximizes testing accuracy: δ(max) = argmax(Test_Acc(δ, γ(max)));
18    Obtain the optimal number of neurons opt(δ) = δ(max);
19  end for
End
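The nested grid search of Algorithm 1 can be sketched as follows, where `train_and_score` is a hypothetical stand-in for training SWLS-SVM and returning testing accuracy, and the toy score surface at the bottom is purely illustrative:

```python
# Grid search over kernel parameter gamma and neuron count delta,
# mirroring the loop structure of Algorithm 1.
def cross_validate(gammas, delta_1, delta_2, delta_step, train_and_score):
    best = (None, None, -1.0)        # (opt_gamma, opt_delta, best accuracy)
    delta = delta_1
    while delta <= delta_2:          # outer loop over neuron counts (Steps 2-18)
        for gamma in gammas:         # inner loop over kernel parameters (Steps 3-14)
            acc = train_and_score(delta, gamma)
            if acc > best[2]:
                best = (gamma, delta, acc)
        delta += delta_step          # Step 15: delta = delta + step
    return best

# Toy score surface peaking at gamma = 0.5, delta = 50 (illustration only).
score = lambda d, g: 1.0 - abs(g - 0.5) - abs(d - 50) / 100.0
opt_gamma, opt_delta, _ = cross_validate(
    [0.1, 0.3, 0.5, 1, 2, 3, 5], 10, 110, 20, score)
```

With the search ranges of the article (γ_k = {0.1, …, 5}, δ from 10 to 110 in steps of 20), this toy surface recovers γ = 0.5 and δ = 50.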
Algorithm 2. Model training.
Input: iteration epoch T, opt(γ), opt(δ), training set Training_set.
Output: training accuracy Training_acc.
Begin
1   for t = 1 to T do:
2     Train model SWLS-SVM(Training_set, opt(γ), opt(δ));
3     for i = 1 to I do:
4       Calculate the weight coefficient β_i;
5       Learn objective function T_target;
6       Calculate training accuracy Training_acc(t) = SWLS-SVM(Training_set, opt(γ));
7       Save the tth training accuracy T_acc(save(t)) = Training_acc(t);
8     end for
9   end for
10  Traverse the saved training accuracies T_acc(save(t));
11  Select the maximum training accuracy in the t_max-th training:
12  t_max = argmax(T_acc(save(t)));
13  Save the model from the t_max-th training: Save = SWLS-SVM(Training_set, opt(γ), opt(δ), t_max);
14  Output the training accuracy of the t_max-th training: Training_acc = Training_acc(t_max);
End
Model parameters
To train AE-SWLS-SVM well, the parameters that significantly impact the training results are investigated.
(1) Kernel parameter. To induce the best performance of AE-SWLS-SVM, the optimal value of the kernel parameter is searched within a certain range.
(2) The weight coefficient β_{i} is calculated as follows:
(17) ${\beta}_{i}=\left\{\begin{array}{ll}1, & \left|{\xi}_{i}/\bar{s}\right|\ge 1\\ \left|{\xi}_{i}/\bar{s}\right|, & 0<\left|{\xi}_{i}/\bar{s}\right|<1\\ 1{0}^{-7}, & \left|{\xi}_{i}/\bar{s}\right|=0\end{array}\right.$
where $\bar{s}$ is the standard deviation of ${\xi}_{i}$, and $\left|{\xi}_{i}/\bar{s}\right|$ indicates the absolute value of the error rate.
(3) Activation function. Activation functions commonly used in machine learning include Sigmoid, tanh, ELU and ReLU. The output of Sigmoid lies between 0 and 1, which suits the judgment of normal versus abnormal instances. Therefore, Sigmoid is used as the activation function of the proposed AE-SWLS-SVM.
(4) Optimizer and learning rate. Adam is used as the optimizer for AE-SWLS-SVM. Adam not only possesses AdaGrad's ability to handle sparse gradients, but also RMSProp's ability to handle non-stationary targets (Kingma & Lei Ba, 2015). Moreover, Adam can provide different adaptive learning rates for different hyperparameters.
(5) Number of neurons. The number of neurons is determined by cross-validation. Given the dimensionality and volume of the input data, configuring the number of neurons within a certain range can improve the model's resistance to overfitting.
(6) Iteration epoch. During the training of AE-SWLS-SVM, we observe the change in training accuracy and dynamically adjust the iteration epoch until the model converges, and then stop training.
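The weight-coefficient rule of Eq. (17) in item (2) above can be sketched directly; the error vector below is an arbitrary example:

```python
import numpy as np

def weight_coefficients(xi):
    # Eq. (17): beta_i from the error ratio |xi_i / s|, where s is the
    # standard deviation of the errors xi_i.
    s = np.std(xi)
    r = np.abs(xi / s)
    beta = np.where(r >= 1.0, 1.0, r)       # cap the weight at 1 for large errors
    beta = np.where(r == 0.0, 1e-7, beta)   # tiny weight when the error is zero
    return beta

xi = np.array([0.0, 0.5, 1.0, -2.0])        # toy error terms xi_i
beta = weight_coefficients(xi)
```

Instances with large errors relative to the spread keep full weight 1, intermediate errors get fractional weights, and zero-error instances receive a negligible weight of 10^{-7}.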
Experiments
Datasets
In practical applications, real high-dimensional anomaly datasets are difficult to obtain. Therefore, we selected real high-dimensional datasets that are often used for classification, and then used the method of Campos et al. (2016) to convert these high-dimensional datasets into anomaly detection datasets. We considered seven high-dimensional classification datasets (U1–U7, each with more than 165 dimensions) to test the anomaly detection ability of the model. We then randomly selected five of the high-dimensional datasets U1–U7 as the training set, and this selection process was repeated five times independently. In addition, three benchmark datasets (B1, B2, B3) were selected; after the model is trained, B1, B2 and B3 are used for parametric cross-validation and model structure validation. Table 1 gives a detailed description of these ten datasets (seven high-dimensional datasets and three benchmark datasets).
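One common way to convert a classification dataset into an anomaly detection dataset, in the spirit of Campos et al. (2016), is to keep one or more majority classes as "normal" and downsample a chosen class to a small set of "anomalies". The function below is a hypothetical sketch of that idea, not the exact procedure of the cited paper:

```python
import numpy as np

def to_anomaly_dataset(X, y, anomaly_class, n_anomalies, seed=0):
    # Keep all instances of the other classes as normal (label 0) and
    # downsample `anomaly_class` to `n_anomalies` anomalies (label 1).
    rng = np.random.default_rng(seed)
    normal = np.flatnonzero(y != anomaly_class)
    anomaly = rng.choice(np.flatnonzero(y == anomaly_class),
                         size=n_anomalies, replace=False)
    keep = np.concatenate([normal, anomaly])
    labels = np.where(y[keep] == anomaly_class, 1, 0)
    return X[keep], labels

# Toy data: 90 instances of class 0, 10 of class 1; keep only 3 anomalies.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)
Xa, ya = to_anomaly_dataset(X, y, anomaly_class=1, n_anomalies=3)
```

This produces the low anomaly ratios (well under 3% for most datasets) reported in Table 1.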
Assessment metrics
Accuracy and F1-score are used as evaluation indicators, calculated as follows:
(18) $\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}$
(19) $\text{F1-score}=\frac{2\mathrm{TP}}{2\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$
where TP represents the number of correctly predicted anomalous instances. TN represents the number of correctly predicted normal instances. FP represents the number of normal instances that are predicted to be anomalous instances. FN represents the number of anomalous instances that are predicted to be normal instances.
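Equations (18) and (19) translate directly into code; the confusion counts below are invented for illustration:

```python
def accuracy_f1(tp, tn, fp, fn):
    # Eq. (18): overall fraction of correct predictions
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Eq. (19): harmonic balance of precision and recall on the anomaly class
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

# Toy confusion counts: 8 anomalies caught, 7 missed, 5 false alarms.
acc, f1 = accuracy_f1(tp=8, tn=980, fp=5, fn=7)
```

Note that with rare anomalies, Accuracy is dominated by the many true negatives, while F1-score ignores TN and therefore reflects anomaly detection quality more sharply.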
Table 1: Description of the benchmark datasets and high-dimensional datasets. The two data volume columns give the numbers of normal and anomalous instances.

#  | Benchmark dataset | Description (normal vs. anomaly) | Volume (normal) | Volume (anomaly) | Anomaly ratio | Data dimension
B1 | Shuttle | Class '1' vs. others | 1,000 | 13 | 1.28% | 9
B2 | PenDigits | Others vs. class '4' | 9,868 | 20 | 0.20% | 16
B3 | Waveform | Others vs. class '0' | 3,343 | 100 | 2.9% | 21

#  | High-dimensional dataset | Description (normal vs. outliers) | Volume (normal) | Volume (anomaly) | Anomaly ratio | Data dimension
U1 | Arcene | Normal patterns vs. cancer | 8,459,427 | 540,573 | 6.01% | 10,000
U2 | Gisette | Zero vs. nonzero values | 28,278,760 | 4,221,240 | 12.99% | 5,000
U3 | Micro Mass | Zero vs. nonzero values | 464,819 | 3,181 | 0.68% | 1,300
U4 | Malware | Zero vs. nonzero values | 2,894,954 | 37,772 | 1.29% | 1,087
U5 | CNAE | Zero vs. nonzero values | 918,537 | 7,023 | 0.76% | 857
U6 | Epileptic Seizure | Seizure vs. non-seizure | 11,321 | 179 | 1.56% | 179
U7 | Musk | Musk vs. non-musk | 79,699 | 269 | 0.34% | 168
Comparison methods
The proposed method is a hybrid method based on deep network architectures, so deep hybrid-based methods, i.e., DNN-SVM (Inoue et al., 2017) and DAE-KNN (Song et al., 2017), and deep network-based methods, i.e., GAN (Li et al., 2019), are used for comparison. In addition, a distance metric-based method, i.e., KNN (Chehreghani, 2016), is also considered.
As for the proposed method, the search range of the kernel parameter is defined as γ_k = {0.1, 0.3, 0.5, 1, 2, 3, 5}, and the number of neurons δ ∈ [δ_1, δ_2], with δ_1 = 10, δ_2 = 110 and step Δδ = 20. For a fair comparison, each comparison method adopts the parameters reported in the corresponding literature.
We implemented the corresponding algorithms in Python 3.8 with TensorFlow 2.0 on a Linux system. The hardware environment is an Intel i7 3.0 GHz CPU with 32 GB of memory. All algorithms run on the same GPU and adopt the same configuration.
Results and Discussion
Parameter testing of the model
AE-SWLS-SVM contains multiple hidden layers, and the number of hidden layers is tested over the range {1, 2, 3, 5, 7, 10, 20, 30}. The seven high-dimensional datasets U1–U7 are used as experimental datasets: we randomly selected five of them as the training set to train AE-SWLS-SVM, and this selection process was repeated five times independently. After AE-SWLS-SVM is trained, the benchmark dataset B1 is used as the testing set to test the number of hidden layers.
Testing results show that when the number of hidden layers reaches 3, the detection performance (Accuracy and F1-score) of AE-SWLS-SVM tends to be stable, as shown in Fig. 3A. This indicates that the proposed model is stable on the cases considered.
With the number of hidden layers fixed at 3, the parameters are then tested on benchmark datasets B2 and B3, respectively, as shown in Figs. 3B and 3C. Results show that the detection performance of AE-SWLS-SVM is best when the kernel parameter γ equals 0.5 and the number of neurons δ equals 50. Therefore, in the subsequent experiments, AE-SWLS-SVM is configured with 3 hidden layers, γ = 0.5 and δ = 50.
Performance comparison
Results on datasets U1–U7 show that AE-SWLS-SVM outperforms the competitors DNN-SVM, DAE-KNN, GAN and KNN in anomaly detection performance on most datasets, as shown in Fig. 4. On higher-dimensional datasets, such as U1 (dimension = 10,000) and U2 (dimension = 5,000), the anomaly detection advantages of AE-SWLS-SVM are very significant, whereas the competitor KNN almost fails on these two datasets. This is because the contrast of distances between data points weakens as dimensionality increases in high-dimensional spaces, so distance-based measurements can hardly capture the attribute similarity of high-dimensional data, and KNN relies precisely on such distance measurements. Although the four competitors obtain better detection results than AE-SWLS-SVM on datasets U4 and U5, their advantages are not significant. Overall, AE-SWLS-SVM shows more advantages for anomaly detection on high-dimensional data.
Discussion
Insights
Compared with the above competitors, our model has outstanding advantages in terms of anomaly detection for the following reasons.
The autoencoder can capture low-dimensional layered features from the input data, which is crucial because they provide a sufficient condition for separating anomalous features from normal features. The loss function of the autoencoder in Eq. (7) minimizes the error of the extracted low-dimensional layered features. This provides a good environment for the kernel in Eq. (15) to separate anomalous features from normal ones. By iteratively learning the objective function in Eq. (16), our model achieves good detection accuracy for anomalies.
Among existing anomaly detection methods: (i) distance metric-based detection methods, such as KNN (Chehreghani, 2016), work well in low-dimensional spaces, but their detection capabilities are restricted because distance contrasts between data points in high-dimensional spaces become similar; (ii) deep learning-based detection methods, e.g., GAN (Li et al., 2019), are suitable for complex high-dimensional spaces because their nonlinear layers extract important features or learn useful representations, although Generative Adversarial Networks (GANs) suffer from unavoidable mode collapse, so their training is not easy; (iii) deep hybrid-based detection methods, such as DNN-SVM (Inoue et al., 2017) and DAE-KNN (Song et al., 2017), are increasingly popular since they inherit the advantages of both deep network architectures and traditional detection methods. However, deep hybrid detection methods also show poor detection capabilities when their traditional component depends on the data distribution or overfits easily; for example, DBN-Random Forest (Kam Ho, 1995) shows poor noise resistance and a high risk of overfitting, because random forests easily overfit on samples with relatively large noise (Zheng & Zhao, 2020; Popolin Neto & Paulovich, 2021).
Limitations
The detection performance of the proposed method relies on the extracted features, which means the quality of the extracted features has an important effect on the ability of the method. Additionally, due to the lack of real anomaly datasets, the detection accuracies of most anomaly detection methods are restricted, so it is difficult to truly reflect their detection capabilities.
Conclusion
This article proposes a hybrid method combining an autoencoder and a sparse weighted least squares support vector machine for anomaly detection on high-dimensional data. The key idea is that the autoencoder extracts low-dimensional layered features from high-dimensional data, reducing the dimensionality of the data and the complexity of the search space. In the low-dimensional feature space, the sparse weighted least squares support vector machine separates anomalous features from normal features. Finally, the class labels used to distinguish normal and abnormal instances are output, thereby completing anomaly detection on high-dimensional data. Results show that the proposed method is superior to its competitors in anomaly detection ability on high-dimensional data. In future work, we will address anomaly detection under noise interference, since noise can mask rare anomalies so that anomalies are likely to be mistaken for noise.