NFAD: fixing anomaly detection using normalizing flows
 Academic Editor
 Donghyun Kim
 Subject Areas
 Artificial Intelligence, Computer Vision, Data Mining and Machine Learning, Data Science
 Keywords
 Anomaly detection, Deep learning, Semi-supervised learning, Normalizing flows
 Copyright
 © 2021 Ryzhikov et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article
 2021. NFAD: fixing anomaly detection using normalizing flows. PeerJ Computer Science 7:e757 https://doi.org/10.7717/peerj-cs.757
Abstract
Anomaly detection is a challenging task that frequently arises in practically all areas of industry and science, from fraud detection and data quality monitoring to finding rare cases of diseases and searching for new physics. Most of the conventional approaches to anomaly detection, such as one-class SVM and Robust AutoEncoder, are one-class classification methods, i.e., they focus on separating normal data from the rest of the space. Such methods are based on the assumption of separability of normal and anomalous classes, and consequently do not take into account any available samples of anomalies. Nonetheless, in practical settings, some anomalous samples are often available, although usually in amounts far lower than required for a balanced classification task, and the separability assumption might not always hold. This leads to an important task: incorporating known anomalous samples into training procedures of anomaly detection models. In this work, we propose a novel model-agnostic training procedure to address this task. We reformulate one-class classification as a binary classification problem with normal data being distinguished from pseudo-anomalous samples. The pseudo-anomalous samples are drawn from low-density regions of a normalizing flow model by feeding tails of the latent distribution into the model. Such an approach makes it easy to include known anomalies in the training process of an arbitrary classifier. We demonstrate that our approach shows comparable performance on one-class problems, and, most importantly, achieves comparable or superior results on tasks with variable amounts of known anomalies.
Introduction
The anomaly detection (AD) problem is one of the important tasks in the analysis of real-world data. Possible applications range from data-quality certification (for example, Borisyak et al., 2017) to finding rare cases of diseases in medicine (Spence, Parra & Sajda, 2001). The technique can also be used in credit card fraud detection (Aleskerov, Freisleben & Rao, 1997), complex systems failure prediction (Xu & Li, 2013), and novelty detection in time series data (Schmidt & Simic, 2019).
Formally, AD is a classification problem with a representative set of normal samples and a small, non-representative or empty set of anomalous examples. Such a setting causes conventional binary classification methods to overfit and to lose robustness w.r.t. novel anomalies (Görnitz et al., 2012). In contrast, conventional one-class classification (OC) methods (Breunig et al., 2000; Liu, Ting & Zhou, 2012) are typically robust against all types of outliers. However, OC methods do not take into account known anomalies, which often results in suboptimal performance in cases when normal and anomalous classes are not perfectly separable (Campos et al., 2016; Pang, Shen & Van den Hengel, 2019). Research in the area addresses several challenges (Pang et al., 2021): increasing precision, generalizing to unknown anomaly classes, and tackling multi-dimensional data. Several reviews of classical (Zimek, Schubert & Kriegel, 2012; Aggarwal, 2016; Boukerche, Zheng & Alfandi, 2020; Belhadi et al., 2020) and deep-learning methods (Pang et al., 2021) describe the literature in detail. With the advancement of neural generative modeling, methods based on generative adversarial networks (Schlegl et al., 2017), variational autoencoders (Xu et al., 2018), and normalizing flows (Pathak, 2019) have been introduced for the AD task.
We propose addressing the class-imbalanced classification task by modifying the learning procedure in a way that effectively makes anomaly detection methods suitable for two-class classification. Our approach relies on augmenting the imbalanced dataset with surrogate anomalies sampled from normalizing flow-based generative models.
Problem statement
Classical AD methods consider anomalies a priori significantly different from the normal samples (Aggarwal, 2016). In practice, while such samples are, indeed, most likely to be anomalous, often some anomalies might not be distinguishable from normal samples (Hunziker et al., 2017; Pol et al., 2019; Borisyak et al., 2017). This provides a strong motivation to include known anomalous samples in the training procedure to improve the performance of the model on these ambiguous samples. Technically, this leads to a binary classification problem, which is typically solved by minimizing the cross-entropy loss function L_{BCE}: (1) $f^{\ast}(x) = \mathrm{argmin}_{f}\, L_{BCE}(f);$ (2) $L_{BCE}(f) = -P(C^{+})\,\mathbb{E}_{x\sim C^{+}} \log f(x) - P(C^{-})\,\mathbb{E}_{x\sim C^{-}} \log\left(1 - f(x)\right);$ where f is an arbitrary model (e.g., a neural network), and C^{+} and C^{−} denote the normal and anomalous classes. In this case, the solution f^{∗} approaches the optimal Bayesian classifier: (3) $f^{\ast}(x) = P(C^{+} \mid x) = \frac{p(x \mid C^{+})\,P(C^{+})}{p(x \mid C^{+})\,P(C^{+}) + p(x \mid C^{-})\,P(C^{-})}.$
Notice that f^{∗} implicitly relies on the estimation of the probability densities p(x ∣ C^{+}) and p(x ∣ C^{−}). A good estimation of these densities is possible only when a sufficiently large and representative sample is available for each class. In practical settings, this assumption certainly holds for the normal class. However, the anomalous dataset is rarely large or representative, often consisting of only a few samples or covering only a portion of all possible anomaly types. With only a small number of examples (or a non-representative sample) to estimate the second term of Eq. (2), L_{BCE} effectively does not depend on f(x) for x ∈ supp C^{−} ∖ supp C^{+}, which leads to solutions with arbitrary predictions in that region, i.e., to classifiers that are not robust to novel anomalies.
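To make Eq. (3) concrete, the following minimal sketch evaluates the Bayes-optimal classifier for a toy problem in which both class-conditional densities are known exactly. The two 1-D Gaussians, their parameters, and the prior of 0.99 are illustrative assumptions and are not taken from the experiments in this paper.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def bayes_posterior(x, prior_pos=0.99):
    """P(C+ | x) as in Eq. (3) for two hypothetical 1-D Gaussian classes."""
    p_pos = gaussian_pdf(x, mean=0.0, std=1.0)   # p(x | C+), assumed for illustration
    p_neg = gaussian_pdf(x, mean=4.0, std=1.0)   # p(x | C-), assumed for illustration
    numerator = p_pos * prior_pos
    return numerator / (numerator + p_neg * (1.0 - prior_pos))

# Posterior at the normal mode, halfway between the modes, and at the anomalous mode.
posterior = bayes_posterior(np.array([0.0, 2.0, 4.0]))
```

At the midpoint the two densities coincide, so the posterior reduces to the prior; this is exactly the regime in which a poor estimate of p(x ∣ C^{−}) distorts the classifier.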
One-class classifiers avoid this problem by aiming to explicitly separate the normal class from the rest of the space (Liu, Ting & Zhou, 2008; Scholkopf & Smola, 2018). As discussed above, this approach, however, ignores available anomalous samples, potentially leading to incorrect predictions on ambiguous samples.
Recently, semi-supervised AD algorithms such as the 1 + ɛ-classification method (Borisyak et al., 2020), the Deep Semi-supervised AD method (Ruff et al., 2019), Feature Encoding with AutoEncoders for Weakly-supervised Anomaly Detection (Zhou et al., 2021) and Deep Weakly-supervised Anomaly Detection (Pang et al., 2019) have been put forward. They aim to combine the main properties of both unsupervised (one-class) and supervised (binary classification) approaches: the proper posterior probability estimates of binary classification and the robustness against novel anomalies of one-class classification.
In this work, we propose a method that extends the 1 + ɛ-classification scheme (Borisyak et al., 2020) by exploiting normalizing flows. The method samples surrogate anomalies from a normalizing flow to augment the existing dataset of anomalies.
Normalizing flows
The normalizing flows (Rezende & Mohamed, 2015b) generative model aims to fit the exact probability distribution of data. It consists of a set of invertible transformations {f_{i}(⋅; θ_{i})} with parameters θ_{i} that define a bijection between the given distribution of training samples and some domain distribution with a known probability density function (PDF). However, in the case of a non-trivial bijection z_{0} ↔ z_{k}, the distribution density at the final point z_{k} (training sample) differs from the density at point z_{0} (domain). This is because each non-trivial transformation f_{i}(⋅; θ_{i}) changes the infinitesimal volume at some points. Thus, the task is not only to find a flow of invertible transformations {f_{i}(⋅; θ_{i})}, but also to know how the distribution density changes at each point after each transformation f_{i}(⋅; θ_{i}).
Consider the multivariate transformation of variable z_{i} = f_{i}(z_{i−1}; θ_{i}) with parameters θ_{i} for i > 0. Then the Jacobian of the transformation f_{i}(z_{i−1}; θ_{i}) at a given point z_{i−1} has the following form: (4) $J(\mathit{f}_{i} \mid \mathit{z}_{i-1}) = \left[\begin{array}{ccc} \frac{\partial \mathit{f}_{i}}{\partial z_{i-1}^{1}} & \dots & \frac{\partial \mathit{f}_{i}}{\partial z_{i-1}^{n}} \end{array}\right] = \left[\begin{array}{ccc} \frac{\partial f_{i}^{1}}{\partial z_{i-1}^{1}} & \dots & \frac{\partial f_{i}^{1}}{\partial z_{i-1}^{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_{i}^{m}}{\partial z_{i-1}^{1}} & \dots & \frac{\partial f_{i}^{m}}{\partial z_{i-1}^{n}} \end{array}\right]$
Then the distribution density at point z_{i} after the transformation f_{i} of point z_{i−1} can be written in the following way: (5) $p(\mathit{z}_{i}) = \frac{p(\mathit{z}_{i-1})}{\left|\det J(\mathit{f}_{i} \mid \mathit{z}_{i-1})\right|},$ where det J(f_{i} ∣ z_{i−1}) is the determinant of the Jacobian matrix J(f_{i} ∣ z_{i−1}) (Rezende & Mohamed, 2015).
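As an illustration of Eq. (5), the sketch below applies the change-of-variables rule to a 1-D affine transformation, for which the resulting density is known in closed form. The scale and shift values are arbitrary assumptions chosen only for the check.

```python
import numpy as np

def standard_normal_pdf(z):
    """Density of the 1-D domain distribution N(0, 1)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def affine_flow_density(x, scale, shift):
    """Density of x = scale * z + shift with z ~ N(0, 1), computed via Eq. (5)."""
    z = (x - shift) / scale      # invert the transformation
    jacobian = scale             # dx/dz for a 1-D affine map
    return standard_normal_pdf(z) / np.abs(jacobian)

x = np.linspace(-3.0, 7.0, 5)
flow_density = affine_flow_density(x, scale=2.0, shift=2.0)

# Reference: in this special case x is exactly N(shift, scale^2).
reference = np.exp(-0.5 * ((x - 2.0) / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))
```

The two arrays agree, which is the 1-D version of dividing the domain density by the absolute Jacobian determinant.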
Thus, given a flow of invertible transformations $\mathit{f} = \left\{\mathit{f}_{i}(\cdot;\theta_{i})\right\}_{i=1}^{N}$ with known $\left\{\det J(\mathit{f}_{i} \mid \cdot)\right\}_{i=1}^{N}$ and a domain distribution of z_{0} with known p.d.f. p(z_{0}), we obtain the likelihood p(x) for each object x = z_{N}. This way, the parameters $\left\{\theta_{i}\right\}_{i=1}^{N}$ of the NF model f can be fitted by explicitly maximizing the likelihood p(x) for training objects x ∈ X. In practice, a Monte Carlo estimate of log p(X) = log Π_{x∈X} p(x) = Σ_{x∈X} log p(x) is optimized, which is an equivalent optimization procedure. Also, the likelihood p(X) can be used as a metric of how well the NF model f fits the given data X.
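The accumulation of log-determinants along a flow can be sketched with a toy two-step 1-D flow. The particular transformations (an affine map followed by arcsinh) and their parameters are assumptions made for illustration; they are not the flows used in this work.

```python
import numpy as np

def log_standard_normal(z):
    """Log-density of the 1-D domain distribution N(0, 1)."""
    return -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)

def flow_log_prob(x, scale=1.5, shift=0.5):
    """log p(x) for the toy flow x = arcsinh(scale * z0 + shift), z0 ~ N(0, 1)."""
    z1 = np.sinh(x)                          # invert f2(z1) = arcsinh(z1)
    z0 = (z1 - shift) / scale                # invert f1(z0) = scale * z0 + shift
    log_det_f1 = np.log(np.abs(scale))       # log |d f1 / d z0|
    log_det_f2 = -0.5 * np.log1p(z1 ** 2)    # log |d f2 / d z1| = log(1 / sqrt(1 + z1^2))
    # Apply Eq. (5) once per transformation: subtract each log-determinant.
    return log_standard_normal(z0) - log_det_f1 - log_det_f2

# Sanity check: the resulting density should integrate to one.
grid = np.linspace(-8.0, 8.0, 20001)
total_mass = np.sum(np.exp(flow_log_prob(grid))) * (grid[1] - grid[0])
```

Summing `flow_log_prob` over a training set gives the log-likelihood log p(X) that is maximized when fitting the flow parameters.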
The main bottleneck of this scheme is the computation of det J(⋅ ∣ ⋅), which is O(n^{3}) in the general case (n is the dimension of variable z). To deal with this problem, normalizing flows with specific families of transformations f are used, for which the Jacobian determinant is much faster to compute (Rezende & Mohamed, 2015; Papamakarios, Pavlakou & Murray, 2017; Kingma et al., 2016; Chen et al., 2019).
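One common way to obtain a cheap determinant is a triangular Jacobian, as produced by autoregressive transformations. The sketch below compares the general-purpose computation with the O(n) shortcut on a randomly generated lower-triangular matrix; the matrix itself is a hypothetical stand-in for a real flow Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical Jacobian of an autoregressive transformation:
# lower triangular with a strictly positive diagonal.
jacobian = np.tril(rng.normal(size=(n, n)))
jacobian[np.diag_indices(n)] = np.exp(rng.normal(size=n))

# General case: factorization-based, O(n^3).
_, log_det_general = np.linalg.slogdet(jacobian)

# Triangular case: the determinant is the product of the diagonal, O(n).
log_det_triangular = np.sum(np.log(np.abs(np.diag(jacobian))))
```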
Algorithm
The suggested NF-based AD method (NFAD) is a two-step procedure. In the first step, we train a normalizing flow on normal samples so that it can later be used to sample new surrogate anomalies. Here, we assume that anomalies differ from normal samples, and that their likelihood p_{NF}(x^{−} ∣ C^{+}) is lower than the likelihood of normal samples p_{NF}(x^{+} ∣ C^{+}). In the second step, we sample new surrogate anomalies from the tails of the normal samples distribution using the NF and train an arbitrary binary classifier on normal samples and a mixture of real and sampled surrogate anomalies.
Step 1. Training normalizing flow
We train the normalizing flow on normal samples using the standard scheme for normalizing flows, maximization of the log-likelihood (see ‘Normalizing flows’): (6) $\underset{\theta}{\max}\, L_{NF}$ (7) $L_{NF} = \mathbb{E}_{x\sim C^{+}} \log p_{f}(x)$ (8) $= \mathbb{E}_{z\sim f^{-1}(C^{+};\theta)} \left[\log p(z) - \log\left|\det J(f \mid z)\right|\right],$ where f(⋅; θ) is the NF transformation with parameters θ, J(f ∣ z) is the Jacobian of the transformation f(z; θ) at point z, z are samples from the multivariate standard normal domain distribution $p(z) = \mathcal{N}(z \mid 0, I)$, x are normal samples from the training dataset, and $p_{f}(x) = \left.\frac{p(z)}{\left|\det J(f \mid z)\right|}\right|_{z = f^{-1}(x;\theta)}$.
After the NF is trained, it can be used to sample new anomalies. To produce new anomalies, we sample z from the tails of the normal domain distribution, where the p-value of the tails is a hyperparameter (see Fig. 1).
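A minimal sketch of tail sampling in the latent space is given below. It assumes that the "tail" of mass p is the set of lowest-density points of the standard normal distribution, and it uses simple rejection from a pool of proposals; the function name `sample_gaussian_tail` and the proposal count are hypothetical choices, not part of the published implementation.

```python
import numpy as np

def sample_gaussian_tail(n_samples, dim, tail_p, rng, n_proposals=200_000):
    """Draw z ~ N(0, I) restricted to the lowest-density fraction `tail_p`."""
    proposals = rng.standard_normal(size=(n_proposals, dim))
    sq_norm = np.sum(proposals ** 2, axis=1)
    # For N(0, I) the density decreases with ||z||^2, so the tail of mass
    # `tail_p` is where ||z||^2 exceeds its empirical (1 - tail_p) quantile.
    threshold = np.quantile(sq_norm, 1.0 - tail_p)
    tail = proposals[sq_norm >= threshold]
    index = rng.choice(len(tail), size=n_samples, replace=True)
    return tail[index], threshold

rng = np.random.default_rng(0)
z_tail, threshold = sample_gaussian_tail(1000, dim=2, tail_p=0.05, rng=rng)
```

Passing `z_tail` through the trained flow, x = f(z; θ), then yields surrogate anomalies in data space.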
Here, we assume that test-time anomalies are either represented in the given anomalous training set or are novelties w.r.t. the normal class. In other words, p(x ∣ C^{+}) of novelties x must be relatively small. Nevertheless, p(x) obtained by the NF might be drastically different from the corresponding domain point likelihood p(z) because of the non-unit Jacobian of the NF transformations, Eq. (8). Such distribution density distortion is illustrated in Fig. 2 and makes the proposed anomaly sampling scheme incomplete. Because of this distortion, some points in the tails of the domain can correspond to normal samples, and some points in the body of the domain distribution can correspond to anomalies. To fix this, we propose Jacobian regularization of normalizing flows (Fig. 2) by introducing an extra regularization term. It penalizes the model for a non-unit Jacobian: (9) $L_{J} = \mathbb{E}_{z\sim \mathcal{N}(0,I)} \log\left(\left|\det J(f \mid z)\right|\right)^{2}$ (10) $\underset{\theta}{\max}\left[L_{NF} - \lambda L_{J}\right],\ \lambda \ge 0,$ where λ denotes the regularization hyperparameter. We estimate the regularization term L_{J} in Eq. (9) by direct sampling of z from the domain distribution $\mathcal{N}(0, I)$ to cover the whole sampling space. The theorem below proves that any level of expected distortion can be obtained with such a regularization:
Theorem 4.1. Let Ω ⊂ ℝ^{d} be a sample space with probability (domain) distribution $\mathcal{D}$, C^{+} ⊂ Ω a class of normal samples, and f(⋅; θ): ℝ^{d} → ℝ^{d} a set of invertible transformations parametrized by θ such that ∃θ_{0}: f(⋅; θ_{0}) = I (the identity transformation exists). Then ∀ɛ > 0 ∃λ ≥ 0 such that $\left[\mathbb{E}_{z\sim \mathcal{D}} \log\left(\left|\det J(f \mid z)\right|\right)^{2}\right]_{\theta^{\ast}} < \varepsilon$ for z ∈ Ω ⊂ ℝ^{d}, where $\theta^{\ast} \in \mathrm{argmin}_{\theta}\left[-\mathbb{E}_{x\sim C^{+}} \log p_{f}(x) + \lambda\, \mathbb{E}_{z\sim \mathcal{D}} \log\left(\left|\det J(f \mid z)\right|\right)^{2}\right]$ and $p_{f}(x) = \left.\frac{p(z)}{\left|\det J(f \mid z)\right|}\right|_{z = f^{-1}(x;\theta)}$.
Proof. Suppose the opposite: ∃ɛ > 0 s.t. $\forall \lambda \ge 0: \left[\mathbb{E}_{z\sim \mathcal{D}} \log\left(\left|\det J(f \mid z)\right|\right)^{2}\right]_{\theta^{\ast}} \ge \varepsilon$ for all $\theta^{\ast} \in \mathrm{argmin}_{\theta}\left[-\mathbb{E}_{x\sim C^{+}} \log p_{f}(x) + \lambda\, \mathbb{E}_{z\sim \mathcal{D}} \log\left(\left|\det J(f \mid z)\right|\right)^{2}\right]$.
Since ∃θ_{0}: f(⋅; θ_{0}) = I, we have $p_{f}(f(z;\theta_{0})) = p(z)\ \forall z \in \Omega$, and the term $\left[\mathbb{E}_{z\sim \mathcal{D}} \log\left(\left|\det J(f \mid z)\right|\right)^{2}\right]_{\theta_{0}} = 0$,
since $\frac{p(z)}{p_{f}(f(z;\theta_{0}))} = \left|\det J(f \mid z)\right|_{\theta_{0}} = 1\ \forall z \in \Omega$.
Let $\left[-\mathbb{E}_{x\sim C^{+}} \log p_{f}(x)\right]_{\theta_{0}} = -\mathbb{E}_{z\sim C^{+}} \log p(z) = c_{0}$ and $\min_{\theta}\left[-\mathbb{E}_{x\sim C^{+}} \log p_{f}(x)\right] = c_{min} < c_{0}$ (the minimum exists since the negative log-likelihood is lower bounded by 0). Because θ^{∗} minimizes the regularized objective, whose value at θ_{0} is c_{0}, we have ∀λ:
$c_{0} \ge c_{min} + \lambda \left[\mathbb{E}_{z\sim \mathcal{D}} \log\left(\left|\det J(f \mid z)\right|\right)^{2}\right]_{\theta^{\ast}} \ge c_{min} + \lambda \varepsilon.$ But $\lambda > \frac{c_{0} - c_{min}}{\varepsilon}$ leads to a contradiction. □
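The regularization term of Eq. (9) can be estimated by Monte Carlo for any flow with a tractable log-determinant. The sketch below does so for a toy element-wise flow f(z) = z + a·tanh(z), which contains the identity at a = 0 as required by the theorem. Both the toy flow and the reading of Eq. (9) as the expected squared log-determinant are assumptions made for illustration.

```python
import numpy as np

def log_abs_det_jacobian(z, a):
    """log |det J(f | z)| for the toy flow f(z) = z + a * tanh(z), a > -1."""
    diag = 1.0 + a * (1.0 - np.tanh(z) ** 2)   # element-wise derivative of f
    return np.sum(np.log(diag), axis=1)

def jacobian_penalty(a, dim=2, n_samples=10_000, seed=0):
    """Monte Carlo estimate of L_J, read here as E[(log |det J|)^2]."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(size=(n_samples, dim))   # z ~ N(0, I)
    return np.mean(log_abs_det_jacobian(z, a) ** 2)

penalties = {a: jacobian_penalty(a) for a in (0.0, 0.5, 2.0)}
```

The penalty vanishes for the identity transformation and grows as the flow distorts volumes more strongly, which is the behaviour the weight λ in Eq. (10) trades off against the likelihood.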
In this work, we use Neural Spline Flows (NSF, Durkan et al., 2019) and Inverse Autoregressive Flows (IAF, Kingma et al., 2016) for sampling tabular anomalies. We also use Residual Flow (ResFlow, Chen et al., 2019) for sampling anomalies on image datasets. All the flows satisfy the conditions of Theorem 4.1. The proposed algorithms are called ‘nfad-nsf’, ‘nfad-iaf’ and ‘nfad-resflow’, respectively.
Step 2. Training classifier
Once the normalizing flow for anomaly sampling is trained, a classifier can be trained on normal samples and a mixture of real and surrogate anomalies sampled from the NF (Fig. 3). In our experiments, we used the binary cross-entropy objective, Eq. (2), to train the classifier. We do not focus on classifier configuration since any classification model can be used at this step.
Final algorithm
The final scheme of the algorithm is shown in Fig. 3, accompanied by the pseudocode in Algorithm 1. All training details are given in Appendix A.
Input: normal samples C+, anomaly samples C− (may be empty), p-value of tail p, number of epochs for NF E_NF, number of epochs for classifier E_CLF
Output: anomaly classifier g_ϕ
for epoch from 1 to E_NF do
    sample minibatch of normal samples X+ ∼ C+;
    calculate the NF bijection between points on the Gaussian Z+ and normal samples X+: Z+ = f^{−1}(X+; θ);
    update parameters θ of NF f with a gradient ascent step: ∇_θ log p(X+) = ∇_θ [log p(Z+) − log |det J(f ∣ Z+)|];
end
for epoch from 1 to E_CLF do
    sample Z̃ from the Gaussian tail: Z̃ ∼ N(0, I) s.t. p(Z̃) ≤ p;
    sample surrogate anomalies X̃ using the NF: X̃ = f(Z̃; θ);
    sample minibatch of normal samples: X+ ∼ C+;
    sample minibatch of anomalies (if C− is not empty): X− ∼ C−;
    update parameters ϕ of classifier g_ϕ with a gradient descent step on the binary cross-entropy: ∇_ϕ [−log g_ϕ(X+) − log(1 − g_ϕ(X−)) − log(1 − g_ϕ(X̃))];
end
Algorithm 1: NFAD algorithm
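The two steps of Algorithm 1 can be sketched end to end on toy data. To keep the example self-contained, the flow is replaced by a single affine transformation x = μ + Lz, whose maximum-likelihood fit is available in closed form, and the classifier is a logistic regression on quadratic features. Both substitutions, the toy data, and all hyperparameter values are assumptions for illustration; the experiments in this paper use NSF, IAF and ResFlow with neural classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): the normal class is a 2-D Gaussian blob, C- is empty.
x_normal = rng.normal(loc=[1.0, -1.0], scale=[1.0, 0.5], size=(2000, 2))

# Step 1: fit the affine flow x = mu + L z by maximum likelihood
# (closed form: sample mean and Cholesky factor of the sample covariance).
mu = x_normal.mean(axis=0)
chol = np.linalg.cholesky(np.cov(x_normal, rowvar=False))

# Step 2a: sample latent points from the Gaussian tail and map them to data space.
tail_p = 0.01
proposals = rng.standard_normal(size=(200_000, 2))
sq_norm = np.sum(proposals ** 2, axis=1)
z_tail = proposals[sq_norm >= np.quantile(sq_norm, 1.0 - tail_p)][:2000]
x_surrogate = mu + z_tail @ chol.T

# Step 2b: train a binary classifier on normal samples vs. surrogate anomalies.
def features(x):
    """Quadratic features of standardized inputs (hypothetical classifier choice)."""
    u = (x - mu) / x_normal.std(axis=0)
    return np.column_stack([np.ones(len(u)), u, u ** 2, u[:, 0] * u[:, 1]])

inputs = features(np.vstack([x_normal, x_surrogate]))
labels = np.concatenate([np.ones(len(x_normal)), np.zeros(len(x_surrogate))])
weights = np.zeros(inputs.shape[1])
for _ in range(5000):
    probs = 1.0 / (1.0 + np.exp(-inputs @ weights))
    # Gradient descent step on the mean binary cross-entropy.
    weights -= 0.02 * inputs.T @ (probs - labels) / len(labels)

def score(x):
    """Estimated probability that x belongs to the normal class."""
    return 1.0 / (1.0 + np.exp(-features(x) @ weights))
```

Known anomalies, when available, would simply be appended to `x_surrogate` with label 0 before training, which is what makes the scheme independent of the classifier.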
Results
We evaluate the proposed method on the following tabular and image datasets: KDD99 (Stolfo et al., 1999), SUSY (Whiteson, 2014), HIGGS (Baldi, Sadowski & Whiteson, 2014), MNIST (LeCun et al., 1998a), Omniglot (Lake, Salakhutdinov & Tenenbaum, 2015) and CIFAR (Krizhevsky, Hinton et al., 2009). In order to reflect typical AD cases behind the approach, we derive multiple tasks from each dataset by varying sizes of anomalous datasets.
As the proposed method targets problems that are intermediate between one-class and two-class problems, we compare the proposed approach with the following algorithms:
- one-class methods: Robust AutoEncoder (RAE-OC; Chalapathy, Krishna Menon & Chawla, 2017) and Deep SVDD (Ruff et al., 2018);
- conventional two-class classification;
- semi-supervised methods: dimensionality reduction by a Deep AutoEncoder followed by two-class classification (DAE), Feature Encoding with AutoEncoders for Weakly-supervised Anomaly Detection (FEAWAD; Zhou et al., 2021), DevNet (Pang, Shen & Van den Hengel, 2019), the 1 + ɛ method (Borisyak et al., 2020) (‘*ope’), Deep SAD (Ruff et al., 2019) and Deep Weakly-supervised Anomaly Detection (PRO; Pang et al., 2019).
We compare the algorithms using the ROC AUC metric to avoid unnecessary optimization for threshold-dependent metrics like accuracy, precision, or F1. Tables 1, 2 and 3 show the experimental results on tabular data. Tables 4, 5 and 6 show the experimental results on image data. Some of the aforementioned algorithms, like DevNet, are applicable only to tabular data and are therefore not reported on image data. In these tables, columns represent tasks with a varying number of negative samples presented in the training set: numbers in the header indicate either the number of classes that form the negative class (in the case of the KDD, CIFAR, Omniglot and MNIST datasets) or the number of negative samples used (HIGGS and SUSY); ‘one class’ denotes the absence of known anomalous samples. As one-class algorithms do not take into account negative samples, their results are identical for the tasks with any number of known anomalies. The best score in each column is highlighted in bold font.
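For reference, ROC AUC can be computed without choosing any threshold, as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below is a direct pairwise implementation of that definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, and the evaluation code used for the tables may differ.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC as P(score of a positive > score of a negative); ties count 1/2."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Three of the four (positive, negative) pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples and is meant only to make the definition explicit; rank-based implementations are preferable for large test sets.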
Table 1: ROC AUC on tabular data; column headers give the number of known anomalous classes.
| Method | one class | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| rae-oc | 0.972 ± 0.006 | 0.972 ± 0.006 | 0.972 ± 0.006 | 0.972 ± 0.006 | 0.972 ± 0.006 |
| deep-svdd-oc | 0.939 ± 0.014 | 0.939 ± 0.014 | 0.939 ± 0.014 | 0.939 ± 0.014 | 0.939 ± 0.014 |
| two-class | – | 0.571 ± 0.213 | 0.700 ± 0.182 | 0.687 ± 0.268 | 0.619 ± 0.257 |
| dae | – | 0.685 ± 0.258 | 0.531 ± 0.286 | 0.758 ± 0.171 | 0.865 ± 0.087 |
| brute-force-ope | 0.564 ± 0.122 | 0.667 ± 0.175 | 0.606 ± 0.261 | 0.737 ± 0.187 | 0.541 ± 0.257 |
| hmc-eope | 0.739 ± 0.245 | 0.885 ± 0.152 | 0.919 ± 0.055 | 0.863 ± 0.094 | 0.958 ± 0.023 |
| rmsprop-eope | 0.765 ± 0.216 | 0.960 ± 0.017 | 0.854 ± 0.187 | 0.964 ± 0.016 | 0.976 ± 0.011 |
| deep-eope | 0.602 ± 0.279 | 0.701 ± 0.230 | 0.528 ± 0.300 | 0.749 ± 0.209 | 0.785 ± 0.259 |
| devnet | – | 0.557 ± 0.104 | 0.594 ± 0.111 | 0.698 ± 0.163 | 0.812 ± 0.164 |
| feawad | – | 0.862 ± 0.088 | 0.913 ± 0.069 | 0.892 ± 0.101 | 0.937 ± 0.083 |
| deep-sad | 0.803 ± 0.236 | 0.868 ± 0.182 | 0.942 ± 0.022 | 0.943 ± 0.069 | 0.968 ± 0.007 |
| pro | – | 0.726 ± 0.179 | 0.728 ± 0.163 | 0.870 ± 0.128 | 0.905 ± 0.106 |
| nfad (iaf) | 0.981 ± 0.001 | 0.984 ± 0.002 | 0.993 ± 0.002 | 0.997 ± 0.002 | 0.997 ± 0.002 |
| nfad (nsf) | 0.704 ± 0.007 | 0.875 ± 0.121 | 0.901 ± 0.082 | 0.926 ± 0.041 | 0.945 ± 0.022 |
Table 2: ROC AUC on tabular data; column headers give the number of known anomalous samples.
| Method | one class | 100 | 1000 | 10000 | 1000000 |
|---|---|---|---|---|---|
| rae-oc | 0.531 ± 0.000 | 0.531 ± 0.000 | 0.531 ± 0.000 | 0.531 ± 0.000 | 0.531 ± 0.000 |
| deep-svdd-oc | 0.513 ± 0.000 | 0.513 ± 0.000 | 0.513 ± 0.000 | 0.513 ± 0.000 | 0.513 ± 0.000 |
| two-class | – | 0.504 ± 0.017 | 0.529 ± 0.007 | 0.566 ± 0.006 | 0.858 ± 0.002 |
| dae | – | 0.502 ± 0.003 | 0.522 ± 0.003 | 0.603 ± 0.002 | 0.745 ± 0.005 |
| brute-force-ope | 0.508 ± 0.000 | 0.500 ± 0.009 | 0.520 ± 0.003 | 0.572 ± 0.005 | 0.859 ± 0.001 |
| hmc-eope | 0.509 ± 0.000 | 0.523 ± 0.005 | 0.567 ± 0.008 | 0.648 ± 0.005 | 0.848 ± 0.001 |
| rmsprop-eope | 0.503 ± 0.000 | 0.506 ± 0.008 | 0.531 ± 0.008 | 0.593 ± 0.011 | 0.861 ± 0.000 |
| deep-eope | 0.531 ± 0.000 | 0.537 ± 0.011 | 0.560 ± 0.008 | 0.628 ± 0.005 | 0.860 ± 0.001 |
| devnet | – | 0.565 ± 0.011 | 0.697 ± 0.006 | 0.748 ± 0.004 | 0.748 ± 0.003 |
| feawad | – | 0.551 ± 0.009 | 0.555 ± 0.014 | 0.554 ± 0.020 | 0.549 ± 0.018 |
| deep-sad | 0.502 ± 0.010 | 0.511 ± 0.006 | 0.561 ± 0.016 | 0.740 ± 0.011 | 0.833 ± 0.002 |
| pro | – | 0.533 ± 0.022 | 0.569 ± 0.011 | 0.570 ± 0.012 | 0.582 ± 0.015 |
| nfad (iaf) | 0.572 ± 0.009 | 0.574 ± 0.008 | 0.586 ± 0.009 | 0.623 ± 0.007 | 0.750 ± 0.008 |
| nfad (nsf) | 0.531 ± 0.010 | 0.519 ± 0.008 | 0.554 ± 0.009 | 0.659 ± 0.007 | 0.807 ± 0.007 |
Table 3: ROC AUC on tabular data; column headers give the number of known anomalous samples.
| Method | one class | 100 | 1000 | 10000 | 1000000 |
|---|---|---|---|---|---|
| rae-oc | 0.586 ± 0.000 | 0.586 ± 0.000 | 0.586 ± 0.000 | 0.586 ± 0.000 | 0.586 ± 0.000 |
| deep-svdd-oc | 0.568 ± 0.000 | 0.568 ± 0.000 | 0.568 ± 0.000 | 0.568 ± 0.000 | 0.568 ± 0.000 |
| two-class | – | 0.652 ± 0.031 | 0.742 ± 0.011 | 0.792 ± 0.004 | 0.878 ± 0.000 |
| dae | – | 0.715 ± 0.020 | 0.766 ± 0.009 | 0.847 ± 0.002 | 0.876 ± 0.000 |
| brute-force-ope | 0.597 ± 0.000 | 0.672 ± 0.020 | 0.748 ± 0.012 | 0.792 ± 0.003 | 0.878 ± 0.000 |
| hmc-eope | 0.528 ± 0.000 | 0.738 ± 0.019 | 0.770 ± 0.012 | 0.816 ± 0.006 | 0.877 ± 0.000 |
| rmsprop-eope | 0.528 ± 0.000 | 0.714 ± 0.019 | 0.760 ± 0.016 | 0.807 ± 0.004 | 0.877 ± 0.000 |
| deep-eope | 0.652 ± 0.000 | 0.670 ± 0.054 | 0.746 ± 0.024 | 0.813 ± 0.003 | 0.878 ± 0.000 |
| devnet | – | 0.747 ± 0.023 | 0.849 ± 0.002 | 0.853 ± 0.002 | 0.854 ± 0.004 |
| feawad | – | 0.758 ± 0.019 | 0.760 ± 0.028 | 0.760 ± 0.022 | 0.762 ± 0.025 |
| deep-sad | 0.534 ± 0.022 | 0.581 ± 0.027 | 0.785 ± 0.014 | 0.860 ± 0.009 | 0.872 ± 0.008 |
| pro | – | 0.833 ± 0.008 | 0.861 ± 0.002 | 0.863 ± 0.001 | 0.863 ± 0.002 |
| nfad (iaf) | 0.701 ± 0.007 | 0.801 ± 0.007 | 0.829 ± 0.007 | 0.868 ± 0.006 | 0.880 ± 0.000 |
| nfad (nsf) | 0.785 ± 0.001 | 0.811 ± 0.013 | 0.855 ± 0.012 | 0.865 ± 0.001 | 0.876 ± 0.003 |
Table 4: ROC AUC on image data; column headers give the number of known anomalous classes.
| Method | one class | 1 | 2 | 4 |
|---|---|---|---|---|
| nn-oc | 0.787 ± 0.139 | 0.787 ± 0.139 | 0.787 ± 0.139 | 0.787 ± 0.139 |
| rae-oc | 0.978 ± 0.017 | 0.978 ± 0.017 | 0.978 ± 0.017 | 0.978 ± 0.017 |
| deep-svdd-oc | 0.641 ± 0.086 | 0.641 ± 0.086 | 0.641 ± 0.086 | 0.641 ± 0.086 |
| two-class | – | 0.879 ± 0.108 | 0.957 ± 0.050 | 0.987 ± 0.014 |
| dae | – | 0.934 ± 0.035 | 0.964 ± 0.032 | 0.984 ± 0.012 |
| brute-force-ope | 0.783 ± 0.120 | 0.915 ± 0.096 | 0.968 ± 0.041 | 0.986 ± 0.015 |
| hmc-eope | 0.694 ± 0.167 | 0.933 ± 0.060 | 0.974 ± 0.023 | 0.989 ± 0.011 |
| rmsprop-eope | 0.720 ± 0.186 | 0.933 ± 0.062 | 0.977 ± 0.023 | 0.990 ± 0.009 |
| deep-eope | 0.793 ± 0.129 | 0.942 ± 0.048 | 0.979 ± 0.016 | 0.991 ± 0.007 |
| deep-sad | 0.636 ± 0.114 | 0.859 ± 0.094 | 0.908 ± 0.071 | 0.947 ± 0.059 |
| pro | – | 0.911 ± 0.096 | 0.944 ± 0.065 | 0.952 ± 0.079 |
| nfad (resflow) | 0.682 ± 0.115 | 0.909 ± 0.959 | 0.935 ± 0.111 | 0.972 ± 0.019 |
Table 5: ROC AUC on image data; column headers give the number of known anomalous classes.
| Method | one class | 1 | 2 | 4 |
|---|---|---|---|---|
| nn-oc | 0.532 ± 0.101 | 0.532 ± 0.101 | 0.532 ± 0.101 | 0.532 ± 0.101 |
| rae-oc | 0.585 ± 0.126 | 0.585 ± 0.126 | 0.585 ± 0.126 | 0.585 ± 0.126 |
| deep-svdd-oc | 0.546 ± 0.058 | 0.546 ± 0.058 | 0.546 ± 0.058 | 0.546 ± 0.058 |
| two-class | – | 0.659 ± 0.093 | 0.708 ± 0.086 | 0.748 ± 0.082 |
| dae | – | 0.587 ± 0.109 | 0.634 ± 0.109 | 0.671 ± 0.093 |
| brute-force-ope | 0.540 ± 0.101 | 0.688 ± 0.087 | 0.719 ± 0.079 | 0.757 ± 0.073 |
| hmc-eope | 0.547 ± 0.116 | 0.678 ± 0.091 | 0.709 ± 0.084 | 0.739 ± 0.074 |
| rmsprop-eope | 0.565 ± 0.111 | 0.678 ± 0.081 | 0.715 ± 0.083 | 0.746 ± 0.069 |
| deep-eope | 0.564 ± 0.094 | 0.674 ± 0.100 | 0.690 ± 0.092 | 0.719 ± 0.099 |
| deep-sad | 0.532 ± 0.061 | 0.653 ± 0.072 | 0.680 ± 0.069 | 0.689 ± 0.065 |
| pro | – | 0.635 ± 0.081 | 0.653 ± 0.075 | 0.670 ± 0.069 |
| nfad (resflow) | 0.597 ± 0.083 | 0.800 ± 0.095 | 0.863 ± 0.042 | 0.877 ± 0.045 |
Table 6: ROC AUC on image data; column headers give the number of known anomalous classes.
| Method | one class | 1 | 2 | 4 |
|---|---|---|---|---|
| nn-oc | 0.521 ± 0.166 | 0.521 ± 0.166 | 0.521 ± 0.166 | 0.521 ± 0.166 |
| rae-oc | 0.771 ± 0.221 | 0.771 ± 0.221 | 0.771 ± 0.221 | 0.771 ± 0.221 |
| deep-svdd-oc | 0.640 ± 0.153 | 0.640 ± 0.153 | 0.640 ± 0.153 | 0.640 ± 0.153 |
| two-class | – | 0.799 ± 0.162 | 0.862 ± 0.115 | 0.855 ± 0.125 |
| dae | – | 0.737 ± 0.134 | 0.821 ± 0.104 | 0.805 ± 0.121 |
| brute-force-ope | 0.503 ± 0.213 | 0.724 ± 0.222 | 0.765 ± 0.208 | 0.825 ± 0.126 |
| hmc-eope | 0.710 ± 0.178 | 0.801 ± 0.139 | 0.842 ± 0.112 | 0.842 ± 0.115 |
| rmsprop-eope | 0.678 ± 0.274 | 0.821 ± 0.143 | 0.855 ± 0.112 | 0.863 ± 0.111 |
| deep-eope | 0.696 ± 0.172 | 0.808 ± 0.140 | 0.851 ± 0.110 | 0.842 ± 0.122 |
| deep-sad | 0.832 ± 0.123 | 0.856 ± 0.123 | 0.885 ± 0.095 | 0.884 ± 0.091 |
| pro | – | 0.750 ± 0.160 | 0.765 ± 0.163 | 0.787 ± 0.153 |
| nfad (resflow) | 0.567 ± 0.108 | 0.727 ± 0.188 | 0.868 ± 0.111 | 0.870 ± 0.102 |
Discussion
Our tests suggest that the best results are achieved when the normal class distribution has a single mode and convex borders. These requirements are data-specific and cannot be effectively addressed in our algorithm. The effects can be seen in Fig. 2, where two modes result in a “bridge” in the reconstructed standard class shape, and the non-convexity of the borders results in a worse description of the separation line.
Also, hyperparameters such as the Jacobian regularization λ and the tail size p must be chosen carefully. This is illustrated in Figs. 1 and 2, where we show the sample quality and the performance of our algorithm for different hyperparameter values. To find suitable values, some heuristics can be used. For instance, the optimal tail location p can be estimated based on known anomalies from the training dataset, whereas the Jacobian regularization λ in the NF training process can be linearly scheduled like the KL factor in Hasan et al. (2020).
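One possible form of the linear schedule for λ is sketched below. The ramp direction (from 0 up to a maximum value), the warm-up length and the function name `linear_lambda_schedule` are assumptions made for illustration; the text above only states that λ can be scheduled linearly.

```python
def linear_lambda_schedule(step, warmup_steps, lambda_max):
    """Linearly ramp the Jacobian regularization weight from 0 to lambda_max."""
    if warmup_steps <= 0:
        return lambda_max
    return lambda_max * min(1.0, step / warmup_steps)

# Weight used at each of the first seven training steps (hypothetical values).
schedule = [linear_lambda_schedule(s, warmup_steps=4, lambda_max=2.0) for s in range(7)]
```

With such a ramp the flow first fits the data likelihood almost unconstrained and is only gradually pushed towards a unit Jacobian.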
On tabular data (Tables 1, 2 and 3), the proposed NFAD method shows a statistically significant improvement over other AD algorithms in many experiments where the number of anomalous samples is extremely low.
On image data (Tables 4, 5 and 6), the proposed method shows competitive quality alongside other state-of-the-art AD methods, significantly outperforming the existing algorithms on the CIFAR dataset.
Our experiments suggest that the main reason for the proposed method's lower performance with respect to others on image data is the tendency of normalizing flows to estimate the likelihood of images from their local features instead of common semantics, as described by Kirichenko, Izmailov & Wilson (2020). We also find that overfitting of the classifier must be carefully monitored and addressed, as it might lead to the deterioration of the algorithm.
However, the results obtained on the HIGGS, KDD, SUSY and CIFAR-10 datasets demonstrate the great potential of the proposed method over previous AD algorithms. With the advancement of new ways of applying NFs to images, the results are expected to improve for this class of datasets as well. In particular, we believe our method to be widely applicable in the industrial environment, where the task of AD can take advantage of both tabular and image-like datasets.
It should also be emphasized that, unlike state-of-the-art AD algorithms (Pang et al., 2019; Zhou et al., 2021; Ruff et al., 2019), we propose a model-agnostic data augmentation algorithm that does not modify the AD model training scheme or architecture. It enriches the input set of training anomalies, requiring only normal samples in the augmentation process (Fig. 3).
Conclusion
In this work, we present a new model-agnostic anomaly detection training scheme that deals efficiently with problems that are hard to address with either one-class or two-class methods. The solution combines the best features of one-class and two-class approaches. In contrast to one-class approaches, the proposed method makes the classifier effectively utilize any number of known anomalous examples, but, unlike conventional two-class classification, does not require an extensive number of anomalous samples. The proposed algorithm significantly outperforms the existing anomaly detection algorithms in most realistic anomaly detection cases. This approach is especially beneficial for anomaly detection problems in which anomalous data is non-representative or might drift over time.
The proposed method is fast, stable and flexible in both the training and inference stages; unlike previous methods, any classifier can be used in the scheme with any number of anomalies in the training dataset. Such a universal augmentation scheme opens wide prospects for further anomaly detection study and makes it possible to use any classifier on any kind of data. Also, the results on image datasets can be improved as new normalizing flow techniques become available.