Adaptive divergence for rapid adversarial optimization
 Academic Editor
 Charles Elkan
 Subject Areas
 Data Mining and Machine Learning, Optimization Theory and Computation, Scientific Computing and Simulation
 Keywords
 Adversarial optimization, Black-box optimization, Computer simulations
 Copyright
 © 2020 Borisyak et al.
 Licence
 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
 Cite this article
 2020. Adaptive divergence for rapid adversarial optimization. PeerJ Computer Science 6:e274 https://doi.org/10.7717/peerjcs.274
Abstract
Adversarial Optimization provides a reliable, practical way to match two implicitly defined distributions, one of which is typically represented by a sample of real data, and the other is represented by a parameterized generator. Matching of the distributions is achieved by minimizing a divergence between these distributions, and estimation of the divergence involves a secondary optimization task, which typically requires training a model to discriminate between these distributions. The choice of the model involves a trade-off: high-capacity models provide good estimations of the divergence but generally require large sample sizes to be properly trained. In contrast, low-capacity models tend to require fewer samples for training; however, they might provide biased estimations. Computational costs of Adversarial Optimization become significant when sampling from the generator is expensive. One practical example of such a setting is fine-tuning parameters of complex computer simulations. In this work, we introduce a novel family of divergences that enables faster optimization convergence measured by the number of samples drawn from the generator. The variation of the underlying discriminator model capacity during optimization leads to a significant speed-up. The proposed divergence family suggests using low-capacity models to compare distant distributions (typically, at early optimization steps), and the capacity gradually grows as the distributions become closer to each other. Thus, it allows for a significant acceleration of the initial stages of optimization. This acceleration was demonstrated on two fine-tuning problems involving the Pythia event generator and two of the most popular black-box optimization algorithms: Bayesian Optimization and Variational Optimization. Experiments show that, given the same budget, adaptive divergences yield results up to an order of magnitude closer to the optimum than Jensen–Shannon divergence. While we consider physics-related simulations, adaptive divergences can be applied to any stochastic simulation.
Introduction
Adversarial Optimization (AO), introduced in Generative Adversarial Networks (Goodfellow et al., 2014), became popular in many areas of machine learning and beyond with applications ranging from generative (Radford, Metz & Chintala, 2015) and inference tasks (Dumoulin et al., 2016), improving image quality (Isola et al., 2017) to tuning stochastic computer simulations (Louppe, Hermans & Cranmer, 2017).
AO provides a reliable, practical way to match two implicitly defined distributions, one of which is typically represented by a sample of real data, and the other is represented by a parameterized generator. Matching of the distributions is achieved by minimizing a divergence between these distributions, and estimation of the divergence involves a secondary optimization task, which typically requires training a model to discriminate between these distributions. The model is referred to as a discriminator or a critic (for simplicity, we use the term discriminator everywhere below).
Training a high-capacity model, however, is computationally expensive (Metz et al., 2016) as each step of divergence minimization is accompanied by fitting the discriminator; therefore, adversarial training often requires significantly more computational resources than, for example, training a classification model with a comparable network architecture.^{1} Nevertheless, in conventional settings like GAN, this problem is not pronounced for at least two reasons. Firstly, the generator is usually represented by a deep neural network, and sampling is computationally cheap; thus, for properly training the discriminator, a sample of a sufficient size can be quickly drawn. Secondly, GAN training procedures are often regarded not as minimization of a divergence, but as game-like dynamics (Li et al., 2017; Mescheder, Geiger & Nowozin, 2018); such dynamics typically employ gradient optimization with small incremental steps, which involve relatively small sample sizes for adapting the previous discriminator to an updated generator configuration.
Computational costs of AO become significant when sampling from the generator is computationally expensive, or the optimization procedure does not operate by performing small incremental steps (Metz et al., 2016). One practical example of such a setting is fine-tuning parameters of complex computer simulations. Such simulators are usually based on physics laws expressed in computational mathematical forms like differential or stochastic equations. Those equations relate input or initial conditions to the observable quantities under conditions of parameters that define physics laws, geometry, or other valuable properties of the simulation; these parameters do not depend on inputs or initial conditions. It is not uncommon that such simulations have very high computational complexity. For example, the simulation of a single proton collision event in the CERN ATLAS detector takes several minutes on a single-core CPU (The ATLAS Collaboration, 2010). Due to typically high dimensionality, fine-tuning takes a considerable number of samples, which in turn increases the computational burden.
Another essential property of such computer simulations is the lack of gradient information over the simulation parameters. Computations are represented by sophisticated computer programs, which are challenging to differentiate.^{2} Thus, global black-box optimization methods are often employed; Bayesian Optimization is one of the most popular approaches.
In this work, we introduce a novel family of divergences that enables faster optimization convergence measured by the number of samples drawn from the generator. The variation of the underlying discriminator model capacity during optimization leads to a significant speed-up. The proposed divergence family suggests using low-capacity models to compare distant distributions (typically, at early optimization steps), and the capacity gradually grows as the distributions become closer to each other. Thus, it allows for a significant acceleration of the initial stages of optimization. Additionally, the proposed family of divergences is broad, which offers a wide range of opportunities for further research.
We demonstrate the basic idea with some toy examples, and with a realistic challenge of tuning the Pythia event generator (Sjöstrand, Mrenna & Skands, 2006; Sjostrand et al., 2015) following Louppe, Hermans & Cranmer (2017) and Ilten, Williams & Yang (2017). We consider physics-related simulations; nevertheless, all proposed methods are simulation-agnostic.
Background
Adversarial Optimization, initially introduced for Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), offers a general strategy for matching two distributions. Consider feature space $\mathcal{X}$, ground-truth distribution P, and parametrized family of distributions Q_{ψ} implicitly defined by a generator with parameters ψ. Formally, we wish to find such ψ^{∗} that P = Q_{ψ∗} almost everywhere. AO achieves that by minimizing a divergence or a distance between P and Q_{ψ} with respect to ψ. One of the most popular divergences is Jensen–Shannon divergence: (1)$\mathrm{JSD}\left(P,{Q}_{\psi}\right)=\frac{1}{2}\left[\mathrm{KL}\left(P\parallel {M}_{\psi}\right)+\mathrm{KL}\left({Q}_{\psi}\parallel {M}_{\psi}\right)\right]=\frac{1}{2}\underset{x\sim P}{\mathbb{E}}\log\frac{P\left(x\right)}{{M}_{\psi}\left(x\right)}+\frac{1}{2}\underset{x\sim {Q}_{\psi}}{\mathbb{E}}\log\frac{{Q}_{\psi}\left(x\right)}{{M}_{\psi}\left(x\right)};$ where KL is the Kullback–Leibler divergence and ${M}_{\psi}\left(x\right)=\frac{1}{2}\left(P\left(x\right)+{Q}_{\psi}\left(x\right)\right)$. The main insight of Goodfellow et al. (2014) is that JSD can be estimated by training a discriminator f to distinguish between P and Q_{ψ}: (2)$\log 2-\underset{f\in \mathcal{F}}{\min}L\left(f,P,{Q}_{\psi}\right)=\log 2-\underset{f\in \mathcal{F}}{\min}\left\{-\frac{1}{2}\underset{x\sim P}{\mathbb{E}}\log f\left(x\right)-\frac{1}{2}\underset{x\sim {Q}_{\psi}}{\mathbb{E}}\log\left(1-f\left(x\right)\right)\right\}=\log 2+\left\{\frac{1}{2}\underset{x\sim P}{\mathbb{E}}\log {f}^{\ast}\left(x\right)+\frac{1}{2}\underset{x\sim {Q}_{\psi}}{\mathbb{E}}\log\left(1-{f}^{\ast}\left(x\right)\right)\right\}=\log 2+\left\{\frac{1}{2}\underset{x\sim P}{\mathbb{E}}\log\frac{P\left(x\right)}{{Q}_{\psi}\left(x\right)+P\left(x\right)}+\frac{1}{2}\underset{x\sim {Q}_{\psi}}{\mathbb{E}}\log\frac{{Q}_{\psi}\left(x\right)}{{Q}_{\psi}\left(x\right)+P\left(x\right)}\right\}=\frac{1}{2}\underset{x\sim P}{\mathbb{E}}\log\frac{P\left(x\right)}{{M}_{\psi}\left(x\right)}+\frac{1}{2}\underset{x\sim {Q}_{\psi}}{\mathbb{E}}\log\frac{{Q}_{\psi}\left(x\right)}{{M}_{\psi}\left(x\right)}=\mathrm{JSD}\left(P,{Q}_{\psi}\right);$ where L is the cross-entropy loss function, $\mathcal{F}=\left\{f:\mathcal{X}\to \left[0,1\right]\right\}$ is the set of all possible discriminators, and f^{∗} is the optimal discriminator. Similar formulations also exist for other divergences such as Wasserstein (Arjovsky, Chintala & Bottou, 2017) and Cramer (Bellemare et al., 2017) distances.
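The identity in Eq. (2) can be checked numerically on a small discrete space, where the optimal discriminator f^{∗}(x) = P(x)∕(P(x) + Q(x)) is available in closed form. The sketch below is illustrative only: the two distributions and the helper names (`kl`, `jsd`, `cross_entropy`, `optimal_discriminator`) are assumptions made for this example and do not come from the original work.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, q):
    """Jensen-Shannon divergence as in Eq. (1)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cross_entropy(f, p, q):
    """L(f, P, Q) = -1/2 E_P log f(x) - 1/2 E_Q log(1 - f(x))."""
    loss = 0.0
    for fx, a, b in zip(f, p, q):
        if a > 0:  # points outside the support contribute nothing
            loss -= 0.5 * a * math.log(fx)
        if b > 0:
            loss -= 0.5 * b * math.log(1 - fx)
    return loss

def optimal_discriminator(p, q):
    """f*(x) = P(x) / (P(x) + Q(x))."""
    return [a / (a + b) for a, b in zip(p, q)]

# Illustrative distributions on a four-point space (assumed values).
P = [0.5, 0.3, 0.2, 0.0]
Q = [0.1, 0.2, 0.3, 0.4]

f_star = optimal_discriminator(P, Q)
estimate = math.log(2) - cross_entropy(f_star, P, Q)

# A constant discriminator f = 1/2 carries no information about P and Q.
f_const = [0.5] * len(P)
constant_estimate = math.log(2) - cross_entropy(f_const, P, Q)
```

Here `estimate` coincides with `jsd(P, Q)` up to floating-point error, while the constant discriminator yields an estimate of zero, which is the lowest value a discriminator family containing f ≡ 1∕2 can report.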
In classical GAN, both generator and discriminator are represented by differentiable neural networks. Hence, a subgradient of JSD(P, Q_{ψ}) can be easily computed (Goodfellow et al., 2014). The minimization of the divergence can be performed by a gradient method, and the optimization procedure iterates over the following steps:

using parameters of the discriminator from the previous iteration as an initial guess, adjust f by performing several steps of the gradient descent to minimize $\mathcal{L}\left(f,P,{Q}_{\psi}\right)$;

considering f as a constant, compute the gradient of $\mathcal{L}\left(f,P,{Q}_{\psi}\right)$ w.r.t. ψ, perform one step of the gradient ascent.
For computationally heavy generators, computing gradients is usually infeasible in practice; therefore, we consider black-box optimization methods. One of the most promising methods for black-box AO is Adversarial Variational Optimization (Louppe, Hermans & Cranmer, 2017), which combines AO with Variational Optimization (Wierstra et al., 2014). This method improves upon conventional Variational Optimization (VO) over Jensen–Shannon divergence by training a single discriminator to distinguish samples from the ground-truth distribution and samples from a mixture of generators, where the mixture is defined by the search distribution of VO. This eliminates the need to train a classifier for each individual set of parameters drawn from the search distribution.
Bayesian Optimization (BO) (Mockus, 2012) is another commonly used black-box optimization method, with applications including tuning of complex simulations (Ilten, Williams & Yang, 2017). As we demonstrate in ‘Experiments’, BO can be successfully applied for Adversarial Optimization.
Adaptive Divergence
Notice that in Eq. (2) minimization is carried out over the set of all possible discriminators $\mathcal{F}=\left\{f:\mathcal{X}\mapsto \left[0,1\right]\right\}$. In practice, this is intractable, and the set $\mathcal{F}$ is approximated by a model such as a deep neural network. Everywhere below, we use the terms ‘low-capacity’ and ‘high-capacity’ to describe the set of feasible discriminator functions: low-capacity models either represent a narrow set of functions (e.g., logistic regression, shallow decision trees) or are heavily regularized (see ‘Implementation’ for more examples of capacity regulation); high-capacity models are sufficient for estimating JSD for the Adversarial Optimization problem under consideration.
In conventional GAN settings, the generator is represented by a neural network, sampling is computationally cheap, and usage of high-capacity discriminators is satisfactory. In our case, as was discussed above, simulations tend to be computationally heavy, which, combined with the typically slow convergence of black-box optimization algorithms, might make AO with a high-capacity model practically intractable.
The choice of the model involves a trade-off: high-capacity models provide good estimations of JSD but generally require large sample sizes to be properly trained. In contrast, low-capacity models tend to require fewer samples for training; however, they might provide biased estimations. For example, if the classifier is represented by a narrow set of functions $M\subseteq \mathcal{F}$, then the quantity (3)${D}_{M}\left(P,Q\right)=\log 2-\underset{f\in M}{\min}L\left(f,P,Q\right)$ might no longer be a divergence, so we refer to it as a pseudodivergence.
Definition 1. A function D:Π(𝒳) × Π(𝒳) → ℝ is a pseudodivergence if:
(P1) $\forall P,Q\in \Pi \left(\mathcal{X}\right):D\left(P,Q\right)\ge 0$;
(P2) $\forall P,Q\in \Pi \left(\mathcal{X}\right):\left(P=Q\right)\Rightarrow D\left(P,Q\right)=0$; where $\Pi \left(\mathcal{X}\right)$ is the set of all probability distributions on space $\mathcal{X}$.
It is tempting to use a pseudodivergence D_{M} produced by a low-capacity model M for Adversarial Optimization. However, a pseudodivergence might not guarantee proper convergence, as there might exist such ψ ∈ Ψ that JSD(P, Q_{ψ}) > 0 while D_{M}(P, Q_{ψ}) = 0. For example, a naive Bayes classifier is unable to distinguish between P and Q that have the same marginal distributions. Nevertheless, if model M is capable of distinguishing between P and some Q_{ψ}, D_{M} still provides information about the position of the optimal parameters ψ^{∗} in the configuration space by narrowing the search volume; Ilten, Williams & Yang (2017) offer a good demonstration of this statement.
The core idea of this work is to replace Jensen–Shannon divergence with a so-called adaptive divergence that gradually adjusts model capacity depending on the ‘difficulty’ of the classification problem, with the most ‘difficult’ problem being distinguishing between two equal distributions. Formally, this gradual increase in model complexity can be captured by the following definitions.
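The naive Bayes failure case can be made concrete with a minimal sketch on the space {0, 1}², where P is concentrated on the diagonal and Q on the anti-diagonal, so that all marginal distributions coincide. The distributions, the `naive_bayes` discriminator (built from per-feature marginals only) and the other helper names are assumptions chosen for illustration; they are not taken from the original experiments.

```python
import math
from itertools import product

POINTS = list(product([0, 1], repeat=2))

# XOR-like pair: identical marginals, disjoint supports (assumed toy example).
P = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
Q = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 0.5, (1, 0): 0.5}

def marginal(dist, axis, value):
    """Probability that coordinate `axis` takes `value`."""
    return sum(w for x, w in dist.items() if x[axis] == value)

def naive_bayes(p, q):
    """Discriminator that models each class as a product of its marginals."""
    f = {}
    for x in POINTS:
        px = math.prod(marginal(p, i, x[i]) for i in range(2))
        qx = math.prod(marginal(q, i, x[i]) for i in range(2))
        f[x] = px / (px + qx)
    return f

def cross_entropy(f, p, q):
    """L(f, P, Q) = -1/2 E_P log f(x) - 1/2 E_Q log(1 - f(x))."""
    loss = 0.0
    for x in POINTS:
        if p[x] > 0:
            loss -= 0.5 * p[x] * math.log(f[x])
        if q[x] > 0:
            loss -= 0.5 * q[x] * math.log(1 - f[x])
    return loss

def pseudodivergence(f, p, q):
    """log 2 - L(f, P, Q) for a given (not necessarily optimal) discriminator."""
    return math.log(2) - cross_entropy(f, p, q)

f_nb = naive_bayes(P, Q)
f_opt = {x: P[x] / (P[x] + Q[x]) for x in POINTS}

d_nb = pseudodivergence(f_nb, P, Q)    # naive Bayes: blind to the difference
d_opt = pseudodivergence(f_opt, P, Q)  # optimal discriminator: equals JSD
```

In this example `d_nb` is zero although the supports of P and Q are disjoint, whereas `d_opt` reaches the maximal value log 2.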
Definition 2. A family of pseudodivergences 𝒟 = {D_{α}:Π(𝒳) × Π(𝒳) → ℝ∣α ∈ [0, 1]} is ordered and complete with respect to Jensen–Shannon divergence if:
(D0) D_{α} is a pseudodivergence for all α ∈ [0, 1];
(D1) $\forall P,Q\in \Pi \left(\mathcal{X}\right):\forall 0\le {\alpha}_{1}<{\alpha}_{2}\le 1:{D}_{{\alpha}_{1}}\left(P,Q\right)\le {D}_{{\alpha}_{2}}\left(P,Q\right)$;
(D2) $\forall P,Q\in \Pi \left(\mathcal{X}\right):{D}_{1}\left(P,Q\right)=\mathrm{JSD}\left(P,Q\right)$.
There are numerous ways to construct a family of pseudodivergences that is complete and ordered w.r.t. JSD. In the context of Adversarial Optimization, we consider the following three methods. The simplest one is to define a nested family of models $\mathcal{M}=\left\{{M}_{\alpha}\subseteq \mathcal{F}\mid \alpha \in \left[0,1\right]\right\}$ (e.g., by changing the number of hidden units of a neural network), then use the pseudodivergence of Eq. (3) to form a desired family.
Alternatively, for a parameterized model M = {f(θ, ⋅)∣θ ∈ Θ}, one can use a regularization R(θ) to control the ‘capacity’ of the model: (4)${D}_{\alpha}\left(P,Q\right)=\log 2-L\left(f\left({\theta}^{\ast},\cdot \right),P,Q\right);$ ${\theta}^{\ast}={\mathrm{argmin}}_{\theta \in \Theta}L\left(f\left(\theta ,\cdot \right),P,Q\right)+c\left(1-\alpha \right)\cdot R\left(\theta \right);$ where c:[0, 1] → [0, + ∞) is a strictly increasing function and c(0) = 0.
The third, boosting-based method is applicable for a discrete approximation: (5)${D}_{c\left(i\right)}\left(P,Q\right)=\log 2-L\left({F}_{i},P,Q\right);$ ${F}_{i}={F}_{i-1}+\rho \cdot {\mathrm{argmin}}_{f\in B}L\left({F}_{i-1}+f,P,Q\right);$ ${F}_{0}\equiv \frac{1}{2};$ where ρ is the learning rate, B is the base estimator, and c:ℤ_{+} → [0, 1] is a strictly increasing function mapping ensemble size onto α ∈ [0, 1].
Although Definition 2 is quite general, in this paper, we focus on families of pseudodivergences produced in a manner similar to the examples above. All these examples introduce a classification algorithm parameterized by α, then define pseudodivergences D_{α} by substituting the optimal discriminator in Eq. (2) with the discriminator trained in accordance with this classification algorithm with the parameter α. Of course, one has to make sure that the resulting family of pseudodivergences is ordered and complete w.r.t. Jensen–Shannon divergence. The appendix provides formal definitions and proofs for the examples above.
With this class of pseudodivergences in mind, we refer to α as the capacity of the pseudodivergence ${D}_{\alpha}\in \mathcal{D}$ relative to the family $\mathcal{D}$, or simply as capacity if the family $\mathcal{D}$ is clear from the context. In the examples above, the capacity of a pseudodivergence is directly linked to the capacity of the underlying discriminator models: to the size of the model in Eq. (3), to the strength of the regularization in Eq. (4) (which, similar to the previous case, effectively restricts the size of the set of feasible models) or to the size of the ensemble for a boosting-based family of divergences in Eq. (5).
Finally, we introduce a function that combines a family of pseudodivergences into a single divergence.
Definition 3. If a family of pseudodivergences $\mathcal{D}=\left\{{D}_{\alpha}\mid \alpha \in \left[0,1\right]\right\}$ is ordered and complete with respect to Jensen–Shannon divergence, then the adaptive divergence ${\mathrm{AD}}_{\mathcal{D}}$ produced by $\mathcal{D}$ is defined as: (6)${\mathrm{AD}}_{\mathcal{D}}\left(P,Q\right)=\inf\left\{{D}_{\alpha}\left(P,Q\right)\mid {D}_{\alpha}\left(P,Q\right)\ge \left(1-\alpha \right)\log 2\right\}.$
We omit index in ${\mathrm{AD}}_{\mathcal{D}}$ when the family $\mathcal{D}$ is clear from the context or is not important.
A linear ‘threshold’ function τ(α) = 1 − α is used in the definition; however, it can be replaced by any strictly decreasing τ:[0, 1] → [0, 1] such that τ(0) = 1 and τ(1) = 0: (7)${\mathrm{AD}}_{\mathcal{D}}\left(P,Q\right)=\inf\left\{{D}_{\alpha}\left(P,Q\right)\mid {D}_{\alpha}\left(P,Q\right)\ge \tau \left(\alpha \right)\log 2\right\},$ but, since one can redefine the family $\mathcal{D}$ as ${\mathcal{D}}^{\prime}=\left\{{D}_{\tau \left(\alpha \right)}\mid \alpha \in \left[0,1\right]\right\}$, this effectively leads to the same definition. Nevertheless, it might be convenient in practice to use τ other than τ(α) = 1 − α, as most model families have a natural ordering, e.g., regularization strength.
The coefficient log 2 naturally arises as the maximal value of Jensen–Shannon divergence as well as an upper bound of any pseudodivergence based on Eq. (3) if the function f_{0}(x) = 1∕2 is included in the underlying classification model M. Since almost all popular models are capable of learning constant estimators, log 2 is included in the definition. Nevertheless, to adapt Definition 3 for exotic models or divergences other than Jensen–Shannon (e.g., Wasserstein distance), this coefficient (and, possibly, the ‘threshold’ function) should be reconsidered.
Note that due to property (D1), D_{α}(P, Q) is a non-decreasing function of α, while (1 − α)log 2 is a strictly decreasing one. Hence, if family $\mathcal{D}$ is such that for any two distributions P and Q, D_{α}(P, Q) is continuous w.r.t. α, Eq. (6) can be simplified: (8)${\mathrm{AD}}_{\mathcal{D}}\left(P,Q\right)={D}_{{\alpha}^{\ast}}\left(P,Q\right),$ where α^{∗} is the root of the following equation: (9)${D}_{\alpha}\left(P,Q\right)=\left(1-\alpha \right)\log 2.$ A general procedure for computing ${\mathrm{AD}}_{\mathcal{D}}$ for this case is outlined in Algorithm 1.
Intuitively, an adaptive divergence ${\mathrm{AD}}_{\mathcal{D}}$ switches between members of $\mathcal{D}$ depending on the ‘difficulty’ of separating P and Q. For example, consider family $\mathcal{D}$ produced by Eq. (4) with a high-capacity neural network as model M and l_{2} regularization R on its weights. For a pair of distant P and Q, even a highly regularized network is capable of achieving low cross-entropy loss and, therefore, ${\mathrm{AD}}_{\mathcal{D}}$ takes values of the pseudodivergence based on such a network. As distribution Q moves close to P, ${\mathrm{AD}}_{\mathcal{D}}$ lowers the regularization coefficient, effectively increasing the capacity of the underlying model.
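For the continuous case, the root of Eq. (9) can be located with any one-dimensional root finder, because the difference between the two sides is strictly increasing in α. The following sketch uses plain bisection and is not a reproduction of Algorithm 1; the function names and the toy family D_{α} = α · JSD (which satisfies (D0), (D1) and (D2)) are assumptions introduced only to make the example checkable in closed form.

```python
import math

LOG2 = math.log(2)

def adaptive_divergence(d, tol=1e-10):
    """Solve D_alpha = (1 - alpha) log 2 by bisection; return (AD, alpha*).

    `d` maps alpha in [0, 1] to D_alpha(P, Q) and is assumed to be
    continuous and non-decreasing in alpha.
    """
    def gap(alpha):
        # Strictly increasing in alpha; non-negative at alpha = 1.
        return d(alpha) - (1 - alpha) * LOG2

    lo, hi = 0.0, 1.0
    if gap(lo) >= 0:
        # The least powerful member already passes the threshold.
        return d(lo), lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) >= 0:
            hi = mid
        else:
            lo = mid
    return d(hi), hi

def toy_family(jsd_value):
    """Hypothetical family D_alpha = alpha * JSD, used for illustration only."""
    return lambda alpha: alpha * jsd_value

ad_value, alpha_star = adaptive_divergence(toy_family(0.3))
ad_equal, alpha_equal = adaptive_divergence(toy_family(0.0))
```

For the toy family, the root is α^{∗} = log 2∕(JSD + log 2), so the result can be compared against the closed form. In a realistic setting, each call to `d` would involve training a discriminator, which is why procedures that reuse a partially trained model are preferable to a naive root search.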
The idea behind adaptive divergences can be viewed from a different angle. Given two distributions P and Q, an adaptive divergence scans the producing family of pseudodivergences, starting from α = 0 (the least powerful pseudodivergence), and if some pseudodivergence reports a high enough value, it serves as a ‘proof’ of differences between P and Q. If all pseudodivergences from the family $\mathcal{D}$ report 0, then P and Q are equal almost everywhere, as the family always includes JSD as a member. Formally, this intuition can be expressed with the following theorem.
Theorem 1. If ${\mathrm{AD}}_{\mathcal{D}}$ is an adaptive divergence produced by a family of pseudodivergences $\mathcal{D}$ that is ordered and complete with respect to Jensen–Shannon divergence, then for any two distributions P and Q: JSD(P, Q) = 0 if and only if AD(P, Q) = 0.
A formal proof of Theorem 1 can be found in Appendix A2. Combined with the observation that AD(P, Q) ≥ 0 regardless of P and Q, the theorem states that AD is a divergence in the same sense as JSD. This, in turn, allows adaptive divergences to be used as a replacement for Jensen–Shannon divergence in Adversarial Optimization.
As can be seen from the definition, adaptive divergences are designed to utilize low-capacity pseudodivergences (with underlying low-capacity models) whenever possible: for a pair of distant P and Q one needs to train only a low-capacity model to estimate AD, using the most powerful model only to prove equality of distributions. As low-capacity models generally require fewer samples for training, AD allows an optimization algorithm to run for more iterations within the same time restrictions.
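The scanning view translates directly into a lazy evaluation of Eq. (6) over a finite grid of capacities: members are evaluated in order of increasing α, and the scan stops at the first one that passes the threshold (1 − α)log 2. The sketch below is a hedged illustration; the function `scan_family`, the grid of capacities and the numeric pseudodivergence values are all hypothetical.

```python
import math

LOG2 = math.log(2)

def scan_family(members):
    """Return (alpha, D_alpha, evaluations) for the first member that passes.

    `members` is a list of (alpha, estimator) pairs sorted by increasing
    alpha; each estimator is a zero-argument callable returning D_alpha(P, Q).
    The last member is assumed to have alpha = 1, so the scan always stops.
    """
    for evaluations, (alpha, estimator) in enumerate(members, start=1):
        value = estimator()  # in practice: train a model of this capacity
        if value >= (1 - alpha) * LOG2:
            return alpha, value, evaluations
    raise ValueError("the family must include alpha = 1")

def constant(value):
    """Stand-in for a trained discriminator reporting a fixed value."""
    return lambda: value

ALPHAS = (0.0, 0.5, 0.75, 1.0)

# Hypothetical, non-decreasing pseudodivergence values for three pairs (P, Q).
easy_pair = list(zip(ALPHAS, map(constant, (0.2, 0.6, 0.65, 0.69))))
hard_pair = list(zip(ALPHAS, map(constant, (0.0, 0.1, 0.4, 0.6))))
equal_pair = list(zip(ALPHAS, map(constant, (0.0, 0.0, 0.0, 0.0))))

easy_result = scan_family(easy_pair)
hard_result = scan_family(hard_pair)
equal_result = scan_family(equal_pair)
```

In this toy setting the easily separable pair is resolved after two evaluations, the harder pair after three, and only the pair of equal distributions requires the full-capacity member.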
Properties of ${\mathrm{AD}}_{\mathcal{D}}$ highly depend on the family $\mathcal{D}$, and the choice of the latter might either negatively or positively impact convergence of a particular optimization algorithm. Figure 1 demonstrates both cases: here, we evaluate JSD and four variants of ${\mathrm{AD}}_{\mathcal{D}}$ on two synthetic examples. In each example, the generator produces a rotated version of the ground-truth distribution and is parameterized by the angle of rotation (ground-truth distributions and examples of generator distributions are shown in Figs. 1A and 1D). In Figs. 1B and 1C, AD shows behavior similar to that of JSD (both being monotonic and maintaining a significant slope in the respective ranges). In Fig. 1E, both variants of AD introduce an additional local minimum: as the rotation angle approaches π∕2, marginal feature distributions become identical, which interferes with decision-tree-based algorithms (this is especially pronounced for AD with the logarithmic capacity function as it prioritizes low-capacity models). This behavior is expected to impact convergence of gradient-based algorithms negatively.
In Fig. 1F, by comparison, neural-network-based AD with l_{2} regularization stays monotonic in the range [0, π∕2] and keeps a noticeable positive slope, unlike the saturated JSD. The positive slope is expected to improve convergence of gradient-based algorithms and, possibly, some variants of Bayesian Optimization. Neural-network-based AD with dropout regularization, on the other hand, behaves in a manner similar to the adaptive divergences in Fig. 1E. The most likely explanation is that l_{2} regularization mostly changes the magnitude of the predictions without significantly affecting the decision surface and, therefore, largely replicates the behavior of JSD, while dropout effectively lowers the number of units in the network, which biases the decision surface towards a straight line (i.e., towards logistic regression).
Implementation
A general algorithm for computing an adaptive divergence is presented in Algorithm 1. This algorithm might be expensive, as it probes multiple pseudodivergences and, for each of these probes, a model generally needs to be trained from scratch. However, two of the most commonly used machine learning models, boosting-based methods (Friedman, 2001) and Neural Networks, allow for more efficient estimation algorithms due to the iterative nature of training procedures for such models.
Gradient boosted decision trees
Gradient Boosted Decision Trees (GBDT) (Friedman, 2001) and, more generally, boosting-based methods are ensemble methods and therefore intrinsically produce a family of pseudodivergences that is ordered and complete with respect to Jensen–Shannon divergence, in a manner similar to Eq. (5). This allows for an efficient AD estimation procedure shown in Algorithm 2. Here, the number of base estimators serves as the capacity of pseudodivergences, and the mapping to α ∈ [0, 1] is defined through an increasing capacity function c:ℤ_{+} → [0, 1].^{3}
In our experiments, for ensembles of maximal size N, we use the following capacity functions: (10)$\text{linear capacity: }c\left(i\right)={c}_{0}\frac{i}{N};$ (11)$\text{logarithmic capacity: }c\left(i\right)={c}_{0}\frac{\log\left(i+1\right)}{\log\left(N+1\right)}.$
Notice, however, that Eq. (5) defines a discrete variant of AD, which most certainly will result in a discontinuous function.^{4} This effect can be seen in Fig. 1E.
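Both capacity functions are straightforward to tabulate. The snippet below is a small illustration of Eqs. (10) and (11) as written; the function names are hypothetical, and the values N = 100 and c_{0} = 1∕4 are taken from the XOR-like experiment described later in the text.

```python
import math

def linear_capacity(i, n, c0):
    """Eq. (10): c(i) = c0 * i / N."""
    return c0 * i / n

def logarithmic_capacity(i, n, c0):
    """Eq. (11): c(i) = c0 * log(i + 1) / log(N + 1)."""
    return c0 * math.log(i + 1) / math.log(n + 1)

N, C0 = 100, 0.25

linear = [linear_capacity(i, N, C0) for i in range(N + 1)]
logarithmic = [logarithmic_capacity(i, N, C0) for i in range(N + 1)]
```

Both sequences start at 0 and end at c_{0}, but the logarithmic function assigns a larger α to small ensembles than the linear one, so the threshold (1 − α)log 2 is lower for them; this matches the remark that the logarithmic capacity function prioritizes low-capacity models.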
Neural networks
There are a number of ways to regulate the capacity of a neural network. One of the simplest options is to vary the total number of units in the network. This, however, would almost certainly result in a discontinuous adaptive divergence, similarly to Gradient Boosted Decision Trees (Fig. 1E), which is not ideal even for black-box optimization procedures.
In this work, we instead use the well-established dropout regularization (Srivastava et al., 2014). Effects of dropout are somewhat similar to varying the number of units in a network, but at the same time dropout offers a continuous parametrization: setting dropout probability p to 0 results in an unregularized network, while p = 1 effectively restricts the classifier to a constant output, and intermediate values of p produce models in between these extreme cases. To produce a family of pseudodivergences, we equip dropout regularization with a linear capacity function: c(α) = 1 − α, where α corresponds to dropout probability p.
Methods with explicit regularization terms can also be used to produce a family of pseudodivergences. In this work, we examine l_{2} regularization on network weights as one of the most widely used. In this case, a family of pseudodivergences is defined by equation Eq. (4) with a logarithmic capacity function: c(α) = − log(α).
The regularization methods mentioned above were selected primarily due to their simplicity and popularity in the field. Our experiments indicate that these methods perform well. Nevertheless, further studies are required to determine the best-performing regularization techniques.
In our experiments, we observe that unregularized networks require significantly more samples to be properly trained than regularized ones. To reduce discriminator variance, we suggest using an additional regularization r, the strength of which is independent of the capacity parameter α, e.g.: (12)${D}_{\alpha}\left(P,Q\right)=\log 2-L\left(f\left({\theta}^{\ast},\cdot \right),P,Q\right);$ ${\theta}^{\ast}={\mathrm{argmin}}_{\theta \in \Theta}L\left(f\left(\theta ,\cdot \right),P,Q\right)+c\left(1-\alpha \right)\cdot R\left(\theta \right)+r\left(\theta \right).$
In this work, following Louppe, Hermans & Cranmer (2017), we use gradient regularization r = R_{1} suggested by Mescheder, Geiger & Nowozin (2018). Note that such a family of pseudodivergences is no longer complete w.r.t. Jensen–Shannon divergence, i.e., D_{1} ≠ JSD. Nevertheless, D_{1} is still a proper divergence (Mescheder, Geiger & Nowozin, 2018) (which closely resembles JSD), and all results in this work hold with respect to such divergences, including main theorems and claims, i.e., the family defined above still produces a (generalized) variant of adaptive divergence.
The proposed procedures for estimating AD are outlined in Algorithms 3 and 4. As the chosen regularization methods result in families of pseudodivergences continuous w.r.t. α, the proposed algorithms employ Eq. (8), i.e., they vary the strength of the regularization depending on the current values of the cross-entropy. The values of the loss function are estimated with an exponential moving average over losses on mini-batches during iterations of Stochastic Gradient Descent, with the idea that, for slowly changing loss estimations and a small enough learning rate, network training should converge (Liu, Simonyan & Yang, 2018). We find that initializing the exponential moving average with log 2, which corresponds to absent regularization, works best.
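The link between the running loss estimate and the capacity follows from Eq. (9): substituting D_{α} = log 2 − L gives log 2 − L = (1 − α)log 2, hence α = L∕log 2. The sketch below illustrates only this bookkeeping and is not Algorithm 3 or 4. The class name, the momentum value and the reading of the capacity functions (dropout probability p = 1 − α, l_{2} coefficient −log α) are assumptions made for the example.

```python
import math

LOG2 = math.log(2)

class CapacityTracker:
    """Hypothetical helper: EMA of mini-batch losses and the implied capacity."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        # Initial estimate log 2 gives alpha = 1, i.e. no regularization.
        self.loss = LOG2

    def update(self, batch_loss):
        """Fold one mini-batch cross-entropy value into the moving average."""
        self.loss = self.momentum * self.loss + (1 - self.momentum) * batch_loss
        return self.loss

    @property
    def alpha(self):
        """Root of log 2 - L = (1 - alpha) log 2, clipped to [0, 1]."""
        return min(1.0, max(0.0, self.loss / LOG2))

    @property
    def dropout_probability(self):
        """Assumed reading of the linear capacity function: p = 1 - alpha."""
        return 1.0 - self.alpha

    @property
    def l2_coefficient(self):
        """Assumed reading of the logarithmic capacity function: -log(alpha)."""
        return math.inf if self.alpha == 0.0 else -math.log(self.alpha)

tracker = CapacityTracker(momentum=0.5)
initial = (tracker.alpha, tracker.dropout_probability, tracker.l2_coefficient)
for _ in range(3):
    tracker.update(0.0)  # an easily separable pair drives the loss down
after_easy_batches = (tracker.alpha, tracker.dropout_probability)
```

With this bookkeeping, a low cross-entropy (distant distributions) translates into a small α and strong regularization, while a loss close to log 2 switches the regularization off.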
Experiments
Adaptive divergence was designed to require fewer samples than its conventional counterparts. However, for practical purposes, it is meaningless to consider this quantity outside the context of optimization. To illustrate this claim, consider the following divergence: $\mathrm{ID}\left(P,Q\right)=\left\{\begin{array}{cc}0,\phantom{\rule{10.00002pt}{0ex}}\hfill & \text{if}P=Q\text{almost everywhere};\hfill \\ 1,\phantom{\rule{10.00002pt}{0ex}}\hfill & \text{otherwise}.\hfill \end{array}\right.$
Such divergence can be estimated in a manner similar to that of adaptive divergence: starting with a lowcapacity model, train the model to distinguish between P and Q, if the model reports any differences between distributions, return 1, otherwise increase the capacity of the model and repeat, until a sufficiently high capacity is reached, in which case return 0. In terms of the number of samples, ID is expected to be more efficient than AD; at the same time, ID is a textbook example of intrinsically hard optimization problem, rendering it useless for Adversarial Optimization. Therefore, we judge the performance of adaptive divergence only within an optimization procedure.
Note that adaptive divergence is not expected to improve the optimization surface; nevertheless, as Fig. 1 demonstrates, the improvement is seemingly present in some instances; however, our experiments show that it does not play any significant role (see Appendix A3 for details). In the cases, when degradation of the optimization surface takes place, global optimization procedures, such as Bayesian Optimization, are still expected to benefit from the usage of AD by being able to perform more steps within the same budget on the number of generator calls.
We compare adaptive divergence against JSD on three tasks,^{5} each task is presented by a parametrized generator, ’realworld’ samples are drawn from the same generator with some nominal parameters. Optimization algorithms are expected to converge to these nominal parameters.
We evaluate the performance of adaptive divergences with two blackbox optimization algorithms, namely Bayesian Optimization and Adversarial Variational Optimization. As computational resources spent by simulators are of our primary concern, we measure convergence of Adversarial Optimization with respect to the number of samples generated by the simulation, which is expected to be roughly proportional to the total time in case of computationally heavy simulations. We chose to neglect the time spent on training models as the proposed methods are intended for simulations that are significantly more computationally intensive than training of any model with a reasonable capacity, for example, running ATLAS simulation (The ATLAS Collaboration, 2010) for the same number of times as budgets in our experiments would require several years on a singlecore CPU.
To measure the number of samples required to estimate a divergence, we search for the minimal number of samples such that the difference between train and validation losses is within 10^{−2} for Gradient Boosted Decision Trees and 5⋅10^{−2} for Neural Networks.^{6} As a significant number of samples is involved in loss estimation, for simplicity, we use point estimates of the losses. For GBDT, we utilize a bisection root-finding routine to reduce the time spent on retraining classifiers; however, for more computationally expensive simulators, it is advisable to gradually increase the size of the training set until the criterion is met.
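The gradual-increase variant of this stopping criterion can be sketched as below. This is a minimal illustration, not the bisection routine used in our experiments: the function name, the geometric growth schedule, and the `losses_at` callback (which is assumed to train a classifier on `n` generator samples and return its train and validation losses) are all hypothetical.

```python
from typing import Callable, Tuple


def minimal_sample_size(
    losses_at: Callable[[int], Tuple[float, float]],
    tolerance: float = 1e-2,
    start: int = 64,
    growth: float = 2.0,
    max_samples: int = 1 << 20,
) -> int:
    """Grow the training set until train and validation losses agree.

    Returns the first size in the (assumed) geometric schedule for which
    |validation loss - train loss| <= tolerance. This is an upper bound on
    the true minimal size, since sizes between schedule steps are skipped.
    """
    n = start
    while n <= max_samples:
        train_loss, validation_loss = losses_at(n)
        if abs(validation_loss - train_loss) <= tolerance:
            return n
        # Always advance by at least one sample, even for growth close to 1.
        n = max(n + 1, int(n * growth))
    raise RuntimeError("criterion not met within the sample budget")
```

With an expensive simulator, samples drawn for a smaller `n` would normally be reused at the next step, so the total number of generator calls equals the final training set size rather than the sum over all steps.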
For each experiment, we report convergence plots: the Euclidean distance from the current guess to the nominal parameters as a function of the number of examples generated by the simulator. As the performance of Bayesian Optimization is influenced by the choice of the initial points (in our experiments, 5 points uniformly drawn from the search space), each experiment involving Bayesian Optimization is repeated 100 times, and aggregated results are reported. Similarly, experiments with Variational Optimization are repeated 20 times each.
XOR-like synthetic data
This task repeats one of the synthetic examples presented in Fig. 1D: the ground-truth distribution is an equal mixture of two Gaussian distributions, and the generator produces a rotated version of the ground-truth distribution, with the angle of rotation being the single parameter of the generator. The main goal of this example is to demonstrate that, despite significant changes in the shape of the divergence, global optimization algorithms, like Bayesian Optimization, can still benefit from the fast estimation procedures offered by adaptive divergences.
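A generator of this kind can be sketched as follows. The component centers (`±offset` along the first axis) and the standard deviation `std` are illustrative assumptions and not the values used in Fig. 1D; only the structure (an equal two-component Gaussian mixture rotated by a single angle parameter) follows the description above.

```python
import math
import random
from typing import List, Tuple


def sample_rotated_mixture(
    n: int,
    angle: float,
    seed: int = 0,
    offset: float = 1.0,  # assumed distance of each component from the origin
    std: float = 0.5,     # assumed isotropic standard deviation
) -> List[Tuple[float, float]]:
    """Draw n points from an equal two-Gaussian mixture rotated by `angle`."""
    rng = random.Random(seed)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    points = []
    for _ in range(n):
        # Pick one of the two components with equal probability.
        sign = 1.0 if rng.random() < 0.5 else -1.0
        x = rng.gauss(sign * offset, std)
        y = rng.gauss(0.0, std)
        # Rotate the sampled point around the origin.
        points.append((cos_a * x - sin_a * y, sin_a * x + cos_a * y))
    return points


# 'Real-world' data corresponds to the nominal angle; the optimizer
# varies `angle` to match it.
ground_truth = sample_rotated_mixture(1000, angle=0.0, seed=1)
candidate = sample_rotated_mixture(1000, angle=math.pi / 4, seed=2)
```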
For this task, we use an adaptive divergence based on Gradient Boosted Decision Trees (100 trees with a maximal depth of 3) with linear and logarithmic capacity functions given by Eqs. (10) and (11) and c_{0} = 1∕4. Gaussian Process Bayesian Optimization with a Matern kernel (ν = 3∕2 and scaling from [10^{−3}, 10^{3}] automatically adjusted by Maximum Likelihood fit) is employed as the optimizer.
Convergence of the considered divergences is shown in Fig. 2. As can be seen from the results, adaptive divergences tend to request fewer generator calls per estimation, and, given the same budget, both variants of adaptive divergence converge on parameters around an order of magnitude closer to the optimum than traditional JSD. Notice that the initial rapid progress slows as the optimizer approaches the optimum, and the slope of the curves becomes similar to that of JSD: this can be explained by AD approaching JSD as the probed distributions become less distinguishable from the ground-truth one.
Pythia hyperparameter tuning
This task was introduced by Ilten, Williams & Yang (2017) and involves tuning hyperparameters of the Pythia event generator, a high-energy particle collision simulation used at CERN. For this task, electron-positron collisions are simulated at a center-of-mass energy of 91.2 GeV. As the initial electron and positron collide and annihilate, new particles are created, some of which are unstable and might decay into more stable particles. A collision event is described by the properties of the final (stable) products. This process is intrinsically stochastic (due to the laws of physics) and covers a large space of possible outcomes; moreover, even with relatively large changes in the generator's hyperparameters, outcome distributions overlap significantly, which makes it an excellent example for adversarial optimization. The nominal parameters of the Pythia event generator are set to the values of the Monash tune (Skands, Carrazza & Rojo, 2014).
In the work by Ilten, Williams & Yang (2017), various physics-motivated statistics of events are used as observables,^{7} with a total of more than 400 features. The same statistics were originally used to obtain the Monash tune. For the purposes of the experiment, we consider one hyperparameter, namely alphaSValue, with the nominal value of 0.1365 and search range [0.06, 0.25].
We repeat the settings of the experiment^{8} described by Ilten, Williams & Yang (2017). We employ Gradient Boosting over Oblivious Decision Trees (CatBoost implementation by Prokhorenkova et al., 2018) with 100 trees of depth 3 and other parameters set to their default values. We use Gaussian Process Bayesian Optimization with a Matern kernel (ν = 3∕2 and scaling from [10^{−3}, 10^{3}] automatically adjusted by Maximum Likelihood fit) as the optimizer. A comparison of the unmodified Jensen–Shannon divergence with adaptive divergences with linear and logarithmic capacity functions (defined by Eqs. (10) and (11) and c_{0} = 1∕4) is presented in Fig. 3.
Results, shown in Fig. 3, indicate that, given the same budget, Bayesian Optimization over adaptive divergences yields solutions about an order of magnitude closer to the nominal value than Jensen–Shannon divergence. This acceleration can be attributed to the proposed estimation procedures, which require far fewer generator calls than JSD. Additionally, notice that the slope of the convergence curves for AD gradually approaches that of JSD as the proposal distributions become closer to the ground-truth one.
Pythia alignment
In order to test the performance of adaptive divergences with Adversarial Variational Optimization, we repeat the Pythia-alignment experiment suggested by Louppe, Hermans & Cranmer (2017). The settings of this experiment are similar to the previous one. In this experiment, however, instead of collecting physics-motivated statistics, we consider a simplified detector simulation, represented by a 32 × 32 spherical grid with cells uniformly distributed in pseudorapidity ν ∈ [ − 5, 5] and azimuthal angle ϕ ∈ [ − π, π] space. Each cell of the detector records the energy of particles passing through it. The detector has 3 parameters: the x-, y- and z-offsets of the detector center relative to the collision point, where the z-axis is placed along the beam axis. The nominal offsets are zero, and the initial guess is (0.75, 0.75, 0.75). Figure 4 shows averaged detector responses for the example configurations and samples from each of these configurations.
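The binning performed by such a detector can be illustrated with the sketch below. It covers only the accumulation of particle energies on the 32 × 32 grid; the effect of the detector offsets on the particle coordinates is deliberately omitted, and the particle representation (a tuple of pseudorapidity, azimuthal angle and energy) as well as all function names are assumptions made for this example.

```python
import math
from typing import Iterable, List, Optional, Tuple

GRID = 32
PSEUDORAPIDITY_RANGE = (-5.0, 5.0)
AZIMUTH_RANGE = (-math.pi, math.pi)


def cell_index(value: float, low: float, high: float, bins: int = GRID) -> Optional[int]:
    """Map a coordinate to a uniform bin index, or None if it is out of range."""
    if not (low <= value <= high):
        return None
    index = int((value - low) / (high - low) * bins)
    # The upper edge belongs to the last bin.
    return min(index, bins - 1)


def detector_response(
    particles: Iterable[Tuple[float, float, float]],
) -> List[List[float]]:
    """Sum particle energies per cell; particles outside the grid are ignored."""
    image = [[0.0] * GRID for _ in range(GRID)]
    for pseudorapidity, azimuth, energy in particles:
        i = cell_index(pseudorapidity, *PSEUDORAPIDITY_RANGE)
        j = cell_index(azimuth, *AZIMUTH_RANGE)
        if i is None or j is None:
            continue
        image[i][j] += energy
    return image
```

Each event then yields a 32 × 32 array of energies, which is the input seen by the discriminator.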
For this task, a one-hidden-layer Neural Network with 32 hidden units and ReLU activation function is employed. R_{1} regularization, proposed by Mescheder, Geiger & Nowozin (2018), with a coefficient of 10, is used for both the proposed divergences and the baseline. The Adam optimization algorithm (Kingma & Ba, 2014) with learning rate 10^{−2} is used to perform updates of the search distribution. We compare the performance of two variants of adaptive divergence (dropout and l_{2} regularization) described in ‘Implementation’.
Results are shown in Fig. 5. Adaptive divergences require considerably fewer samples for their estimation than the baseline divergence with only R_{1} regularization, which, given the same budget, allows both variants of adaptive divergence to accelerate Adversarial Optimization significantly. Note that the acceleration is even more pronounced in comparison to JSD estimated by an unregularized network: in our experiments, to achieve the set level of agreement between train and test losses, the unregularized network often requires more samples than the entire budget.
Discussion
To the best of the authors' knowledge, this work is the first that explicitly addresses the computational costs of Adversarial Optimization for expensive generators. Interestingly, several recent developments, like Progressive GAN (Karras et al., 2017) and ChainGAN (Hossain et al., 2018), use multiple discriminators of increasing capacity; however, this is done mainly to compensate for the growing capacity of the generators and, probably, not to reduce computational costs.
Several recent papers propose improving the stability of Adversarial Optimization by employing divergences other than Jensen–Shannon (Gulrajani et al., 2017; Arjovsky, Chintala & Bottou, 2017; Bellemare et al., 2017). Note that all results in this paper also hold for any divergence that can be formulated as an optimization problem, including Wasserstein (Arjovsky, Chintala & Bottou, 2017) and Cramer (Bellemare et al., 2017) distances. This can be demonstrated by adjusting Definition 2 and repeating the proof of Theorem 4 for the new divergence; the presented algorithms also require only minor adjustments.
Multiple works introduce regularization (Sønderby et al., 2016; Arjovsky, Chintala & Bottou, 2017; Roth et al., 2017; Kodali et al., 2017; Mescheder, Geiger & Nowozin, 2018) for improving the stability and convergence of Adversarial Optimization. Most of the standard regularization methods can be used to regulate model capacity in adaptive divergences. Also, one can use these regularization methods in addition to adaptive divergence, as any discriminator-based regularization effectively produces a new type of divergence. The Pythia-alignment experiment (‘Pythia alignment’) demonstrates this clearly: there, we use R_{1} regularization with a constant coefficient in addition to varying-strength dropout and l_{2} regularization.
As we discussed in ‘Adaptive Divergence’, the properties of adaptive divergences depend strongly on the underlying families of pseudo-divergences; the impact of various regularization schemes is a subject of future research.
Conclusion
In this work, we introduce adaptive divergences, a family of divergences meant as an alternative to Jensen–Shannon divergence for Adversarial Optimization. Adaptive divergences generally require smaller sample sizes for estimation, which allows for a significant acceleration of Adversarial Optimization algorithms. These benefits were demonstrated on two fine-tuning problems involving the Pythia event generator and two of the most popular black-box optimization algorithms: Bayesian Optimization and Variational Optimization. Experiments show that, given the same budget, adaptive divergences yield results up to an order of magnitude closer to the optimum than Jensen–Shannon divergence. Note that, while we consider physics-related simulations, adaptive divergences can be applied to any stochastic simulation.
Theoretical results presented in this work also hold for divergences other than Jensen–Shannon divergence.