Generative adversarial nets


Original paper:

GB/T 7714 Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C]//Advances in neural information processing systems. 2014: 2672-2680.

MLA Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.

APA Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).

Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to $\frac{1}{2}$ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1 Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 20]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [17, 8, 9] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

This framework can yield specific training algorithms for many kinds of model and optimization algorithm. In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [16] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.

2 Related work

Until recently, most work on deep generative models focused on models that provided a parametric specification of a probability distribution function. The model can then be trained by maximizing the log likelihood. In this family of models, perhaps the most successful is the deep Boltzmann machine [25]. Such models generally have intractable likelihood functions and therefore require numerous approximations to the likelihood gradient. These difficulties motivated the development of “generative machines”, models that do not explicitly represent the likelihood, yet are able to generate samples from the desired distribution. Generative stochastic networks [4] are an example of a generative machine that can be trained with exact backpropagation rather than the numerous approximations required for Boltzmann machines. This work extends the idea of a generative machine by eliminating the Markov chains used in generative stochastic networks.

Our work backpropagates derivatives through generative processes by using the observation that

\lim_{\sigma\rightarrow 0}\nabla_{x}\mathbb{E}_{\epsilon\sim\mathcal{N}(0,\sigma^{2}I)}f(x+\epsilon)=\nabla_{x}f(x).

We were unaware at the time we developed this work that Kingma and Welling [18] and Rezende et al. [23] had developed more general stochastic backpropagation rules, allowing one to backpropagate through Gaussian distributions with finite variance, and to backpropagate to the covariance parameter as well as the mean. These backpropagation rules could allow one to learn the conditional variance of the generator, which we treated as a hyperparameter in this work. Kingma and Welling [18] and Rezende et al. [23] use stochastic backpropagation to train variational autoencoders (VAEs). Like generative adversarial networks, variational autoencoders pair a differentiable generator network with a second neural network. Unlike generative adversarial networks, the second network in a VAE is a recognition model that performs approximate inference. GANs require differentiation through the visible units, and thus cannot model discrete data, while VAEs require differentiation through the hidden units, and thus cannot have discrete latent variables. Other VAE-like approaches exist [12, 22] but are less closely related to our method.

Previous work has also taken the approach of using a discriminative criterion to train a generative model [29, 13]. These approaches use criteria that are intractable for deep generative models. These methods are difficult even to approximate for deep models because they involve ratios of probabilities which cannot be approximated using variational approximations that lower bound the probability. Noise-contrastive estimation (NCE) [13] involves training a generative model by learning the weights that make the model useful for discriminating data from a fixed noise distribution. Using a previously trained model as the noise distribution allows training a sequence of models of increasing quality. This can be seen as an informal competition mechanism similar in spirit to the formal competition used in the adversarial networks game. The key limitation of NCE is that its “discriminator” is defined by the ratio of the probability densities of the noise distribution and the model distribution, and thus requires the ability to evaluate and backpropagate through both densities.

Some previous work has used the general concept of having two neural networks compete. The most relevant work is predictability minimization [26]. In predictability minimization, each hidden unit in a neural network is trained to be different from the output of a second network, which predicts the value of that hidden unit given the value of all of the other hidden units. This work differs from predictability minimization in three important ways: 1) in this work, the competition between the networks is the sole training criterion, and is sufficient on its own to train the network. Predictability minimization is only a regularizer that encourages the hidden units of a neural network to be statistically independent while they accomplish some other task; it is not a primary training criterion. 2) The nature of the competition is different. In predictability minimization, two networks’ outputs are compared, with one network trying to make the outputs similar and the other trying to make the outputs different. The output in question is a single scalar. In GANs, one network produces a rich, high dimensional vector that is used as the input to another network, and attempts to choose an input that the other network does not know how to process. 3) The specification of the learning process is different. Predictability minimization is described as an optimization problem with an objective function to be minimized, and learning approaches the minimum of the objective function. GANs are based on a minimax game rather than an optimization problem, and have a value function that one agent seeks to maximize and the other seeks to minimize. The game terminates at a saddle point that is a minimum with respect to one player’s strategy and a maximum with respect to the other player’s strategy.

Generative adversarial networks have sometimes been confused with the related concept of “adversarial examples” [28]. Adversarial examples are examples found by using gradient-based optimization directly on the input to a classification network, in order to find examples that are similar to the data yet misclassified. This is different from the present work because adversarial examples are not a mechanism for training a generative model. Instead, adversarial examples are primarily an analysis tool for showing that neural networks behave in intriguing ways, often confidently classifying two images differently even though the difference between them is imperceptible to a human observer. The existence of such adversarial examples does suggest that generative adversarial network training could be inefficient, because they show that it is possible to make modern discriminative networks confidently recognize a class without emulating any of the human-perceptible attributes of that class.

3 Adversarial nets

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator’s distribution pg over data x, we define a prior on input noise variables pz(z), then represent a mapping to data space as G(z; θg), where G is a differentiable function represented by a multilayer perceptron with parameters θg. We also define a second multilayer perceptron D(x; θd) that outputs a single scalar. D(x) represents the probability that x came from the data rather than pg. We train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 − D(G(z))). In other words, D and G play the following two-player minimax game with value function V (G, D):

\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))].\tag{1}
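The value function in Eq. 1 can be estimated from samples by replacing each expectation with a sample mean. The sketch below is only an illustration: the 1-D Gaussian stand-in for pdata, the shifting generator `G`, and the constant discriminator `D_uninformed` are assumptions made for the example, not choices taken from the paper.

```python
import math
import random

def value_function(D, G, data_samples, noise_samples):
    """Monte Carlo estimate of V(D, G) from Eq. 1."""
    real_term = sum(math.log(D(x)) for x in data_samples) / len(data_samples)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

# Toy 1-D setup (every choice here is an illustrative assumption).
rng = random.Random(0)
data_samples = [rng.gauss(4.0, 1.0) for _ in range(1000)]   # stand-in for p_data
noise_samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # z ~ p_z

def G(z):
    return z + 1.0  # hypothetical generator: shifts the noise

def D_uninformed(x):
    return 0.5  # a discriminator that cannot tell the two sources apart

v_half = value_function(D_uninformed, G, data_samples, noise_samples)
print(v_half, -math.log(4.0))
```

A discriminator that outputs 1/2 everywhere gives log(1/2) + log(1/2) = −log 4 regardless of the samples, which is the value the theory in Section 4 assigns to the equilibrium.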

In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. The procedure is formally presented in Algorithm 1.
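The alternating schedule described above (k steps on D, then one step on G) can be sketched on a toy problem. This is not the paper's Algorithm 1 verbatim: the 1-D Gaussian data, the one-parameter generator G(z) = z + b, the logistic discriminator D(x) = sigmoid(w·x + c), plain gradient steps without momentum, and all hyperparameter values are assumptions chosen to keep the example self-contained.

```python
import math
import random

def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train(iterations=200, k=1, batch=32, lr=0.05, seed=0):
    """Alternate k ascent steps on D with one descent step on G.

    Toy assumptions: p_data = N(4, 1), p_z = N(0, 1),
    G(z) = z + b and D(x) = sigmoid(w * x + c).
    """
    rng = random.Random(seed)
    w, c, b = 0.1, 0.0, 0.0
    d_updates = g_updates = 0
    for _ in range(iterations):
        for _ in range(k):
            xs = [rng.gauss(4.0, 1.0) for _ in range(batch)]
            gs = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
            # Gradient of mean[log D(x) + log(1 - D(G(z)))] w.r.t. (w, c).
            gw = gc = 0.0
            for x, g in zip(xs, gs):
                dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
                gw += (1.0 - dx) * x - dg * g
                gc += (1.0 - dx) - dg
            w += lr * gw / batch   # ascend: D maximizes V
            c += lr * gc / batch
            d_updates += 1
        gs = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        # Gradient of mean[log(1 - D(G(z)))] w.r.t. b is mean[-D(G(z)) * w].
        gb = sum(-sigmoid(w * g + c) * w for g in gs) / batch
        b -= lr * gb               # descend: G minimizes V
        g_updates += 1
    return {"w": w, "c": c, "b": b, "d_updates": d_updates, "g_updates": g_updates}

result = train(iterations=50, k=2)
print(result)
```

The sketch only shows the update schedule; a model this small is not expected to reproduce the convergence behaviour discussed in Section 4.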

In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 − D(G(z))) saturates. Rather than training G to minimize log(1 − D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
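The saturation argument can be made concrete by comparing the two generator objectives through their derivatives. The sketch assumes that the discriminator ends in a logistic unit, so that D(G(z)) = sigmoid(a) for some pre-activation a; this parameterization and the function names are assumptions for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da log(sigmoid(a)) = 1 - sigmoid(a)
    return 1.0 - sigmoid(a)

# Columns: a, D(G(z)), |gradient| of log(1 - D), |gradient| of log D.
for a in (-10.0, -2.0, 0.0, 2.0):
    print(a, sigmoid(a), abs(grad_saturating(a)), abs(grad_non_saturating(a)))
```

When D rejects a sample confidently (a strongly negative), the derivative of log(1 − D) is close to zero while the derivative of log D stays close to one, which is why the second objective gives G a usable signal early in training.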

4 Theoretical Results

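The generator G implicitly defines a probability distribution pg as the distribution of the samples G(z) obtained when z ∼ pz. Therefore, we would like Algorithm 1 to converge to a good estimator of pdata, if given enough capacity and training time. The results of this section are done in a non-parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.

We will show in section 4.1 that this minimax game has a global optimum for pg = pdata. We will then show in section 4.2 that Algorithm 1 optimizes Eq. 1, thus obtaining the desired result.

4.1 Global Optimality of pg = pdata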

We first consider the optimal discriminator D for any given generator G.

Proposition 1. For G fixed, the optimal discriminator D is

D_{G}^{\star}(x)=\frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}.\tag{2}

Proof. The training criterion for the discriminator D, given any generator G, is to maximize the quantity V (G, D)

V(G,D)=\int_{x}p_{data}(x)\log(D(x))\,dx+\int_{z}p_{z}(z)\log(1-D(g(z)))\,dz\\ =\int_{x}p_{data}(x)\log(D(x))+p_{g}(x)\log(1-D(x))\,dx\tag{3}

For any (a, b) ∈ R² \ {(0, 0)}, the function y → a log(y) + b log(1 − y) achieves its maximum in [0, 1] at a/(a + b). The discriminator does not need to be defined outside of Supp(pdata) ∪ Supp(pg), concluding the proof.
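Proposition 1 can be checked numerically at a single point x. Writing a = pdata(x), b = pg(x) and y = D(x), the integrand of Eq. 3 is a log(y) + b log(1 − y), and Eq. 2 claims it is maximized at y = a/(a + b). The values of `a` and `b` and the grid resolution below are arbitrary assumptions.

```python
import math

def pointwise_objective(y, a, b):
    # Integrand of Eq. 3 at one x, with a = p_data(x), b = p_g(x), y = D(x).
    return a * math.log(y) + b * math.log(1.0 - y)

def best_on_grid(a, b, steps=1000):
    grid = [i / steps for i in range(1, steps)]  # open interval (0, 1)
    return max(grid, key=lambda y: pointwise_objective(y, a, b))

a, b = 0.3, 0.1          # hypothetical density values at some point x
y_star = best_on_grid(a, b)
print(y_star, a / (a + b))
```

A grid search is a crude substitute for the calculus argument, but it makes the claim easy to probe for other values of `a` and `b`.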

With D fixed at its optimum D*_G, the minimax game in Eq. 1 can be reformulated as:

C(G)=\max_{D}V(G,D)\\ =\mathbb{E}_{x\sim p_{data}}[\log D_{G}^{\star}(x)]+\mathbb{E}_{z\sim p_{z}}[\log(1-D_{G}^{\star}(G(z)))]\\ =\mathbb{E}_{x\sim p_{data}}[\log D_{G}^{\star}(x)]+\mathbb{E}_{x\sim p_{g}}[\log(1-D_{G}^{\star}(x))]\\ =\mathbb{E}_{x\sim p_{data}}\left[\log\frac{p_{data}(x)}{p_{data}(x)+p_{g}(x)}\right]+\mathbb{E}_{x\sim p_{g}}\left[\log\frac{p_{g}(x)}{p_{data}(x)+p_{g}(x)}\right]\tag{4}

Theorem 1. The global minimum of the virtual training criterion C(G) is achieved if and only if pg = pdata. At that point, C(G) achieves the value − log 4.

Proof. For pg = pdata, D*_G(x) = 1/2 (consider Eq. 2). Hence, by inspecting Eq. 4 at D*_G(x) = 1/2, we find C(G) = log(1/2) + log(1/2) = −log 4. To see that this is the best possible value of C(G), reached only for pg = pdata, observe that

\mathbb{E}_{x\sim p_{data}}[-\log 2]+\mathbb{E}_{x\sim p_{g}}[-\log 2]=-\log 4.\tag{4.1}

Subtracting this expression from C(G) = V(G, D*_G), we obtain:

C(G)=-\log(4)+KL\left(p_{data}\,\Big\Vert\,\frac{p_{data}+p_{g}}{2}\right)+KL\left(p_{g}\,\Big\Vert\,\frac{p_{data}+p_{g}}{2}\right)\tag{5}

where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model’s distribution and the data generating process:

C(G)=-\log(4)+2\cdot JSD(p_{data}\Vert p_{g})\tag{6}

Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero iff they are equal, we have shown that C* = −log(4) is the global minimum of C(G) and that the only solution is pg = pdata, i.e., the generative model perfectly replicating the data distribution.
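The identity between Eq. 4 and Eq. 6 can be verified on small discrete distributions, where the expectations become finite sums. The two three-outcome distributions below are made-up examples, and the helper names are assumptions for this sketch.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_direct(p_data, p_g):
    # Eq. 4 with the optimal discriminator of Eq. 2 plugged in.
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd > 0:
            total += pd * math.log(pd / (pd + pg))
        if pg > 0:
            total += pg * math.log(pg / (pd + pg))
    return total

def c_via_jsd(p_data, p_g):
    # Eq. 6: C(G) = -log 4 + 2 * JSD(p_data || p_g)
    return -math.log(4.0) + 2.0 * jsd(p_data, p_g)

p_data = [0.5, 0.3, 0.2]   # hypothetical discrete data distribution
p_g = [0.2, 0.3, 0.5]      # hypothetical generator distribution
print(c_direct(p_data, p_g), c_via_jsd(p_data, p_g))
print(c_direct(p_data, p_data), -math.log(4.0))
```

The first printed pair should agree up to rounding, and the second pair shows the minimum −log 4 being attained when the generator distribution equals the data distribution.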

4.2 Convergence of Algorithm 1

Proposition 2. If G and D have enough capacity, and at each step of Algorithm 1, the discriminator is allowed to reach its optimum given G, and pg is updated so as to improve the criterion

\mathbb{E}_{x\sim p_{data}}[\log D_{G}^{\star}(x)]+\mathbb{E}_{x\sim p_{g}}[\log(1-D_{G}^{\star}(x))]\tag{6.1}

then pg converges to pdata.

In practice, adversarial nets represent a limited family of pg distributions via the function G(z; θg), and we optimize θg rather than pg itself, so the proofs do not apply. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.

Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold. On TFD, σ was cross-validated on each fold and the mean log-likelihood on each fold was computed. For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.

5 Experiments

We trained adversarial nets on a range of datasets including MNIST [21], the Toronto Face Database (TFD) [27], and CIFAR-10 [19]. The generator nets used a mixture of rectifier linear activations [17, 8] and sigmoid activations, while the discriminator net used maxout [9] activations. Dropout [16] was applied in training the discriminator net. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.

We estimate probability of the test set data under pg by fitting a Gaussian Parzen window to the samples generated with G and reporting the log-likelihood under this distribution. The σ parameter of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [7] and used for various generative models for which the exact likelihood is not tractable [24, 3, 4]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models. In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.
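The Parzen window estimate described above can be sketched as follows: an isotropic Gaussian of width σ is centred on every generated sample, and the log-density of each held-out point under the resulting mixture is averaged. This is a minimal illustration and not the evaluation code used for Table 1; the tiny 2-D point sets and the σ values are placeholders, and in the procedure described above σ would be chosen on a validation set.

```python
import math

def log_sum_exp(values):
    # Stable log(sum(exp(v))) to avoid underflow for small sigma.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def parzen_log_density(x, samples, sigma):
    # log of (1/n) * sum_i N(x; s_i, sigma^2 I)
    d = len(x)
    exponents = [
        -sum((xj - sj) ** 2 for xj, sj in zip(x, s)) / (2.0 * sigma ** 2)
        for s in samples
    ]
    log_norm = math.log(len(samples)) + 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    return log_sum_exp(exponents) - log_norm

def mean_log_likelihood(test_points, samples, sigma):
    total = sum(parzen_log_density(x, samples, sigma) for x in test_points)
    return total / len(test_points)

generated = [[0.0, 0.0], [1.0, 1.0], [0.5, -0.5]]   # stand-in for samples from G
held_out = [[0.2, 0.1], [0.9, 0.8]]                 # stand-in for test data
for sigma in (0.1, 0.5, 1.0):
    print(sigma, mean_log_likelihood(held_out, generated, sigma))
```

The estimate depends strongly on σ, which is one way to see why the text calls this method high-variance.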

6 Advantages and disadvantages

This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of pg(x), and that D must be synchronized well with G during training (in particular, G must not be trained too much without updating D, in order to avoid “the Helvetica scenario” in which G collapses too many values of z to the same value of x to have enough diversity to model pdata), much as the negative chains of a Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes the comparison of generative adversarial nets with other generative modeling approaches.

The aforementioned advantages are primarily computational. Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator’s parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.

7 Conclusions and future work

Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator)

This framework admits many straightforward extensions:

  1. A conditional generative model p(x | c) can be obtained by adding c as input to both G and D.

  2. Learned approximate inference can be performed by training an auxiliary network to predict z given x. This is similar to the inference net trained by the wake-sleep algorithm [15] but with the advantage that the inference net may be trained for a fixed generator net after the generator net has finished training.

  3. One can approximately model all conditionals p(x_S | x_{\not S}) where S is a subset of the indices of x by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM [10].

  4. Semi-supervised learning: features from the discriminator or inference net could improve performance of classifiers when limited labeled data is available.

  5. Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating G and D or determining better distributions to sample z from during training.

This paper has demonstrated the viability of the adversarial modeling framework, suggesting that these research directions could prove useful.

Acknowledgments

We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window evaluation code with us. We would like to thank the developers of Pylearn2 [11] and Theano [6, 1], particularly Frédéric Bastien who rushed a Theano feature specifically to benefit this project. Arnaud Bergeron provided much-needed support with LaTeX typesetting. We would also like to thank CIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Québec for providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.

References

[1] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

[2] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.

[3] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In ICML’13.

[4] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable by backprop. In ICML’14.

[5] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic networks trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning (ICML’14).

[6] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.

[7] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.

[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS’2011.

[9] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks. In ICML’2013.

[10] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS’2013.

[11] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

[12] Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In ICML’2014.

[13] Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10).

[14] Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97.

[15] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.

[16] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.

[17] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153. IEEE.

[18] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

[19] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.

[20] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS’2012.

[21] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

[22] Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. Technical report, arXiv preprint arXiv:1402.0030.

[23] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. Technical report, arXiv:1401.4082.

[24] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML’12.

[25] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS’2009, pages 448–455.

[26] Schmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation, 4(6), 863–879.

[27] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto.

[28] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014). Intriguing properties of neural networks. ICLR, abs/1312.6199.

[29] Tu, Z. (2007). Learning generative models via discriminative approaches. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE.

Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue, dashed line) so that it discriminates between samples from the data generating distribution $p_x$ (black, dotted line) and those of the generative distribution $p_g$ ($G$) (green, solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $x$. The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on transformed samples. $G$ contracts in regions of high density and expands in regions of low density of $p_g$. (a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{data}$ and $D$ is a partially accurate classifier. (b) In the inner loop of the algorithm, $D$ is trained to discriminate samples from data, converging to $D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$. (c) After an update to $G$, the gradient of $D$ has guided $G(z)$ to flow to regions that are more likely to be classified as data. (d) After several steps of training, if $G$ and $D$ have enough capacity, they will reach a point at which neither can improve because $p_g = p_{data}$. The discriminator is unable to differentiate between the two distributions, i.e. $D(x) = \frac{1}{2}$.
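Panels (b) and (d) of the Figure 1 caption can be made concrete with a small numerical sketch. The two Gaussian densities below are hypothetical stand-ins chosen only for illustration (they are not the distributions plotted in the figure); the function `optimal_discriminator` simply evaluates $D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def optimal_discriminator(x, p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for a fixed generator."""
    d, g = p_data(x), p_g(x)
    return d / (d + g)


# Hypothetical densities, assumed only for this illustration.
def p_data(x):
    return gaussian_pdf(x, 0.0, 1.0)


def p_g_early(x):
    # Generator distribution that is still shifted away from the data.
    return gaussian_pdf(x, 1.0, 1.0)


# Panel (d): the generator has matched the data distribution.
p_g_final = p_data

for x in (-1.0, 0.5, 2.0):
    print(x,
          optimal_discriminator(x, p_data, p_g_early),
          optimal_discriminator(x, p_data, p_g_final))
```

While the generator is off target, $D^{*}$ sits above one half where the data density dominates and below it where the generator density dominates; once $p_g = p_{data}$, it is one half at every point.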

4.1 Global Optimality of $p_g = p_{data}$

For any $(a, b) \in \mathbb{R}^{2} \setminus \{0, 0\}$, the function $y \to a \log(y) + b \log(1 - y)$ achieves its maximum in $[0, 1]$ at $\frac{a}{a+b}$. The discriminator does not need to be defined outside of $Supp(p_{data}) \cup Supp(p_g)$, concluding the proof.
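The maximiser can be checked by setting the derivative $\frac{a}{y} - \frac{b}{1-y}$ to zero, which gives $y = \frac{a}{a+b}$. The sketch below confirms this by brute force on a grid. It assumes positive $a$ and $b$, the case that arises when they are density values; the helper names and the particular values of `a` and `b` are illustrative only.

```python
import math


def objective(y, a, b):
    """a*log(y) + b*log(1 - y), the pointwise integrand from the proof."""
    return a * math.log(y) + b * math.log(1.0 - y)


def grid_argmax(a, b, steps=10_000):
    """Brute-force maximiser over an evenly spaced grid strictly inside (0, 1)."""
    best_y, best_val = None, -math.inf
    for i in range(1, steps):
        y = i / steps
        val = objective(y, a, b)
        if val > best_val:
            best_y, best_val = y, val
    return best_y


# Illustrative positive values standing in for p_data(x) and p_g(x).
a, b = 0.3, 0.9
print(grid_argmax(a, b), a / (a + b))
```

The two printed numbers should agree up to the grid resolution.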

Note that the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y = y \mid x)$, where $Y$ indicates whether $x$ comes from $p_{data}$ (with $y = 1$) or from $p_g$ (with $y = 0$). The minimax game in Eq. 1 can now be reformulated as:

$$C(G) = \max_{D} V(G, D) = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \tag{4}$$

Proof. For $p_g = p_{data}$, $D_G^{*}(x) = \frac{1}{2}$ (consider Eq. 2). Hence, by inspecting Eq. 4 at $D_G^{*}(x) = \frac{1}{2}$, we find $C(G) = \log\frac{1}{2} + \log\frac{1}{2} = -\log 4$. To see that this is the best possible value of $C(G)$, reached only for $p_g = p_{data}$, observe that

$$\mathbb{E}_{x \sim p_{data}}\left[-\log 2\right] + \mathbb{E}_{x \sim p_g}\left[-\log 2\right] = -\log 4$$

and that by subtracting this expression from $C(G) = V(D_G^{*}, G)$, we obtain:

$$C(G) = -\log 4 + KL\left(p_{data} \,\Big\|\, \frac{p_{data} + p_g}{2}\right) + KL\left(p_g \,\Big\|\, \frac{p_{data} + p_g}{2}\right)$$
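As a numerical sanity check on the claim that $C(G)$ reaches $-\log 4$ only at $p_g = p_{data}$, the sketch below evaluates $C(G) = \mathbb{E}_{x \sim p_{data}}[\log D_G^{*}(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D_G^{*}(x))]$ on a finite support and compares it with $-\log 4$ plus twice the Jensen-Shannon divergence. The three-point distributions are made up for illustration, and the function names are not taken from the paper.

```python
import math


def c_of_g(p_data, p_g):
    """C(G) = E_{p_data}[log D*] + E_{p_g}[log(1 - D*)] on a finite support."""
    total = 0.0
    for d, g in zip(p_data, p_g):
        if d > 0.0:
            total += d * math.log(d / (d + g))
        if g > 0.0:
            total += g * math.log(g / (d + g))
    return total


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up distributions over three outcomes.
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

print(c_of_g(p_data, p_data), -math.log(4.0))
print(c_of_g(p_data, p_g), -math.log(4.0) + 2.0 * jsd(p_data, p_g))
```

Each printed pair should match up to floating-point error, and the second value of $C(G)$ should be strictly larger than $-\log 4$ because the two distributions differ.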

Proof. Consider $V(G, D) = U(p_g, D)$ as a function of $p_g$ as done in the above criterion. Note that $U(p_g, D)$ is convex in $p_g$. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if $f(x) = \sup_{\alpha \in A} f_{\alpha}(x)$ and $f_{\alpha}(x)$ is convex in $x$ for every $\alpha$, then $\partial f_{\beta}(x) \in \partial f$ if $\beta = \arg\sup_{\alpha \in A} f_{\alpha}(x)$. This is equivalent to computing a gradient descent update for $p_g$ at the optimal $D$ given the corresponding $G$. $\sup_{D} U(p_g, D)$ is convex in $p_g$ with a unique global optimum as proven in Thm 1; therefore, with sufficiently small updates of $p_g$, $p_g$ converges to $p_x$, concluding the proof.
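The argument can be illustrated on a toy problem in which $p_g$ is updated directly, as in the proof, rather than through a parametric generator. For a fixed $D$, $U(p_g, D)$ is linear in $p_g$, and its gradient at the optimal discriminator is $\log(1 - D^{*}(x))$. The sketch below uses a multiplicative (exponentiated-gradient) step followed by renormalisation so that $p_g$ stays a probability vector. That update rule, the step size, and the starting distributions are assumptions made for this illustration and are not the procedure of Algorithm 1.

```python
import math


def optimal_d(p_data, p_g):
    """Optimal discriminator D*(x) for the current p_g on a finite support."""
    return [d / (d + g) for d, g in zip(p_data, p_g)]


def c_of_g(p_data, p_g):
    """Value of the criterion at the optimal discriminator."""
    d_star = optimal_d(p_data, p_g)
    return sum(d * math.log(ds) + g * math.log(1.0 - ds)
               for d, g, ds in zip(p_data, p_g, d_star))


def update_generator(p_data, p_g, step=0.5):
    """One multiplicative descent step on p_g using the gradient at D*.

    The gradient of U(p_g, D*) with respect to p_g(x) is log(1 - D*(x)).
    """
    d_star = optimal_d(p_data, p_g)
    unnorm = [g * math.exp(-step * math.log(1.0 - ds))
              for g, ds in zip(p_g, d_star)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Made-up, strictly positive distributions over three outcomes.
p_data = [0.6, 0.3, 0.1]
p_g = [0.1, 0.2, 0.7]

for _ in range(200):
    p_g = update_generator(p_data, p_g)

print(p_g, c_of_g(p_data, p_g), -math.log(4.0))
```

In this toy setting $p_g$ moves toward $p_{data}$ and the criterion approaches $-\log 4$, in line with the statement of the proposition.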

http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
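Figure 3: Digits obtained by linearly interpolating between coordinates in z space of the full model.
Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.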