EXPLAINING AND HARNESSING ADVERSARIAL EXAMPLES

Original paper:

https://arxiv.org/pdf/1412.6572.pdf

GB/T Goodfellow I J, Shlens J, Szegedy C. Explaining and harnessing adversarial examples[J]. arXiv preprint arXiv:1412.6572, 2014.

MLA Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).

APA Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

ABSTRACT

Several machine learning models, including neural networks, consistently misclassify adversarial examples—inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks’ vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.

1. INTRODUCTION

Szegedy et al. (2014b) made an intriguing discovery: several machine learning models, including state-of-the-art neural networks, are vulnerable to adversarial examples. That is, these machine learning models misclassify examples that are only slightly different from correctly classified examples drawn from the data distribution. In many cases, a wide variety of models with different architectures trained on different subsets of the training data misclassify the same adversarial example. This suggests that adversarial examples expose fundamental blind spots in our training algorithms.

The cause of these adversarial examples was a mystery, and speculative explanations have suggested it is due to extreme nonlinearity of deep neural networks, perhaps combined with insufficient model averaging and insufficient regularization of the purely supervised learning problem. We show that these speculative hypotheses are unnecessary. Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples. This view enables us to design a fast method of generating adversarial examples that makes adversarial training practical. We show that adversarial training can provide an additional regularization benefit beyond that provided by using dropout (Srivastava et al., 2014) alone. Generic regularization strategies such as dropout, pretraining, and model averaging do not confer a significant reduction in a model’s vulnerability to adversarial examples, but changing to nonlinear model families such as RBF networks can do so.

Our explanation suggests a fundamental tension between designing models that are easy to train due to their linearity and designing models that use nonlinear effects to resist adversarial perturbation. In the long run, it may be possible to escape this tradeoff by designing more powerful optimization methods that can successfully train more nonlinear models.

2. RELATED WORK

Szegedy et al. (2014b) demonstrated a variety of intriguing properties of neural networks and related models. Those most relevant to this paper include:

  • Box-constrained L-BFGS can reliably find adversarial examples.

  • On some datasets, such as ImageNet (Deng et al., 2009), the adversarial examples were so close to the original examples that the differences were indistinguishable to the human eye.

  • The same adversarial example is often misclassified by a variety of classifiers with different architectures or trained on different subsets of the training data.

  • Shallow softmax regression models are also vulnerable to adversarial examples.

  • Training on adversarial examples can regularize the model—however, this was not practical at the time due to the need for expensive constrained optimization in the inner loop.

These results suggest that classifiers based on modern machine learning techniques, even those that obtain excellent performance on the test set, are not learning the true underlying concepts that determine the correct output label. Instead, these algorithms have built a Potemkin village that works well on naturally occurring data, but is exposed as a fake when one visits points in space that do not have high probability in the data distribution. This is particularly disappointing because a popular approach in computer vision is to use convolutional network features as a space where Euclidean distance approximates perceptual distance. This resemblance is clearly flawed if images that have an immeasurably small perceptual distance correspond to completely different classes in the network’s representation.

These results have often been interpreted as being a flaw in deep networks in particular, even though linear classifiers have the same problem. We regard the knowledge of this flaw as an opportunity to fix it. Indeed, Gu & Rigazio (2014) and Chalupka et al. (2014) have already begun the first steps toward designing models that resist adversarial perturbation, though no model has yet successfully done so while maintaining state of the art accuracy on clean inputs.

3. THE LINEAR EXPLANATION OF ADVERSARIAL EXAMPLES

We start with explaining the existence of adversarial examples for linear models.

In many problems, the precision of an individual input feature is limited. For example, digital images often use only 8 bits per pixel, so they discard all information below 1/255 of the dynamic range. Because the precision of the features is limited, it is not rational for the classifier to respond differently to an input x than to an adversarial input x̃ = x + η if every element of the perturbation η is smaller than the precision of the features. Formally, for problems with well-separated classes, we expect the classifier to assign the same class to x and x̃ so long as ‖η‖_∞ < ε, where ε is small enough to be discarded by the sensor or data storage apparatus associated with our problem.

Consider the dot product between a weight vector ω and an adversarial example x̃:

\omega^{T}\widetilde{x}=\omega^{T}x+\omega^{T}\eta\tag{1}

The adversarial perturbation causes the activation to grow by ω^T η. We can maximize this increase subject to the max norm constraint on η by assigning η = ε sign(ω). If ω has n dimensions and the average magnitude of an element of the weight vector is m, then the activation will grow by εmn. Since ‖η‖_∞ does not grow with the dimensionality of the problem, but the change in activation caused by perturbation by η can grow linearly with n, for high dimensional problems we can make many infinitesimal changes to the input that add up to one large change to the output. We can think of this as a sort of “accidental steganography,” where a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude.
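To make the arithmetic concrete, the following NumPy sketch (the values of ε and n are illustrative, not taken from the paper) shows that the worst-case perturbation η = ε sign(ω) shifts the activation by ε‖ω‖_1 ≈ εmn, which grows linearly with the dimension n even though ‖η‖_∞ stays fixed at ε.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                        # max-norm budget, far below the "precision" of each feature

for n in [10, 1_000, 100_000]:    # input dimensionality
    w = rng.standard_normal(n)    # weight vector of a linear model
    eta = eps * np.sign(w)        # worst-case perturbation under the max-norm constraint
    shift = w @ eta               # change in activation caused by eta
    # the two printed numbers agree: eps * ||w||_1 == eps * m * n
    print(n, shift, eps * np.abs(w).mean() * n)
```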

This explanation shows that a simple linear model can have adversarial examples if its input has sufficient dimensionality. Previous explanations for adversarial examples invoked hypothesized properties of neural networks, such as their supposed highly non-linear nature. Our hypothesis based on linearity is simpler, and can also explain why softmax regression is vulnerable to adversarial examples.

4. LINEAR PERTURBATION OF NON-LINEAR MODELS

The linear view of adversarial examples suggests a fast way of generating them. We hypothesize that neural networks are too linear to resist linear adversarial perturbation. LSTMs (Hochreiter & Schmidhuber, 1997), ReLUs (Jarrett et al., 2009; Glorot et al., 2011), and maxout networks (Goodfellow et al., 2013c) are all intentionally designed to behave in very linear ways, so that they are easier to optimize. More nonlinear models such as sigmoid networks are carefully tuned to spend most of their time in the non-saturating, more linear regime for the same reason. This linear behavior suggests that cheap, analytical perturbations of a linear model should also damage neural networks.

Let θ be the parameters of a model, x the input to the model, y the targets associated with x (for machine learning tasks that have targets), and J(θ, x, y) the cost used to train the neural network. We can linearize the cost function around the current value of θ, obtaining an optimal max-norm constrained perturbation of:

\eta=\epsilon\,\mathrm{sign}(\nabla_{x}J(\theta,x,y)).\tag{2}

We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that the required gradient can be computed efficiently using backpropagation.
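A minimal PyTorch sketch of this method is shown below. The model and ε are placeholders (any differentiable classifier works), and the input is assumed to be an image scaled to [0, 1]; this is an illustration under those assumptions, not the paper's original Theano/Pylearn2 implementation.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Return x + eps * sign(grad_x J(theta, x, y))  (Eq. 2)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # J(theta, x, y)
    loss.backward()                          # one backpropagation pass gives grad_x J
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()    # keep the perturbed image in its valid range
```

For example, `x_adv = fgsm(net, images, labels, eps=0.25)` would correspond to the ε = .25 setting used in the MNIST experiments described below.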

We find that this method reliably causes a wide variety of models to misclassify their input. See Fig. 1 for a demonstration on ImageNet. We find that using ε = .25, we cause a shallow softmax classifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST test set. In the same setting, a maxout network misclassifies 89.4% of our adversarial examples with an average confidence of 97.6%. Similarly, using ε = .1, we obtain an error rate of 87.15% and an average probability of 96.6% assigned to the incorrect labels when using a convolutional maxout network on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set. Other simple methods of generating adversarial examples are possible. For example, we also found that rotating x by a small angle in the direction of the gradient reliably produces adversarial examples.

The fact that these simple, cheap algorithms are able to generate misclassified examples serves as evidence in favor of our interpretation of adversarial examples as a result of linearity. The algorithms are also useful as a way of speeding up adversarial training or even just analysis of trained networks.

5. ADVERSARIAL TRAINING OF LINEAR MODELS VERSUS WEIGHT DECAY

Perhaps the simplest possible model we can consider is logistic regression. In this case, the fast gradient sign method is exact. We can use this case to gain some intuition for how adversarial examples are generated in a simple setting. See Fig. 2 for instructive images.

If we train a single model to recognize labels y ∈ {−1, 1} with P(y = 1) = σ(ω^T x + b), where σ(z) is the logistic sigmoid function, then training consists of gradient descent on

\mathbb{E}_{x,y\sim p_{\text{data}}}\,\zeta(-y(\omega^{T}x+b))\tag{3}

where ζ(z) = log(1 + exp(z)) is the softplus function. We can derive a simple analytical form for training on the worst-case adversarial perturbation of x rather than x itself, based on gradient sign perturbation. Note that the sign of the gradient is just −sign(ω), and that ω^T sign(ω) = ‖ω‖_1. The adversarial version of logistic regression is therefore to minimize

\mathbb{E}_{x,y\sim p_{\text{data}}}\,\zeta(y(\epsilon \lVert \omega \rVert_{1}-\omega^{T}x-b))\tag{4}

This is somewhat similar to L^1 regularization. However, there are some important differences. Most significantly, the L^1 penalty is subtracted off the model’s activation during training, rather than added to the training cost. This means that the penalty can eventually start to disappear if the model learns to make confident enough predictions that ζ saturates. This is not guaranteed to happen; in the underfitting regime, adversarial training will simply worsen underfitting. We can thus view L^1 weight decay as being more “worst case” than adversarial training, because it fails to deactivate in the case of good margin.
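Because the fast gradient sign method is exact for logistic regression, the objective in Eq. 4 can be implemented directly. A minimal NumPy sketch (hypothetical function names) constructs the analytically worst-case input x − εy sign(ω) and evaluates the ordinary softplus loss on it, which makes the ε‖ω‖_1 penalty term explicit:

```python
import numpy as np

def softplus(z):
    # numerically stable log(1 + exp(z))
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

def adversarial_logreg_loss(w, b, X, y, eps):
    """Softplus loss on the worst-case inputs x - eps * y * sign(w), with y in {-1, +1}.
    Expanding the dot product gives -y (w^T x + b) + eps * ||w||_1, i.e. the
    eps * ||w||_1 penalty term appearing in Eq. 4."""
    margin = y * (X @ w + b)
    return np.mean(softplus(-margin + eps * np.abs(w).sum()))
```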

If we move beyond logistic regression to multiclass softmax regression, L^1 weight decay becomes even more pessimistic, because it treats each of the softmax’s outputs as independently perturbable, when in fact it is usually not possible to find a single η that aligns with all of the classes’ weight vectors. Weight decay overestimates the damage achievable with perturbation even more in the case of a deep network with multiple hidden units. Because L^1 weight decay overestimates the amount of damage an adversary can do, it is necessary to use a smaller L^1 weight decay coefficient than the ε associated with the precision of our features. When training maxout networks on MNIST, we obtained good results using adversarial training with ε = .25. When applying L^1 weight decay to the first layer, we found that even a coefficient of .0025 was too large, and caused the model to get stuck with over 5% error on the training set. Smaller weight decay coefficients permitted successful training but conferred no regularization benefit.

6. ADVERSARIAL TRAINING OF DEEP NETWORKS

The criticism of deep networks as vulnerable to adversarial examples is somewhat misguided, because unlike shallow linear models, deep networks are at least able to represent functions that resist adversarial perturbation. The universal approximator theorem (Hornik et al., 1989) guarantees that a neural network with at least one hidden layer can represent any function to an arbitrary degree of accuracy so long as its hidden layer is permitted to have enough units. Shallow linear models are not able to become constant near training points while also assigning different outputs to different training points.

Of course, the universal approximator theorem does not say anything about whether a training algorithm will be able to discover a function with all of the desired properties. Obviously, standard supervised training does not specify that the chosen function be resistant to adversarial examples. This must be encoded in the training procedure somehow.

Szegedy et al. (2014b) showed that by training on a mixture of adversarial and clean examples, a neural network could be regularized somewhat. Training on adversarial examples is somewhat different from other data augmentation schemes; usually, one augments the data with transformations such as translations that are expected to actually occur in the test set. This form of data augmentation instead uses inputs that are unlikely to occur naturally but that expose flaws in the ways that the model conceptualizes its decision function. At the time, this procedure was never demonstrated to improve beyond dropout on a state of the art benchmark. However, this was partially because it was difficult to experiment extensively with expensive adversarial examples based on L-BFGS.

We found that training with an adversarial objective function based on the fast gradient sign method was an effective regularizer:

\widetilde{J}(\theta,x,y)=\alpha J(\theta,x,y)+(1-\alpha)J(\theta,\,x+\epsilon\,\mathrm{sign}(\nabla_{x}J(\theta,x,y)),\,y)\tag{5}

In all of our experiments, we used α = 0.5. Other values may work better; our initial guess of this hyperparameter worked well enough that we did not feel the need to explore more. This approach means that we continually update our supply of adversarial examples, to make them resist the current version of the model. Using this approach to train a maxout network that was also regularized with dropout, we were able to reduce the error rate from 0.94% without adversarial training to 0.84% with adversarial training.
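A minimal PyTorch sketch of one training step on the objective in Eq. 5, reusing the imports and the `fgsm` helper from the earlier sketch; α = 0.5 and ε = 0.25 follow the settings reported in the text, while the model, optimizer, and data are placeholders.

```python
def adversarial_training_step(model, optimizer, x, y, eps=0.25, alpha=0.5):
    """One SGD step on J~ = alpha * J(x, y) + (1 - alpha) * J(x_adv, y)  (Eq. 5)."""
    x_adv = fgsm(model, x, y, eps)          # fresh adversarial examples for the current weights
    optimizer.zero_grad()
    clean_loss = F.cross_entropy(model(x), y)
    adv_loss = F.cross_entropy(model(x_adv), y)
    loss = alpha * clean_loss + (1 - alpha) * adv_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```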

We observed that we were not reaching zero error rate on adversarial examples on the training set. We fixed this problem by making two changes. First, we made the model larger, using 1600 units per layer rather than the 240 used by the original maxout network for this problem. Without adversarial training, this causes the model to overfit slightly, and get an error rate of 1.14% on the test set. With adversarial training, we found that the validation set error leveled off over time, and made very slow progress. The original maxout result uses early stopping, and terminates learning after the validation set error rate has not decreased for 100 epochs. We found that while the validation set error was very flat, the adversarial validation set error was not. We therefore used early stopping on the adversarial validation set error. Using this criterion to choose the number of epochs to train for, we then retrained on all 60,000 examples. Five different training runs using different seeds for the random number generators used to select minibatches of training examples, initialize model weights, and generate dropout masks result in four trials that each had an error rate of 0.77% on the test set and one trial that had an error rate of 0.83%. The average of 0.782% is the best result reported on the permutation invariant version of MNIST, though statistically indistinguishable from the result obtained by fine-tuning DBMs with dropout (Srivastava et al., 2014) at 0.79%.
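The early-stopping criterion described here might be sketched as follows (hypothetical helper names, reusing `fgsm` and `adversarial_training_step` from the sketches above); the point is simply that the stopping signal is the error on adversarially perturbed validation data rather than on clean validation data.

```python
import torch

def train_with_adversarial_early_stopping(model, optimizer, train_loader, x_val, y_val,
                                           eps=0.25, patience=100):
    """Stop once the adversarial validation error has not improved for `patience` epochs,
    and return the epoch count to reuse when retraining on the full training set."""
    best_err, best_epoch, epoch = float("inf"), 0, 0
    while epoch - best_epoch < patience:
        for x, y in train_loader:
            adversarial_training_step(model, optimizer, x, y, eps=eps)
        x_adv = fgsm(model, x_val, y_val, eps)   # perturb the validation set
        with torch.no_grad():
            err = (model(x_adv).argmax(dim=1) != y_val).float().mean().item()
        if err < best_err:
            best_err, best_epoch = err, epoch
        epoch += 1
    return best_epoch
```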

The model also became somewhat resistant to adversarial examples. Recall that without adversarial training, this same kind of model had an error rate of 89.4% on adversarial examples based on the fast gradient sign method. With adversarial training, the error rate fell to 17.9%. Adversarial examples are transferable between the two models but with the adversarially trained model showing greater robustness. Adversarial examples generated via the original model yield an error rate of 19.6% on the adversarially trained model, while adversarial examples generated via the new model yield an error rate of 40.9% on the original model. When the adversarially trained model does misclassify an adversarial example, its predictions are unfortunately still highly confident. The average confidence on a misclassified example was 81.4%. We also found that the weights of the learned model changed significantly, with the weights of the adversarially trained model being significantly more localized and interpretable (see Fig. 3).

The adversarial training procedure can be seen as minimizing the worst case error when the data is perturbed by an adversary. That can be interpreted as learning to play an adversarial game, or as minimizing an upper bound on the expected cost over noisy samples with noise from U(−ε, ε) added to the inputs. Adversarial training can also be seen as a form of active learning, where the model is able to request labels on new points. In this case the human labeler is replaced with a heuristic labeler that copies labels from nearby points.

We could also regularize the model to be insensitive to changes in its features that are smaller than the ε precision simply by training on all points within the ε max norm box, or sampling many points within this box. This corresponds to adding noise with max norm ε during training. However, noise with zero mean and zero covariance is very inefficient at preventing adversarial examples. The expected dot product between any reference vector and such a noise vector is zero. This means that in many cases the noise will have essentially no effect rather than yielding a more difficult input.
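The inefficiency of zero-mean noise is easy to check numerically. A short NumPy sketch (illustrative dimension and ε) compares the activation shift caused by uniform noise from U(−ε, ε) with the shift caused by the worst-case sign perturbation of the same max norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 10_000, 0.25
w = rng.standard_normal(n)

noise = rng.uniform(-eps, eps, size=n)   # zero-mean noise: expected dot product with w is 0
adv = eps * np.sign(w)                   # worst-case perturbation under the same max norm

print(w @ noise)   # fluctuates around zero, tiny compared to the adversarial shift
print(w @ adv)     # eps * ||w||_1, roughly eps * 0.8 * n for standard normal weights
```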

In fact, in many cases the noise will actually result in a lower objective function value. We can think of adversarial training as doing hard example mining among the set of noisy inputs, in order to train more efficiently by considering only those noisy points that strongly resist classification. As control experiments, we trained a maxout network with noise based on randomly adding ±ε to each pixel, or adding noise in U(−ε, ε) to each pixel. These obtained an error rate of 86.2% with confidence 97.3% and an error rate of 90.4% with a confidence of 97.8%, respectively, on fast gradient sign adversarial examples.

Because the derivative of the sign function is zero or undefined everywhere, gradient descent on the adversarial objective function based on the fast gradient sign method does not allow the model to anticipate how the adversary will react to changes in the parameters. If we instead use adversarial examples based on small rotations or addition of the scaled gradient, then the perturbation process is itself differentiable and the learning can take the reaction of the adversary into account. However, we did not find nearly as powerful of a regularizing result from this process, perhaps because these kinds of adversarial examples are not as difficult to solve.

One natural question is whether it is better to perturb the input or the hidden layers or both. Here the results are inconsistent. Szegedy et al. (2014b) reported that adversarial perturbations yield the best regularization when applied to the hidden layers. That result was obtained on a sigmoidal network. In our experiments with the fast gradient sign method, we find that networks with hidden units whose activations are unbounded simply respond by making their hidden unit activations very large, so it is usually better to just perturb the original input. On saturating models such as the Rust model we found that perturbation of the input performed comparably to perturbation of the hidden layers. Perturbations based on rotating the hidden layers solve the problem of unbounded activations growing to make additive perturbations smaller by comparison. We were able to successfully train maxout networks with rotational perturbations of the hidden layers. However, this did not yield nearly as strong of a regularizing effect as additive perturbation of the input layer. Our view of adversarial training is that it is only clearly useful when the model has the capacity to learn to resist adversarial examples. This is only clearly the case when a universal approximator theorem applies. Because the last layer of a neural network, the linear-sigmoid or linear-softmax layer, is not a universal approximator of functions of the final hidden layer, this suggests that one is likely to encounter problems with underfitting when applying adversarial perturbations to the final hidden layer. We indeed found this effect. Our best results with training using perturbations of hidden layers never involved perturbations of the final hidden layer.

7. DIFFERENT KINDS OF MODEL CAPACITY

One reason that the existence of adversarial examples can seem counter-intuitive is that most of us have poor intuitions for high dimensional spaces. We live in three dimensions, so we are not used to small effects in hundreds of dimensions adding up to create a large effect. There is another way that our intuitions serve us poorly. Many people think of models with low capacity as being unable to make many different confident predictions. This is not correct. Some models with low capacity do exhibit this behavior. For example shallow RBF networks with

p(y=1|x)=exp((x-\mu)^{T}\beta(x-\mu))\tag{6}

are only able to confidently predict that the positive class is present in the vicinity of µ. Elsewhere, they default to predicting the class is absent, or have low-confidence predictions.

RBF networks are naturally immune to adversarial examples, in the sense that they have low confidence when they are fooled. A shallow RBF network with no hidden layers gets an error rate of 55.4% on MNIST using adversarial examples generated with the fast gradient sign method and ε = .25. However, its confidence on mistaken examples is only 1.2%. Its average confidence on clean test examples is 60.6%. We can’t expect a model with such low capacity to get the right answer at all points of space, but it does correctly respond by reducing its confidence considerably on points it does not “understand.”
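A sketch of the RBF unit in Eq. 6 (NumPy, with a hypothetical isotropic choice β = −γI) illustrates the behavior described here: the unit is confident only near its template µ, and its confidence collapses elsewhere.

```python
import numpy as np

def rbf_confidence(x, mu, gamma):
    """p(y=1 | x) = exp(-gamma * ||x - mu||^2), i.e. Eq. 6 with beta = -gamma * I."""
    d = x - mu
    return np.exp(-gamma * np.dot(d, d))

mu = np.zeros(784)
print(rbf_confidence(mu, mu, gamma=0.1))                        # 1.0 at the template itself
print(rbf_confidence(mu + 0.25 * np.ones(784), mu, gamma=0.1))  # ~0.007: one FGSM-sized step away
```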

RBF units are unfortunately not invariant to any significant transformations so they cannot generalize very well. We can view linear units and RBF units as different points on a precision-recall tradeoff curve. Linear units achieve high recall by responding to every input in a certain direction, but may have low precision due to responding too strongly in unfamiliar situations. RBF units achieve high precision by responding only to a specific point in space, but in doing so sacrifice recall. Motivated by this idea, we decided to explore a variety of models involving quadratic units, including deep RBF networks. We found this to be a difficult task: every model with sufficient quadratic inhibition to resist adversarial perturbation obtained high training set error when trained with SGD.

8. WHY DO ADVERSARIAL EXAMPLES GENERALIZE?

An intriguing aspect of adversarial examples is that an example generated for one model is often misclassified by other models, even when they have different architectures or were trained on disjoint training sets. Moreover, when these different models misclassify an adversarial example, they often agree with each other on its class. Explanations based on extreme non-linearity and overfitting cannot readily account for this behavior: why should multiple extremely non-linear models with excess capacity consistently label out-of-distribution points in the same way? This behavior is especially surprising from the view of the hypothesis that adversarial examples finely tile space like the rational numbers among the reals, because in this view adversarial examples are common but occur only at very precise locations.

Under the linear view, adversarial examples occur in broad subspaces. The direction η need only have positive dot product with the gradient of the cost function, and ε need only be large enough. Fig. 4 demonstrates this phenomenon. By tracing out different values of ε, we see that adversarial examples occur in contiguous regions of the 1-D subspace defined by the fast gradient sign method, not in fine pockets. This explains why adversarial examples are abundant and why an example misclassified by one classifier has a fairly high prior probability of being misclassified by another classifier.
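The observation behind Fig. 4 corresponds to a sweep along the 1-D subspace defined by the fast gradient sign direction. A PyTorch sketch (placeholder model and data) might look like:

```python
import torch
import torch.nn.functional as F

def sweep_epsilon(model, x, y, eps_values):
    """Record the predicted class at x + eps * sign(grad_x J) for a range of eps values.
    Adversarial classes show up as contiguous intervals of eps, not isolated pockets."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    direction = x.grad.sign()
    with torch.no_grad():
        return [model(x.detach() + eps * direction).argmax(dim=1) for eps in eps_values]

# e.g. preds = sweep_epsilon(net, images, labels, torch.linspace(-0.5, 0.5, 41))
```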

To explain why multiple classifiers assign the same class to adversarial examples, we hypothesize that neural networks trained with current methodologies all resemble the linear classifier learned on the same training set. This reference classifier is able to learn approximately the same classification weights when trained on different subsets of the training set, simply because machine learning algorithms are able to generalize. The stability of the underlying classification weights in turn results in the stability of adversarial examples.

To test this hypothesis, we generated adversarial examples on a deep maxout network and classified these examples using a shallow softmax network and a shallow RBF network. On examples that were misclassified by the maxout network, the RBF network predicted the maxout network’s class assignment only 16.0% of the time, while the softmax classifier predicted the maxout network’s class correctly 54.6% of the time. These numbers are largely driven by the differing error rate of the different models though. If we restrict our attention to cases where both models being compared make a mistake, then softmax regression predicts maxout’s class 84.6% of the time, while the RBF network is able to predict maxout’s class only 54.3% of the time. For comparison, the RBF network can predict softmax regression’s class 53.6% of the time, so it does have a strong linear component to its own behavior. Our hypothesis does not explain all of the maxout network’s mistakes or all of the mistakes that generalize across models, but clearly a significant proportion of them are consistent with linear behavior being a major cause of cross-model generalization.
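The agreement numbers above correspond to a measurement like the following sketch (placeholder models and data, reusing the `fgsm` helper defined earlier):

```python
import torch

def agreement_on_shared_mistakes(source_model, other_model, x, y, eps):
    """Among adversarial examples crafted against source_model that BOTH models misclassify,
    how often does other_model agree with source_model's (wrong) class assignment?"""
    x_adv = fgsm(source_model, x, y, eps)
    with torch.no_grad():
        p_src = source_model(x_adv).argmax(dim=1)
        p_oth = other_model(x_adv).argmax(dim=1)
    both_wrong = (p_src != y) & (p_oth != y)
    return (p_src[both_wrong] == p_oth[both_wrong]).float().mean().item()
```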

9. ALTERNATIVE HYPOTHESES

We now consider and refute some alternative hypotheses for the existence of adversarial examples. First, one hypothesis is that generative training could provide more constraint on the training process, or cause the model to learn to distinguish “real” from “fake” data and be confident only on “real” data. The MP-DBM (Goodfellow et al., 2013a) provides a good model to test this hypothesis. Its inference procedure gets good classification accuracy (a 0.88% error rate) on MNIST. This inference procedure is differentiable. Other generative models either have non-differentiable inference procedures, making it harder to compute adversarial examples, or require an additional non-generative discriminator model to get good classification accuracy on MNIST. In the case of the MP-DBM, we can be sure that the generative model itself is responding to adversarial examples, rather than the non-generative classifier model on top. We find that the model is vulnerable to adversarial examples. With an ε of 0.25, we find an error rate of 97.5% on adversarial examples generated from the MNIST test set. It remains possible that some other form of generative training could confer resistance, but clearly the mere fact of being generative is not alone sufficient.

Another hypothesis about why adversarial examples exist is that individual models have strange quirks but averaging over many models can cause adversarial examples to wash out. To test this hypothesis, we trained an ensemble of twelve maxout networks on MNIST. Each network was trained using a different seed for the random number generator used to initialize the weights, generate dropout masks, and select minibatches of data for stochastic gradient descent. The ensemble gets an error rate of 91.1% on adversarial examples designed to perturb the entire ensemble with ε = .25. If we instead use adversarial examples designed to perturb only one member of the ensemble, the error rate falls to 87.9%. Ensembling provides only limited resistance to adversarial perturbation.

10. SUMMARY AND DISCUSSION

As a summary, this paper has made the following observations:

  • Adversarial examples can be explained as a property of high-dimensional dot products. They are a result of models being too linear, rather than too nonlinear.

  • The generalization of adversarial examples across different models can be explained as a result of adversarial perturbations being highly aligned with the weight vectors of a model, and different models learning similar functions when trained to perform the same task.

  • The direction of perturbation, rather than the specific point in space, matters most. Space is not full of pockets of adversarial examples that finely tile the reals like the rational numbers.

  • Because it is the direction that matters most, adversarial perturbations generalize across different clean examples.

  • We have introduced a family of fast methods for generating adversarial examples.

  • We have demonstrated that adversarial training can result in regularization; even further regularization than dropout.

  • We have run control experiments that failed to reproduce this effect with simpler but less efficient regularizers including L^1 weight decay and adding noise.

  • Models that are easy to optimize are easy to perturb.

  • Linear models lack the capacity to resist adversarial perturbation; only structures with a hidden layer (where the universal approximator theorem applies) should be trained to resist adversarial perturbation.

  • RBF networks are resistant to adversarial examples.

  • Models trained to model the input distribution are not resistant to adversarial examples.

  • Ensembles are not resistant to adversarial examples.

Some further observations concerning rubbish class examples are presented in the appendix:

  • Rubbish class examples are ubiquitous and easily generated.

  • Shallow linear models are not resistant to rubbish class examples.

  • RBF networks are resistant to rubbish class examples.

Gradient-based optimization is the workhorse of modern AI. Using a network that has been designed to be sufficiently linear (whether it is a ReLU or maxout network, an LSTM, or a sigmoid network that has been carefully configured not to saturate too much), we are able to fit most problems we care about, at least on the training set. The existence of adversarial examples suggests that being able to explain the training data or even being able to correctly label the test data does not imply that our models truly understand the tasks we have asked them to perform. Instead, their linear responses are overly confident at points that do not occur in the data distribution, and these confident predictions are often highly incorrect. This work has shown we can partially correct for this problem by explicitly identifying problematic points and correcting the model at each of these points. However, one may also conclude that the model families we use are intrinsically flawed. Ease of optimization has come at the cost of models that are easily misled. This motivates the development of optimization procedures that are able to train models whose behavior is more locally stable.

ACKNOWLEDGMENTS

We would like to thank Geoffrey Hinton and Ilya Sutskever for helpful discussions. We would also like to thank Jeff Dean, Greg Corrado, and Oriol Vinyals for their feedback on drafts of this article. We would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012), Pylearn2 (Goodfellow et al., 2013b), and DistBelief (Dean et al., 2012).

REFERENCES

Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian J., Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.

Chalupka, K., Perona, P., and Eberhardt, F. Visual Causal Feature Learning. ArXiv e-prints, December 2014.

Dean, Jeffrey, Corrado, Greg S., Monga, Rajat, Chen, Kai, Devin, Matthieu, Le, Quoc V., Mao, Mark Z., Ranzato, Marc'Aurelio, Senior, Andrew, Tucker, Paul, Yang, Ke, and Ng, Andrew Y. Large scale distributed deep networks. In NIPS, 2012.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), April 2011.

Goodfellow, Ian J., Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Multi-prediction deep Boltzmann machines. In Neural Information Processing Systems, December 2013a.

Goodfellow, Ian J., Warde-Farley, David, Lamblin, Pascal, Dumoulin, Vincent, Mirza, Mehdi, Pascanu, Razvan, Bergstra, James, Bastien, Frédéric, and Bengio, Yoshua. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013b.

Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout networks. In Dasgupta, Sanjoy and McAllester, David (eds.), International Conference on Machine Learning, pp. 1319–1327, 2013c.

Gu, Shixiang and Rigazio, Luca. Towards deep neural network architectures robust to adversarial examples. In NIPS Workshop on Deep Learning and Representation Learning, 2014.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hornik, Kurt, Stinchcombe, Maxwell, and White, Halbert. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.

Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc’Aurelio, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pp. 2146–2153. IEEE, 2009.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Nguyen, A., Yosinski, J., and Clune, J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. ArXiv e-prints, December 2014.

Rust, Nicole, Schwartz, Odelia, Movshon, J. Anthony, and Simoncelli, Eero. Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6):945–956, 2005.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15 (1):1929–1958, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. Technical report, arXiv preprint arXiv:1409.4842, 2014a.

Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian J., and Fergus, Rob. Intriguing properties of neural networks. ICLR, abs/1312.6199, 2014b. URL http://arxiv.org/abs/1312.6199.

APPENDIX A: RUBBISH CLASS EXAMPLES

A concept related to adversarial examples is the concept of examples drawn from a “rubbish class.” These examples are degenerate inputs that a human would classify as not belonging to any of the categories in the training set. If we call these classes in the training set “the positive classes,” then we want to be careful to avoid false positives on rubbish inputs, i.e., we do not want to classify a degenerate input as being something real. In the case of separate binary classifiers for each class, we want all classes to output near zero probability of the class being present, and in the case of a multinoulli distribution over only the positive classes, we would prefer that the classifier output a high-entropy (nearly uniform) distribution over the classes. The traditional approach to reducing vulnerability to rubbish inputs is to introduce an extra, constant output to the model representing the rubbish class. Nguyen et al. (2014) recently re-popularized the concept of the rubbish class in the context of computer vision under the name “fooling images.” As with adversarial examples, there has been a misconception that rubbish class false positives are hard to find, and that they are primarily a problem faced by deep networks.

Our explanation of adversarial examples as the result of linearity and high dimensional spaces also applies to analyzing the behavior of the model on rubbish class examples. Linear models produce more extreme predictions at points that are far from the training data than at points that are near the training data. In order to find high confidence rubbish false positives for such a model, we need only generate a point that is far from the data, with larger norms yielding more confidence. RBF networks, which are not able to confidently predict the presence of any class far from the training data, are not fooled by this phenomenon.

We generated 10,000 samples from N(0, I_784) and fed them into various classifiers on the MNIST dataset. In this context, we consider assigning a probability greater than 0.5 to any class to be an error. A naively trained maxout network with a softmax layer on top had an error rate of 98.35% on Gaussian rubbish examples with an average confidence of 92.8% on mistakes. Changing the top layer to independent sigmoids dropped the error rate to 68% with an average confidence on mistakes of 87.9%. On CIFAR-10, using 1,000 samples from N(0, I_3072), a convolutional maxout net obtains an error rate of 93.4%, with an average confidence of 84.4%.
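The Gaussian rubbish experiment amounts to the following check (a PyTorch sketch; the 784-dimensional N(0, I) samples and the 0.5 confidence threshold follow the text, while the model is a placeholder for whatever classifier is being probed):

```python
import torch

def rubbish_error_rate(model, n_samples=10_000, dim=784, threshold=0.5):
    """Feed N(0, I) noise into a classifier and report (i) how often it assigns
    probability > threshold to some class (an error on rubbish inputs) and
    (ii) its average confidence on those errors."""
    noise = torch.randn(n_samples, dim)
    with torch.no_grad():
        probs = torch.softmax(model(noise), dim=1)
    top = probs.max(dim=1).values
    confident = top > threshold
    return confident.float().mean().item(), top[confident].mean().item()
```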

These experiments suggest that the optimization algorithms employed by Nguyen et al. (2014) are overkill (or perhaps only needed on ImageNet), and that the rich geometric structure in their fooling images is due to the priors encoded in their search procedures, rather than those structures being uniquely able to cause false positives.

Though Nguyen et al. (2014) focused their attention on deep networks, shallow linear models have the same problem. A softmax regression model has an error rate of 59.8% on the rubbish examples, with an average confidence on mistakes of 70.8%. If we use instead an RBF network, which does not behave like a linear function, we find an error rate of 0%. Note that when the error rate is zero the average confidence on a mistake is undefined.

Nguyen et al. (2014) focused on the problem of generating fooling images for a specific class, which is a harder problem than simply finding points that the network confidently classifies as belonging to any one class despite being defective. The above methods on MNIST and CIFAR-10 tend to have a very skewed distribution over classes. On MNIST, 45.3% of a naively trained maxout network’s false positives were classified as 5s, and none were classified as 8s. Likewise, on CIFAR-10, 49.7% of the convolutional network’s false positives were classified as frogs, and none were classified as airplanes, automobiles, horses, ships, or trucks.

To solve the problem introduced by Nguyen et al. (2014) of generating a fooling image for a particular class, we propose adding ε∇_x p(y = i | x) to a Gaussian sample x as a fast method of generating a fooling image classified as class i. If we repeat this sampling process until it succeeds, we obtain a randomized algorithm with variable runtime. On CIFAR-10, we found that one sampling step had a 100% success rate for frogs and trucks, and the hardest class was airplanes, with a success rate of 24.7% per sampling step. Averaged over all ten classes, the method has an average per-step success rate of 75.3%. We can thus generate any desired class with a handful of samples and no special priors, rather than tens of thousands of generations of evolution. To confirm that the resulting examples are indeed fooling images, and not images of real classes rendered by the gradient sign method, see Fig. 5. The success rate of this method in terms of generating members of class i may degrade for datasets with more classes, since the risk of inadvertently increasing the activation of a different class j increases in that case. We found that we were able to train a maxout network to have a zero percent error rate on Gaussian rubbish examples (it was still vulnerable to rubbish examples generated by applying a fast gradient sign step to a Gaussian sample) with no negative impact on its ability to classify clean examples. Unfortunately, unlike training on adversarial examples, this did not result in any significant reduction of the model’s test set error rate.
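The proposed fast fooling-image procedure is a single gradient step on p(y = i | x) starting from a Gaussian sample, repeated until it succeeds. A PyTorch sketch is shown below (placeholder model; the max_tries bound is added here only to keep the loop finite):

```python
import torch

def fast_fooling_image(model, target_class, eps, dim=3072, max_tries=20):
    """Draw x ~ N(0, I) and take one step x + eps * grad_x p(y = target_class | x);
    accept when the model assigns that class probability > 0.5."""
    for _ in range(max_tries):
        x = torch.randn(1, dim, requires_grad=True)
        p = torch.softmax(model(x), dim=1)[0, target_class]
        p.backward()
        candidate = (x + eps * x.grad).detach()
        with torch.no_grad():
            if torch.softmax(model(candidate), dim=1)[0, target_class] > 0.5:
                return candidate
    return None   # no success within max_tries for this class
```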

In conclusion, it appears that a randomly selected input to deep or shallow models built from linear parts is overwhelmingly likely to be processed incorrectly, and that these models only behave reasonably on a very thin manifold encompassing the training data.
