神经网络的有趣特性

Intriguing properties of neural networks

原文连接:

https://arxiv.org/pdf/1312.6199.pdf

GB/T 7714 Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks[J]. arXiv preprint arXiv:1312.6199, 2013.

MLA Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).

APA Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Abstract

Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties.

深度神经网络是近年来在语音和视觉识别任务中取得最新性能的高度表达模型。虽然它们的表现力是它们成功的原因，但它也会使它们学习不可理解的解决方案，这些解决方案可能具有反直觉的特性。在本文中，我们介绍了两个这样的性质。

First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.

首先，根据不同的单元分析方法，我们发现单个高层单元和高层单元的随机线性组合之间没有区别。这表明，在神经网络的高层中，包含语义信息的是空间，而不是单个单元。

Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain hardly perceptible perturbation, which is found by maximizing the network’s prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input. 其次，我们发现深度神经网络学习的输入-输出映射在很大程度上是不连续的。通过对网络的预测误差进行最大化，我们可以利用一定的不易察觉的扰动使网络误分类。此外，这些扰动的具体性质并不是学习到的随机伪影：相同的扰动会导致在数据集的不同子集上训练的不同网络对相同的输入进行错误分类。

1. Introduction

Deep neural networks are powerful learning models that achieve excellent performance on visual and speech recognition problems [9, 8]. Neural networks achieve high performance because they can express arbitrary computation that consists of a modest number of massively parallel nonlinear steps. But as the resulting computation is automatically discovered by backpropagation via supervised learning, it can be difficult to interpret and can have counter-intuitive properties. In this paper, we discuss two counter-intuitive properties of deep neural networks.

深度神经网络是一种强大的学习模型，在视觉和语音识别问题上具有优异的性能[9，8]。神经网络由于能够表达由少量大规模并行非线性步骤组成的任意计算而获得高性能。但是，由于计算结果是通过有监督学习的反向传播自动发现的，因此很难解释，并且可能具有反直观的特性。本文讨论了深度神经网络的两个反直觉性质。

The first property is concerned with the semantic meaning of individual units. Previous works [6, 13, 7] analyzed the semantic meaning of various units by finding the set of inputs that maximally activate a given unit. The inspection of individual units makes the implicit assumption that the units of the last feature layer form a distinguished basis which is particularly useful for extracting semantic information. Instead, we show in section 3 that random projections of $\phi(x)$ are semantically indistinguishable from the coordinates of $\phi(x)$ . This puts into question the conjecture that neural networks disentangle variation factors across coordinates. Generally, it seems that it is the entire space of activations, rather than the individual units, that contains the bulk of the semantic information. A similar, but even stronger conclusion was reached recently by Mikolov et al. [12] for word representations, where the various directions in the vector space representing the words are shown to give rise to a surprisingly rich semantic encoding of relations and analogies. At the same time,the vector representations are stable up to a rotation of the space, so the individual units of the vector representations are unlikely to contain semantic information.

第一个属性涉及单个单元的语义。以前的研究[6，13，7]通过找到最大限度地激活一个给定单元的输入集来分析各个单元的语义。对单个单元的检查使得隐式假设最后一个特征层的单元形成一个区分的基础，这对于提取语义信息特别有用。相反，我们在第3节中指出， $\phi(x)$ 的随机投影在语义上与 $\phi(x)$ 的坐标不可区分。这就对神经网络将坐标上的变化因素分离的推测提出了质疑。一般来说，包含大量语义信息的似乎是整个激活空间，而不是单个单元。对于词的表示，Mikolov等人[12] 最近得出了一个类似但更有力的结论，在表示词的向量空间中的各个方向被显示为产生令人惊讶且丰富的关系与类比的语义编码。同时，向量表示在空间旋转时是稳定的，因此向量表示的各个单元不太可能包含语义信息。

The second property is concerned with the stability of neural networks with respect to small perturbations to their inputs. Consider a state-of-the-art deep neural network that generalizes well on an object recognition task. We expect such network to be robust to small perturbations of its input, because small perturbation cannot change the object category of an image. However, we find that applying an imperceptible non-random perturbation to a test image, it is possible to arbitrarily change the network’s prediction (see figure 5). These perturbations are found by optimizing the input to maximize the prediction error. We term the so perturbed examples “adversarial examples”.

第二个性质与神经网络对输入的小扰动的稳定性有关。考虑一个最先进的深度神经网络，它在目标识别任务中具有很好的泛化能力。我们期望这样的网络对其输入的小扰动具有鲁棒性，因为小扰动不能改变图像的对象类别。然而，我们发现对测试图像应用不可察觉的非随机扰动，可以任意改变网络的预测（见图5）。这些扰动是通过优化输入以使预测误差最大化来发现的。我们把这些令人不安的样本称为“对抗样本”。

It is natural to expect that the precise configuration of the minimal necessary perturbations is a random artifact of the normal variability that arises in different runs of backpropagation learning. Yet, we found that adversarial examples are relatively robust, and are shared by neural networks with varied number of layers, activations or trained on different subsets of the training data. That is, if we use one neural net to generate a set of adversarial examples, we find that these examples are still statistically hard for another neural network even when it was trained with different hyperparameters or, most surprisingly, when it was trained on a different set of examples.

很自然地，期望最小必要扰动的精确配置是在不同的反向传播学习过程中产生的正常变化的随机伪影。然而，我们发现对抗样本是相对健壮的，并且由具有不同层数、激活或训练在不同训练数据子集上的神经网络共享。也就是说，如果我们使用一个神经网络来生成一组对抗样本，我们发现对于另一个神经网络来说，即使它是用不同的超参数训练的，或者，最令人惊讶的是，当它是用不同的样本集训练时，这些样本在统计学上仍然是困难的。

These results suggest that the deep neural networks that are learned by backpropagation have nonintuitive characteristics and intrinsic blind spots, whose structure is connected to the data distribution in a non-obvious way.

这些结果表明，反向传播学习的深度神经网络具有非直观的特性和内在的盲点，其结构与数据分布的关系并不明显。

2. Framework

Notation We denote by $x\in R^{m}$ an input image, and $\phi(x)$ activation values of some layer. We first examine properties of the image of $\phi(x)$ , and then we search for its blind spots.

We perform a number of experiments on a few different networks and three datasets :

For the MNIST dataset, we used the following architectures [11]
- – A simple fully connected network with one or more hidden layers and a Softmax classifier. We refer to this network as “FC”.
- – A classifier trained on top of an autoencoder. We refer to this network as “AE”.
The ImageNet dataset [3].
- – Krizhevsky et. al architecture [9]. We refer to it as “AlexNet”.
∼ 10M image samples from Youtube (see [10])
- – Unsupervised trained network with ∼ 1 billion learnable parameters. We refer to it as “QuocNet”.

For the MNIST experiments, we use regularization with a weight decay of λ. Moreover, in some experiments we split the MNIST training dataset into two disjoint datasets P1, and P2, each with 30000 training cases.

符号说明：我们用 $x\in R^{m}$ 表示输入图像，以及某一层的 $\phi(x)$ 激活值。我们首先研究了 $\phi(x)$ 的图像的性质，然后寻找其盲点。

我们在几个不同的网络和三个数据集上进行了许多实验：

对于MNIST数据集，我们使用了以下架构[11]
- 具有一个或多个隐含层和Softmax分类器的简单全连接网络。我们称这个网络为“FC”。
- 在自动编码器上训练的分类器。我们称这个网络为“AE”。
ImageNet数据集[3]
- Krizhevsky等人的结构[9]。我们称之为“AlexNet”。
来自Youtube的近1000万张图片样本（见[10]）
- 具有10亿可学习参数的无监督训练网络。我们称之为“QuocNet”。

对于MNIST实验，我们使用权重衰减为λ的正则化。此外，在一些实验中，我们将MNIST训练集分成两个不相交的数据集P1和P2，每个数据集有30000个训练案例。

3. Units of: $\phi(x)$

Traditional computer vision systems rely on feature extraction: often a single feature is easily interpretable, e.g. a histogram of colors, or quantized local derivatives. This allows one to inspect the individual coordinates of the feature space, and link them back to meaningful variations in the input domain. Similar reasoning was used in previous work that attempted to analyze neural networks that were applied to computer vision problems. These works interpret an activation of a hidden unit as a meaningful feature. They look for input images which maximize the activation value of this single feature [6, 13, 7, 4].

传统的计算机视觉系统依赖于特征提取：通常一个特征很容易被解释，例如颜色直方图或量化的局部导数。这允许用户检查特征空间的各个坐标，并将它们链接回输入域中有意义的变化。类似的推理在先前的工作中被用来分析应用于计算机视觉问题的神经网络。这些作品将一个隐藏单元的激活解释为一个有意义的特征。他们寻找输入图像，使这个单一特征的激活值最大化[6，13，7，4]。

The aforementioned technique can be formally stated as visual inspection of images $x\prime$ , which satisfy (or are close to maximum attainable value):

上述技术可以正式表述为图像 $x\prime$ 的视觉检查，其满足（或接近最大可达到值）：

where I is a held-out set of images from the data distribution that the network was not trained on and $e_{i}$ is the natural basis vector associated with the $i-th$ hidden unit.

其中，Ι 是来自数据分布的固定图像集，该数据分布是网络未训练的， $e_{i}$ 是与第i个隐藏单元相关联的自然基向量。

Our experiments show that any random direction $v\in R^{n}$ gives rise to similarly interpretable semantic properties. More formally, we find that images $x\prime$ are semantically related to each other, for many $x\prime$ such that

我们的实验表明，任何随机方向 $v\in R^{n}$ 都会产生类似的可解释性质。更正式地说，我们发现图像 $x\prime$ 在语义上相互关联，对于许多 $x\prime$ 来说

This suggests that the natural basis is not better than a random basis for inspecting the properties of $\phi(x)$ . This puts into question the notion that neural networks disentangle variation factors across coordinates.

这说明用自然基检验 $\phi(x)$ 的性质并不比用随机基好。这就对神经网络将坐标上的变化因素分离的观点提出了质疑。(注：<,>表示向量内积)

First, we evaluated the above claim using a convolutional neural network trained on MNIST. We used the MNIST test set for I. Figure 1 shows images that maximize the activations in the natural basis, and Figure 2 shows images that maximize the activation in random directions. In both cases the resulting images share many high-level similarities.

首先，我们使用在MNIST上训练的卷积神经网络来评估上述声明。我们使用了 I 的MNIST测试集。图1显示了在自然基上最大化激活值的图像，图2显示了在随机方向上最大化激活值的图像。在这两种情况下，得到的图像具有许多高层次的相似性。(注：Fig. 1、2中的stroke指的是不同的笔画)

Next, we repeated our experiment on an AlexNet, where we used the validation set as I. Figures 3 and 4 compare the natural basis to the random basis on the trained network. The rows appear to be semantically meaningful for both the single unit and the combination of units.

接下来，我们在AlexNet上重复了我们的实验，我们使用验证集作为 I。图3和图4比较了训练网络上的自然基和随机基。这些行对单个单元和单元组合似乎都有语义意义。

Although such analysis gives insight on the capacity of $\phi$ to generate invariance on a particular subset of the input distribution, it does not explain the behavior on the rest of its domain. We shall see in the next section that $\phi$ has counterintuitive properties in the neighbourhood of almost every point form data distribution.

尽管这种分析揭示了 $\phi$ 在输入分布的特定子集上产生不变性的能力，但它并不能解释其在域的其余部分的行为。我们将在下一节中看到，在几乎每个点的邻域中， $\phi$ 都具有反直观的性质。

4. Blind Spots in Neural Networks

So far, unit-level inspection methods had relatively little utility beyond confirming certain intuitions regarding the complexity of the representations learned by a deep neural network [6, 13, 7, 4]. Global, network level inspection methods can be useful in the context of explaining classification decisions made by a model [1] and can be used to, for instance, identify the parts of the input which led to a correct classification of a given visual input instance (in other words, one can use a trained model for weakly-supervised localization). Such global analyses are useful in that they can make us understand better the input-to-output mapping represented by the trained network.

到目前为止，单元级的检查方法除了证实关于由深度神经网络学习的表示的复杂性的某些直觉之外，其效用相对较小[6，13，7，4]。全局的、网络级的检查方法在解释由模型[1]做出的分类决策的上下文中是有用的，并且可以用于例如识别导致给定视觉输入实例正确分类的输入部分（换句话说，可以使用经过训练的模型进行弱监督定位）。这种全局分析是有用的，因为它们可以使我们更好地理解由训练的网络表示的输入-输出映射。

Generally speaking, the output layer unit of a neural network is a highly nonlinear function of its input. When it is trained with the cross-entropy loss (using the Softmax activation function), it represents a conditional distribution of the label given the input (and the training set presented so far). It has been argued [2] that the deep stack of non-linear layers in between the input and the output unit of a neural network are a way for the model to encode a non-local generalization prior over the input space. In other words, it is assumed that is possible for the output unit to assign nonsignificant (and, presumably, non-epsilon) probabilities to regions of the input space that contain no training examples in their vicinity. Such regions can represent, for instance, the same objects from different viewpoints, which are relatively far (in pixel space), but which share nonetheless both the label and the statistical structure of the original inputs.

一般来说，神经网络的输出层单元是其输入的高度非线性函数。当使用交叉熵损失（使用Softmax激活函数）进行训练时，它表示给定输入的标签的条件分布（以及迄今为止呈现的训练集）。有人认为[2]在神经网络的输入和输出单元之间的非线性层的深度堆栈是模型在输入空间上对非局部泛化进行编码的一种方式。换言之，假设输出单元可以将非显著（并且，假设为非epsilon）概率分配给输入空间中在其附近不包含训练样本的区域。例如，这些区域可以表示来自不同视点的相同对象，这些对象相对较远（在像素空间中），但是它们共享原始输入的标签和统计结构。

It is implicit in such arguments that $local$ generalization—in the very proximity of the training examples—works as expected. And that in particular, for a small enough radius $\varepsilon >0$ in the vicinity of a given training input $x$ , an $x + r$ satisfying $\lVert r \rVert < \varepsilon$ will get assigned a high probability of the correct class by the model. This kind of smoothness prior is typically valid for computer vision problems. In general, imperceptibly tiny perturbations of a given image do not normally change the underlying class.

在这样的论点中，隐含着局部泛化——训练样本的极近邻——如预期的那样工作。特别是，对于给定训练输入x附近的足够小的半径 $\varepsilon >0$ ，满足 $\lVert r \rVert < \varepsilon$ 的 $x + r$ 将被该模型赋予正确类的高概率。这种平滑先验对于计算机视觉问题是典型有效的。一般来说，给定图像的细微扰动通常不会改变底层类。

Our main result is that for deep neural networks, the smoothness assumption that underlies many kernel methods does not hold. Specifically, we show that by using a simple optimization procedure, we are able to find adversarial examples, which are obtained by imperceptibly small perturbations to a correctly classified input image, so that it is no longer classified correctly.

我们的主要结果是，对于深度神经网络，许多核方法所依据的光滑性假设不成立。具体地说，通过使用一个简单的优化过程，我们能够找到对抗样本，这些样本是通过对正确分类的输入图像进行不可察觉的小扰动而获得的，因此它不再被正确分类。

In some sense, what we describe is a way to traverse the manifold represented by the network in an efficient way (by optimization) and finding adversarial examples in the input space. The adversarial examples represent low-probability (high-dimensional) “pockets” in the manifold, which are hard to efficiently find by simply randomly sampling the input around a given example. Already, a variety of recent state of the art computer vision models employ input deformations during training for increasing the robustness and convergence speed of the models [9, 13]. These deformations are, however, statistically inefficient, for a given example: they are highly correlated and are drawn from the same distribution throughout the entire training of the model. We propose a scheme to make this process adaptive in a way that exploits the model and its deficiencies in modeling the local space around the training data.

从某种意义上讲，我们所描述的是一种以有效的方式（通过优化）遍历由网络表示的流形并在输入空间中找到对抗样本的方法。对抗样本代表流形中的低概率（高维）“pockets”，仅通过随机抽样给定样本周围的输入很难有效地找到。目前，许多最新的计算机视觉模型在训练过程中采用输入变形来提高模型的鲁棒性和收敛速度[9，13]。然而，对于给定的样本，这些变形在统计上是样本无效的：它们高度相关，并且在整个模型训练过程中从相同的分布中提取。我们提出了一个方案，自适应地利用模型和它的不足对训练数据周围的局部空间进行建模。

We make the connection with hard-negative mining explicitly, as it is close in spirit: hard-negative mining, in computer vision, consists of identifying training set examples (or portions thereof) which are given low probabilities by the model, but which should be high probability instead, cf. [5]. The training set distribution is then changed to emphasize such hard negatives and a further round of model training is performed. As shall be described, the optimization problem proposed in this work can also be used in a constructive way, similar to the hard-negative mining principle.

我们明确地与困难-负样本挖掘联系在一起，因为它在内核上是紧密的：在计算机视觉中，困难-负样本挖掘包括识别训练集实例（或其中的一部分），这些实例由模型给出低概率，但应该是高概率的，参见[5]。然后改变训练集的分布以强调这些困难负样本，并进行下一轮的模型训练。如前所述，本工作中提出的优化问题也可以作为类似于困难-负样本挖掘原理的建设性方式使用。

4.1. Formal description

We denote by $f : R^{m}\to \{1 . . . k\}$ a classifier mapping image pixel value vectors to a discrete label set. We also assume that f has an associated continuous loss function denoted by loss $f : R^{m}\times\{1 . . . k\}\to R^{+}$ . For a given $x \in R^{m}$ image and target label $l \in \{ 1...k\}$ , we aim to solve the following box-constrained optimization problem:

Minimize $\lVert r \rVert_{2}$ subject to:

$f(x+r)=l$
$x + r \in [0, 1]^{m}$

The minimizer $r$ might not be unique, but we denote one such $x + r$ for an arbitrarily chosen minimizer by $D(x, l)$ . Informally, $x + r$ is the closest image to $x$ classified as $l ~by~ f$ . Obviously, $D(x, f(x)) = f(x)$ , so this task is non-trivial only if $f(x)\neq l$ . In general, the exact computation of $D(x, l)$ is a hard problem, so we approximate it by using a box-constrained L-BFGS. Concretely, we find an approximation of $D(x, l)$ by performing line-search to find the minimum c > 0 for which the minimizer r of the following problem satisfies $f(x + r) = l$ .

Minimize $c\lvert r\rvert+loss_{f}(x+r,l)$ subject to $x + r \in [0, 1]^{m}$

This penalty function method would yield the exact solution for $D(X, l)$ in the case of convex losses, however neural networks are non-convex in general, so we end up with an approximation in this case.

我们用 $f : R^{m}\to \{1 . . . k\}$ 表示一种将图像像素值向量映射到离散标签集的分类器。我们还假设f有一个关联的连续损失函数，用loss $f : R^{m}\times\{1 . . . k\}\to R^{+}$ 。对于给定的 $x \in R^{m}$ 图像和目标标签 $l \in \{ 1...k\}$ ，我们的目标是解决以下边界约束优化问题：

Minimize $\lVert r \rVert_{2}$ subject to:

$f(x+r)=l$
$x + r \in [0, 1]^{m}$

极小元 $r$ 可能不是唯一的，但是我们用 $D(x,l)$ 来表示任意选择的极小元得到的 $x+r$ 。非正式地说， $x+r$ 是最接近 $x$ 的图像，被 $f$ 分类为 $l$ 。显然， $D(x,f(x))=f(x)$ ，所以只有当 $f(x)\neq l$ 时，这个任务才是非平凡的。一般来说， $D(x,l)$ 的精确计算是一个困难的问题，所以我们用一个边界约束的L-BFGS来近似它。具体地说，我们通过执行线搜索找到 $D(x,l)$ 的一个近似值，以找到下列问题的最小元 $r$ 满足 $f(x+r)=l$ 的最小值 $c>0$ 。

Minimize $c\lvert r\rvert+loss_{f}(x+r,l)$ subject to $x + r \in [0, 1]^{m}$

这种罚函数法在有凸损失的情况下可以得到D(X,l)的精确解，但神经网络一般是非凸的，因此在这种情况下我们得到一个近似解。

4.2. Experimental results

Our “minimimum distortion” function $D$ has the following intriguing properties which we will support by informal evidence and quantitative experiments in this section:

我们的“最小失真”函数D具有以下有趣的特性，我们将通过本节中的非正式证据和定量实验予以支持：

For all the networks we studied (MNIST, QuocNet [10], AlexNet [9]), for each sample, we have always managed to generate very close, visually hard to distinguish, adversarial examples that are misclassified by the original network (see figure 5 and http://goo.gl/huaGPb for examples).
Cross model generalization: a relatively large fraction of examples will be misclassified by networks trained from scratch with different hyper-parameters (number of layers, regularization or initial weights).
Cross training-set generalization a relatively large fraction of examples will be misclassified by networks trained from scratch on a disjoint training set.

对于我们研究的所有网络（MNIST、QuocNet[10]、AlexNet[9]），对于每个样本，我们总是设法生成非常接近的、视觉上难以区分的、由原始网络错误分类的对抗样本（样本参见图5和http://goo.gl/huaGPb）。
跨模型泛化：使用不同超参数（层数、规则化或初始权值）从头训练的网络，会对相当大一部分样本进行错误分类。
交叉训练集泛化：在一个不相交的训练集上，相对较大比例的样本将被从零开始训练的网络误分类。

The above observations suggest that adversarial examples are somewhat universal and not just the results of overfitting to a particular model or to the specific selection of the training set. They also suggest that back-feeding adversarial examples to training might improve generalization of the resulting models. Our preliminary experiments have yielded positive evidence on MNIST to support this hypothesis as well: We have successfully trained a two layer 100-100-10 non-convolutional neural network with a test error below $1.2\%$ by keeping a pool of adversarial examples a random subset of which is continuously replaced by newly generated adversarial examples and which is mixed into the original training set all the time. We used weight decay, but no dropout for this network. For comparison, a network of this size gets to $1.6\%$ errors when regularized by weight decay alone and can be improved to around $1.3\%$ by using carefully applied dropout. A subtle, but essential detail is that we only got improvements by generating adversarial examples for each layer outputs which were used to train all the layers above. The network was trained in an alternating fashion, maintaining and updating a pool of adversarial examples for each layer separately in addition to the original training set. According to our initial observations, adversarial examples for the higher layers seemed to be significantly more useful than those on the input or lower layers. In our future work, we plan to compare these effects in a systematic manner.

上述观察结果表明，对抗样本有些普遍性，而不仅仅是对特定模型或训练集的特定选择过拟合的结果。他们还建议，将对抗样本反馈到训练中，可能会提高结果模型的通用性。我们的初步实验也为MNIST提供了支持这一假设的积极证据：我们成功地训练了一个两层的100-100-10非卷积神经网络，其测试误差小于1.2%，方法是将一组随机的对抗样本不断地替换为新生成的对抗样本，它们一直被混入原始训练集。我们用了权重衰减，但这个网络没有随机失活。相比之下，这种大小的网络仅通过权重衰减就可以得到1.6%的误差，并且通过使用谨慎应用的随机失活可以增强到1.3%左右。一个微妙但重要的细节是，我们只有通过为每一层输出生成对抗样本来改进，这些输出用于训练上面的所有层。该网络以交替的方式进行训练，除了最初的训练集外，还为每一层分别维护和更新一组对抗样本。根据我们的初步观察，高层次的对抗样本似乎比输入或低层次的对抗样本更有用。在今后的工作中，我们计划系统地比较这些影响。

For space considerations, we just present results for a representative subset (see Table 1) of the MNIST experiments we performed. The results presented here are consistent with those on a larger variety of non-convolutional models. For MNIST, we do not have results for convolutional models yet, but our first qualitative experiments with AlexNet gives us reason to believe that convolutional networks may behave similarly as well. Each of our models were trained with L-BFGS until convergence. The first three models are linear classifiers that work on the pixel level with various weight decay parameters $\lambda$ . All our examples use quadratic weight decay on the connection weights: $loss_{decay} = λ\sum \omega_{i}^{2} /k$ added to the total loss, where k is the number of units in the layer. Three of our models are simple linear (softmax) classifier without hidden units $(FC10(\lambda))$ . One of them, $FC10(1)$ , is trained with extremely high $\lambda = 1$ in order to test whether it is still possible to generate adversarial examples in this extreme setting as well.Two other models are a simple sigmoidal neural network with two hidden layers and a classifier. The last model, AE400-10, consists of a single layer sparse autoencoder with sigmoid activations and 400 nodes with a Softmax classifier. This network has been trained until it got very high quality first layer filters and this layer was not fine-tuned. The last column measures the minimum average pixel level distortion necessary to reach 0% accuracy on the training set. The distortion is measure by $\sqrt{\frac{\sum(s_{i}^\prime-x_{i})^{2}}{n}}$ between the original x and distorted x 0 images, where n = 784 is the number of image pixels. The pixel intensities are scaled to be in the range [0, 1].

Table 1: Tests of the generalization of adversarial instances on MNIST.

Model Name

Description

Training error

Test error

Av. min. distortion

$FC10(10^{-4})$

Softmax with $\lambda=10^{-4}$

6.7%

7.4%

0.062

$FC10(10^{-2})$

Softmax with $\lambda=10^{-2}$

10%

9.4%

0.1

$FC10(1)$

Softmax with $\lambda=1$

21.2%

20%

0.14

$FC100-100-10$

Softmax with $\lambda=10^{-5},10^{-5},10^{-6}$

1.64%

0.058

$FC200-200-10$

Softmax with $\lambda=10^{-5},10^{-5},10^{-6}$

1.54%

0.065

$AE400-10$

Autoencoder with Softmax $\lambda=10^{-6}$

0.57%

1.9%

0.086

Table 2: Cross-model generalization of adversarial examples. The columns of the Tables show the error induced by distorted examples fed to the given model. The last column shows average distortion wrt. original training set.

$FC10(10^{-4})$

$FC10(10^{-2})$

$FC10(1)$

$FC100-100-10$

$FC200-200-10$

$AE400-10$

Av. distortion

$FC10(10^{-4})$

100%

11.7%

22.7%

3.9%

2.7%

0.062

$FC10(10^{-2})$

87.1%

100%

35.2%

35.9%

27.3%

9.8%

0.1

$FC10(1)$

71.9%

76.2%

100%

48.1%

47%

34.4%

0.14

$FC100-100-10$

28.9%

13.7%

21.1%

100%

6.6%

0.058

$FC200-200-10$

38.2%

14%

23.8%

20.3%

100%

2.7%

0.065

$AE400-10$

23.4%

16%

24.8%

9.4%

6.6%

100%

0.086

Gaussian noise, stddev=0.1

5.0%

10.1%

18.3%

0.8%

0.1

Gaussian noise, stddev=0.3

15.6%

11.3%

22.7%

4.3%

3.1%

0.3

出于空间考虑，我们只给出了我们进行的MNIST实验的代表子集（见表1）的结果。本文给出的结果与更大范围的非卷积模型的结果是一致的。对于MNIST，我们还没有卷积模型的结果，但是我们对AlexNet的第一个定性实验让我们有理由相信，卷积网络的行为也可能类似。我们的每一个模型都用L-BFGS训练直到收敛。前三个模型是线性分类器，工作在像素级，具有不同的权重衰减参数λ。我们所有的样本都对连接权重使用二次权重衰减：被添加到总损失中，其中k是层中的单元数。我们的三个模型是没有隐藏单元 $(FC10(\lambda))$ 的简单线性（softmax）分类器。其中一个模型FC10(1)是用极高的λ=1进行训练，以测试在这种极端环境下是否仍有可能产生对抗样本。另外两个模型是一个简单的具有两个隐含层和一个分类器的sigmoidal神经网络。最后一个模型AE400-10由一个带sigmoid激活的单层稀疏自动编码器和一个400节点的Softmax分类器组成。这个网络经过训练，直到它得到了非常高质量的第一层滤波器，而这一层并没有微调。最后一列测量在训练集上达到0%精度所需的最小平均像素级失真。失真是通过对原始x和失真x'图像计算 $\sqrt{\frac{\sum(s_{i}^\prime-x_{i})^{2}}{n}}$ 来衡量的，其中n=784是图像像素数。像素强度被缩放到范围[0，1]内。

表1对抗性实例在MNIST上的泛化检验。

Model Name

Description

Training error

Test error

Av. min. distortion

$FC10(10^{-4})$

Softmax with $\lambda=10^{-4}$

6.7%

7.4%

0.062

$FC10(10^{-2})$

Softmax with $\lambda=10^{-2}$

10%

9.4%

0.1

$FC10(1)$

Softmax with $\lambda=1$

21.2%

20%

0.14

$FC100-100-10$

Softmax with $\lambda=10^{-5},10^{-5},10^{-6}$

1.64%

0.058

$FC200-200-10$

Softmax with $\lambda=10^{-5},10^{-5},10^{-6}$

1.54%

0.065

$AE400-10$

Autoencoder with Softmax $\lambda=10^{-6}$

0.57%

1.9%

0.086

表2:对抗性例子的跨模型泛化。表中的列显示了输入到给定模型的畸变示例所引起的误差。最后一列显示平均失真wrt。原始训练集。

$FC10(10^{-4})$

$FC10(10^{-2})$

$FC10(1)$

$FC100-100-10$

$FC200-200-10$

$AE400-10$

Av. distortion

$FC10(10^{-4})$

100%

11.7%

22.7%

3.9%

2.7%

0.062

$FC10(10^{-2})$

87.1%

100%

35.2%

35.9%

27.3%

9.8%

0.1

$FC10(1)$

71.9%

76.2%

100%

48.1%

47%

34.4%

0.14

$FC100-100-10$

28.9%

13.7%

21.1%

100%

6.6%

0.058

$FC200-200-10$

38.2%

14%

23.8%

20.3%

100%

2.7%

0.065

$AE400-10$

23.4%

16%

24.8%

9.4%

6.6%

100%

0.086

Gaussian noise, stddev=0.1

5.0%

10.1%

18.3%

0.8%

0.1

Gaussian noise, stddev=0.3

15.6%

11.3%

22.7%

4.3%

3.1%

0.3

In our first experiment, we generated a set of adversarial instances for a given network and fed these examples for each other network to measure the proportion of misclassified instances. The last column shows the average minimum distortion that was necessary to reach 0% accuracy on the whole training set. The experimental results are presented in Table 2. The columns of Table 2 show the error (proportion of misclassified instances) on the so distorted training sets. The last two rows are given for reference showing the error induced when distorting by the given amounts of Gaussian noise. Note that even the noise with stddev 0.1 is greater than the stddev of our adversarial noise for all but one of the models. Figure 7 shows a visualization of the generated adversarial instances for two of the networks used in this experiment The general conclusion is that adversarial examples tend to stay hard even for models trained with different hyperparameters. Although the autoencoder based version seems most resilient to adversarial examples, it is not fully immune either.

在我们的第一个实验中，我们为一个给定的网络生成了一组对抗样本，并将这些实例反馈给其他各个网络，以测量错误分类实例的比例。最后一列显示了在整个训练集上达到0%精度所需的平均最小失真。实验结果见表2。表2的列显示了扭曲训练集上的错误（错误分类实例的比例）。最后两行作为参考，显示了给定高斯噪声量引起的失真误差。注意，即使是stddev 0.1的噪声也比我们的对抗噪声的stddev大，除了一个模型。图7显示了在这个实验中使用的两个网络生成的对抗样本的可视化。一般的结论是，即使对于使用不同超参数训练的模型，对抗样本也保持困难。尽管基于自编码器的版本看起来对于对抗样本最有弹性，但它也不能完全免疫。

Still, this experiment leaves open the question of dependence over the training set. Does the hardness of the generated examples rely solely on the particular choice of our training set as a sample or does this effect generalize even to models trained on completely different training sets?

尽管如此，这个实验仍然留下了对训练集依赖的问题。生成样本的困难度是否仅依赖于我们作为样本的训练集的特定选择，或者这种效果是否能推广到在完全不同的训练集上训练的模型？

To study cross-training-set generalization, we have partitioned the 60000 MNIST training images into two parts P1 and P2 of size 30000 each and trained three non-convolutional networks with sigmoid activations on them: Two, FC100-100-10 and FC123-456-10, on P1 and FC100-100-10 on P2. The reason we trained two networks for P1 is to study the cumulative effect of changing the hypermarameters and the training sets at the same time. Models FC100-100-10 and FC100-100- 10 share the same hyperparameters: both of them are 100-100-10 networks, while FC123-456-10 has different number of hidden units. In this experiment, we were distorting the elements of the test set rather than the training set. Table 3 summarizes the basic facts about these models. After we generate adversarial examples with 100% error rates with minimum distortion for the test set, we feed these examples to the each of the models. The error for each model is displayed in the corresponding column of the upper part of Table 4. In the last experiment, we magnify the effect of our distortion by using the examples $x + 0.1 \frac{x\prime-x}{\lVert x\prime-x\lVert_{2}}$ rather than x 0 . This magnifies the distortion on average by 40%, from stddev 0.06 to 0.1. The so distorted examples are fed back to each of the models and the error rates are displayed in the lower part of Table 4. The intriguing conclusion is that the adversarial examples remain hard for models trained even on a disjoint training set, although their effectiveness decreases considerably.

为了研究交叉训练集的泛化问题，我们将60000幅MNIST训练图像分成大小分别为30000的P1和P2两部分，训练了3个具有sigmoid激活的非卷积网络：两个P1上的FC100-100-10和FC123-456-10以及P2上的FC100-100-10。为P1训练两个网络的原因是为了研究同时改变超参数和训练集的累积效应。FC100-100-10和FC100-100-10具有相同的超参数：它们都是100-100-10网络，而FC123-456-10具有不同数量的隐藏单元。在这个实验中，我们改变了测试集而不是训练集的元素。表3总结了这些模型的基本事实。在我们为测试集生成错误率为100%且失真最小的对抗样本之后，我们将这些样本提供给每个模型。每个模型的误差显示在表4上部的相应列中。在最后一个实验中，通过使用样本 $x + 0.1 \frac{x\prime-x}{\lVert x\prime-x\lVert_{2}}$ 代替x'，我们放大了失真的影响。这会使失真平均放大40%，从stddev 0.06到0.1。如此改变样本的被反馈到每个模型中，错误率显示在表4的下部。有趣的结论是，对抗样本对于即使在不相交的训练集上训练的模型仍然很难，尽管它们的有效性大大降低。

4.3 Spectral Analysis of Unstability

The previous section showed examples of deep networks resulting from purely supervised training which are unstable with respect to a peculiar form of small perturbations. Independently of their generalisation properties across networks and training sets, the adversarial examples show that there exist small additive perturbations of the input (in Euclidean sense) that produce large perturbations at the output of the last layer. This section describes a simple procedure to measure and control the additive stability of the network by measuring the spectrum of each rectified layer.

上一节展示了由纯监督训练产生的深度网络的例子，这些网络对于特殊形式的小扰动来说是不稳定的。独立于它们在网络和训练集上的泛化特性，这些对抗性的例子表明，输入存在小的加性扰动(欧几里得意义上)，在最后一层的输出产生大的扰动。本节描述了一个简单的程序，通过测量每个校正层的光谱来测量和控制网络的可加性稳定性。

Mathematically, if $\phi(x)$ denotes the output of a network of $K$ layers corresponding to input $x$ and trained parameters $W$ , we write

数学上，如果 $\phi(x)$ 表示与输入和训练参数相对应的层网络的输出，我们写

where $\phi(x)$ denotes the operator mapping layer $k = 1$ to layer $k$ . The unstability of $\phi(x)$ can be explained by inspecting the upper Lipschitz constant of each layer $k = 1 . . . K$ , defined as the constant $L_{k} > 0$ such that

其中 $\phi(x)$ 表示 $k = 1$ 层到 $k$ 层的映射操作符。 $\phi(x)$ 的不稳定性可以通过检查每一层的上Lipschitz常数来解释，定义为该常数

The resulting network thus satsifies $\lVert\phi(x)-\phi(x+r;W_{k})\lVert\leq L_{k}\lVert r \rVert,~with~ L=\prod_{k=1}^{K}L_{k}.$

由此产生的网络可以满足需求 $\lVert\phi(x)-\phi(x+r;W_{k})\lVert\leq L_{k}\lVert r \rVert,~with~ L=\prod_{k=1}^{K}L_{k}.$

A half-rectified layer (both convolutional or fully connected) is defined by the mapping φk(x; Wk, bk) = max(0, Wkx+bk). Let kWk denote the operator norm of W (i.e., its largest singular value). Since the non-linearity ρ(x) = max(0, x) is contractive, i.e. satisfies kρ(x)−ρ(x+r)k ≤ krk for all x, r; it follows that

一个半整流层(无论是卷积的还是完全连接的)是由映射k(x;Wk, bk) = max(0, Wkx+bk)设kWk表示W的算子范数(即W的最大奇异值)。由于非线性贯入(x) = max(0, x)是收缩的，即对所有x, r满足k贯入(x)−贯入(x+r)k≤krk;由此可见,

$\lVert\phi_{k}(x;W_{k})-\phi_{k}(x+r;W_{k})\rVert=\lVert max(0,W_{k} x+b_{k})-max(0,W_{k}(x+r)+b_{k})\lVert\leq\lVert W_{k^{r}}\rVert \leq \lVert W_{k}\lVert\rVert r \rVert,$ and hence $L_{k} \leq \lVert W_{k} \rVert.$ On the other hand, a max-pooling layer $\phi_{k}$ is contractive:

因此，另一方面 $L_{k} \leq \lVert W_{k} \rVert.$ ，最大池层 $\phi_{k}$ 是收缩的:

since its Jacobian is a projection onto a subset of the input coordinates and hence does not expand the gradients. Finally, if $\phi_{k}$ is a contrast-normalization layer

因为它的雅可比矩阵是在输入坐标子集上的投影，因此不会扩展梯度。最后，如果 $\phi_{k}$ 是一个对比度归一化层

one can verify that

我们可以验证一下

for $γ \in [0.5, 1]$ , which corresponds to most common operating regimes.

$γ \in [0.5, 1]$ 对应于最常见的操作体制

It results that a conservative measure of the unstability of the network can be obtained by simply computing the operator norm of each fully connected and convolutional layer. The fully connected case is trivial since the norm is directly given by the largest singular value of the fully connected matrix. Let us describe the convolutional case. If W denotes a generic 4-tensor, implementing a convolutional layer with C input features, D output features, support N × N and spatial stride ∆,

结果表明，只需计算各全连接层和卷积层的算子范数，就可以得到网络不稳定性的保守测度。由于范数由完全连通矩阵的最大奇异值直接给出，因此完全连通情形是平凡的。我们来描述卷积的情况。如果W为通用的4张量，实现一个卷积层，有C个输入特征，D个输出特征，支持N×N，空间步长∆，

where $x_{c}$ denotes the $c-th$ input feature image, and $w_{c,d}$ is the spatial kernel corresponding to input feature $c$ and output feature $d$ , by applying Parseval’s formula we obtain that its operator norm is given by

为输入特征图像，为输入特征和输出特征对应的空间核，应用Parseval公式，得到其算子范数为

where $A(\zeta)$ is a $D\times(C\cdot\Delta^{2})$ matrix whose rows are

矩阵的行在 $A(\zeta)$ 是一个 $D\times(C\cdot\Delta^{2})$ 哪里

and $\widehat{\omega_{c,d}}$ is the 2-D Fourier transform of $\omega_{c,d}:$

Table 5 shows the upper Lipschitz bounds computed from the ImageNet deep convolutional network of [9], using (1). It shows that instabilities can appear as soon as in the first convolutional layer.

These results are consistent with the exsitence of blind spots constructed in the previous section, but they don’t attempt to explain why these examples generalize across different hyperparameters or training sets. We emphasize that we compute upper bounds: large bounds do not automatically translate into existence of adversarial examples; however, small bounds guarantee that no such examples can appear. This suggests a simple regularization of the parameters, consisting in penalizing each upper Lipschitz bound, which might help improve the generalisation error of the networks.

这些结果与前一节中构造的盲点的存在是一致的，但它们并没有试图解释为什么这些样本会在不同的超参数或训练集上泛化。我们强调我们计算上界：大的界不会自动转化为对抗样本的存在；然而，小的界保证不会出现这样的样本。这表明参数的一个简单的正则化，包括惩罚每个上Lipschitz界，这可能有助于改善网络的泛化误差。

5. Discussion

We demonstrated that deep neural networks have counter-intuitive properties both with respect to the semantic meaning of individual units and with respect to their discontinuities. The existence of the adversarial negatives appears to be in contradiction with the network’s ability to achieve high generalization performance. Indeed, if the network can generalize well, how can it be confused by these adversarial negatives, which are indistinguishable from the regular examples? Possible explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near every virtually every test case. However, we don’t have a deep understanding of how often adversarial negatives appears, and thus this issue should be addressed in a future research.

我们证明了深度神经网络在个体的语义和不连续性方面都具有反直觉的性质。对抗性反例的存在似乎与网络实现高泛化性能的能力相矛盾。事实上，如果网络能够很好地概括，它怎么会被这些对抗性反例所迷惑呢？这些反例与常规的样本是无法区分的？可能的解释是，对抗性反例集的概率极低，因此在测试集中从未观察到（或很少观察到），但它是密集的（很像有理数），因此几乎在每个测试用例附近都能发现它。然而，我们对于对抗性反例出现的频率还没有深入的了解，因此这个问题应该在以后的研究中加以解决。

References

[1] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and KlausRobert Muller. How to explain individual classification decisions. ¨ The Journal of Machine Learning Research, 99:1803–1831, 2010.

[2] Yoshua Bengio. Learning deep architectures for ai. Foundations and trends® in Machine Learning, 2(1):1–127, 2009.

[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[4] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal, June 2009. Also presented at the ICML 2009 Workshop on Learning Feature Hierarchies, Montreal, Canada. ´

[5] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

[6] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.

[7] Ian Goodfellow, Quoc Le, Andrew Saxe, Honglak Lee, and Andrew Y Ng. Measuring invariances in deep networks. Advances in neural information processing systems, 22:646–654, 2009.

[8] Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97, 2012.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[10] Quoc V Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff Dean, and Andrew Y Ng. Building high-level features using large scale unsupervised learning. arXiv preprint arXiv:1112.6209, 2011.

[11] Yann LeCun and Corinna Cortes. The mnist database of handwritten digits, 1998.

[12] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[13] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.

PreviousInception-V3论文翻译 Next深度学习模式的对抗攻击

Last updated 4 years ago

Was this helpful?

原文连接:

Abstract

1. Introduction

2. Framework

3. Units of: ϕ(x)\phi(x)ϕ(x)

4. Blind Spots in Neural Networks

4.1. Formal description

4.2. Experimental results

4.3 Spectral Analysis of Unstability

5. Discussion

References

3. Units of: $\phi(x)$