Towards Evaluating the Robustness of Neural Networks
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7958570
GB/T 7714: Carlini N, Wagner D. Towards evaluating the robustness of neural networks[C]//2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017: 39-57.
MLA: Carlini, Nicholas, and David Wagner. "Towards evaluating the robustness of neural networks." 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.
APA: Carlini, N., & Wagner, D. (2017, May). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP) (pp. 39-57). IEEE.
Neural networks provide state-of-the-art results for most machine learning tasks. Unfortunately, neural networks are vulnerable to adversarial examples: given an input x and any target classification t, it is possible to find a new input x′ that is similar to x but classified as t. This makes it difficult to apply neural networks in security-critical areas. Defensive distillation is a recently proposed approach that can take an arbitrary neural network and increase its robustness, reducing the success rate of current attacks at finding adversarial examples from 95% to 0.5%.
In this paper, we demonstrate that defensive distillation does not significantly increase the robustness of neural networks by introducing three new attack algorithms that are successful on both distilled and undistilled neural networks with 100% probability. Our attacks are tailored to three distance metrics used previously in the literature, and when compared to previous adversarial example generation algorithms, our attacks are often much more effective (and never worse). Furthermore, we propose using high-confidence adversarial examples in a simple transferability test that, as we show, can also be used to break defensive distillation. We hope our attacks will be used as a benchmark in future defense attempts to create neural networks that resist adversarial examples.
Deep neural networks have become increasingly effective at many difficult machine-learning tasks. In the image recognition domain, they are able to recognize images with near-human accuracy [27], [25]. They are also used for speech recognition [18], natural language processing [1], and playing games [43], [32].
However, researchers have discovered that existing neural networks are vulnerable to attack. Szegedy et al. [46] first noticed the existence of adversarial examples in the image classification domain: it is possible to transform an image by a small amount and thereby change how the image is classified. Often, the total amount of change required can be so small as to be undetectable.
The degree to which attackers can find adversarial examples limits the domains in which neural networks can be used. For example, if we use neural networks in self-driving cars, adversarial examples could allow an attacker to cause the car to take unwanted actions.
The existence of adversarial examples has inspired research on how to harden neural networks against these kinds of attacks. Many early attempts to secure neural networks failed or provided only marginal robustness improvements [15], [2], [20], [42].
Defensive distillation [39] is one such recent defense proposed for hardening neural networks against adversarial examples. Initial analysis proved to be very promising: defensive distillation defeats existing attack algorithms and reduces their success probability from 95% to 0.5%. Defensive distillation can be applied to any feed-forward neural network and only requires a single re-training step, and is currently one of the only defenses giving strong security guarantees against adversarial examples.
In general, there are two different approaches one can take to evaluate the robustness of a neural network: attempt to prove a lower bound, or construct attacks that demonstrate an upper bound. The former approach, while sound, is substantially more difficult to implement in practice, and all attempts have required approximations [2], [21]. On the other hand, if the attacks used in the latter approach are not sufficiently strong and fail often, the upper bound may not be useful.
This case study illustrates the general need for better techniques to evaluate the robustness of neural networks: while distillation was shown to be secure against the current state-of-the-art attacks, it fails against our stronger attacks. Furthermore, when comparing our attacks against the current state-of-the-art on standard unsecured models, our methods generate adversarial examples with less total distortion in every case. We suggest that our attacks are a better baseline for evaluating candidate defenses: before placing any faith in a new possible defense, we suggest that designers at least check whether it can resist our attacks.
We additionally propose using high-confidence adversarial examples to evaluate the robustness of defenses. Transferability [46], [11] is the well-known property that adversarial examples on one model are often also adversarial on another model. We demonstrate that adversarial examples from our attacks are transferable from the unsecured model to the defensively distilled (secured) model. In general, we argue that any defense must demonstrate it is able to break the transferability property.
We evaluate our attacks on three standard datasets: MNIST [28], a digit-recognition task (0-9); CIFAR-10 [24], a small-image recognition task, also with 10 classes; and ImageNet [9], a large-image recognition task with 1000 classes.
Figure 1 shows examples of adversarial examples our techniques generate on defensively distilled networks trained on the MNIST and CIFAR datasets.
In one extreme example for the ImageNet classification task, we can cause the Inception v3 [45] network to incorrectly classify images by changing only the lowest order bit of each pixel. Such changes are impossible to detect visually.
To enable others to more easily use our work to evaluate the robustness of other defenses, all of our adversarial example generation algorithms (along with code to train the models we use, to reproduce the results we present) are available online at http://nicholas.carlini.com/code/nn_robust_attacks.
This paper makes the following contributions:
We introduce three new attacks for the L0, L2, and L∞ distance metrics. Our attacks are significantly more effective than previous approaches. Our L0 attack is the first published attack that can cause targeted misclassification on the ImageNet dataset.
We apply these attacks to defensive distillation and discover that distillation provides little security benefit over un-distilled networks.
We propose using high-confidence adversarial examples in a simple transferability test to evaluate defenses, and show this test breaks defensive distillation.
We systematically evaluate the choice of the objective function for finding adversarial examples, and show that the choice can dramatically impact the efficacy of an attack.
Machine learning is being used in an increasing array of settings to make potentially security-critical decisions: self-driving cars [3], [4], drones [10], robots [33], [22], anomaly detection [6], malware classification [8], [40], [48], speech recognition and recognition of voice commands [17], [13], NLP [1], and many more. Consequently, understanding the security properties of deep learning has become a crucial question in this area. The extent to which we can construct adversarial examples influences the settings in which we may want to (or not want to) use neural networks.
In the speech recognition domain, recent work has shown [5] it is possible to generate audio that sounds like speech to machine learning algorithms but not to humans. This can be used to control users' devices without their knowledge. For example, by playing a video with a hidden voice command, it may be possible to cause a smart phone to visit a malicious webpage and trigger a drive-by download. This work focused on conventional techniques (Gaussian Mixture Models and Hidden Markov Models), but as speech recognition increasingly uses neural networks, the study of adversarial examples becomes relevant in this domain.
In the space of malware classification, the existence of adversarial examples not only limits potential application settings, but entirely defeats the classifier's purpose: an adversary who can make only slight modifications to a malware file so that it remains malware but becomes classified as benign has entirely defeated the malware classifier [8], [14].
Turning back to the threat to self-driving cars introduced earlier, this is not an unrealistic attack: it has been shown that adversarial examples remain effective in the physical world, even after being photographed [26].
We assume in this paper that the adversary has complete access to a neural network, including the architecture and all parameters, and can use this in a white-box manner. This is a conservative and realistic assumption: prior work has shown it is possible to train a substitute model given black-box access to a target model, and by attacking the substitute model, we can then transfer these attacks to the target model [37].
Given these threats, there have been various attempts [15], [2], [20], [42], [39] at constructing defenses that increase the robustness of a neural network, defined as a measure of how easy it is to find adversarial examples that are close to their original input.
In this paper we study one of these, distillation as a defense [39], which hopes to secure an arbitrary neural network. This type of defensive distillation was shown to make generating adversarial examples nearly impossible for existing attack techniques [39]. We find that although the current state-of-the-art fails to find adversarial examples for defensively distilled networks, the stronger attacks we develop in this paper are able to construct adversarial examples.
We use the notation from Papernot et al. [39]: define F to be the full neural network including the softmax function, Z(x) = z to be the output of all layers except the softmax (so z are the logits), and
F(x) = softmax(Z(x)) = y.\tag{1}
A neural network typically consists of layers
F = softmax ◦ F_{n} ◦ F_{n−1} ◦···◦ F_{1}\tag{2}
where
F_{i}(x) = \sigma(\theta_{i}\cdot x) +\widehat{\theta}_{i}\tag{3}
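As a concrete illustration of this notation, the following is a minimal PyTorch sketch of our own (not the paper's released code) of a network that exposes both Z(x), the logits, and F(x) = softmax(Z(x)); the fully-connected layer sizes are arbitrary placeholders rather than the architecture of Table I.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Toy network exposing Z (logits) and F (softmax output), mirroring the notation above."""
    def __init__(self, in_dim=784, hidden=200, classes=10):
        super().__init__()
        # Each F_i is an affine map followed by a ReLU non-linearity.
        self.layers = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, classes),
        )

    def Z(self, x):                      # output of all layers except the softmax (the logits)
        return self.layers(x)

    def F(self, x):                      # full network, including the softmax
        return torch.softmax(self.Z(x), dim=-1)

net = SmallNet()
x = torch.rand(1, 784)                   # a dummy input in [0, 1]^n
print(net.F(x).sum().item())             # the probabilities sum to 1
```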
In a targeted attack, we consider three different approaches for how to choose the target class:
Average Case: select the target class uniformly at random among the labels that are not the correct label.
Best Case: perform the attack against all incorrect classes, and report the target class that was least difficult to attack.
Worst Case: perform the attack against all incorrect classes, and report the target class that was most difficult to attack.
In all of our evaluations we perform all three types of attacks: best-case, average-case, and worst-case. Notice that if a classifier is only accurate 80% of the time, then the best case attack will require a change of 0 in 20% of cases.
On ImageNet, we approximate the best-case and worst-case attack by sampling 100 random target classes out of the 1,000 possible for efficiency reasons.
\lVert \upsilon \rVert_{p}=(\sum\limits_{i=1}^{n}\lvert \upsilon_{i} \rvert^{p})^{\frac{1}{p}}\tag{4}
In more detail:
L2 distance measures the standard Euclidean (root-mean-square) distance between x and x′. The L2 distance can remain small when there are many small changes to many pixels. This distance metric was used in the initial adversarial example work [46].
L∞ distance measures the maximum change to any of the coordinates:
\lVert x-x^{\prime} \rVert_{\infty}=max(\lvert x_{1}-x_{1}^{\prime} \rvert,...,\lvert x_{n}-x_{n}^{\prime} \rvert).\tag{5}
For images, we can imagine there is a maximum budget, and each pixel is allowed to be changed by up to this limit, with no limit on the number of pixels that are modified.
Goodfellow et al. argue that L∞ is the optimal distance metric to use [47] and in a follow-up paper Papernot et al. argue distillation is secure under this distance metric [36].
No distance metric is a perfect measure of human perceptual similarity, and we pass no judgement on exactly which distance metric is optimal. We believe constructing and evaluating a good distance metric is an important research question we leave to future work.
However, since most existing work has picked one of these three distance metrics, and since defensive distillation argued security against two of these, we too use these distance metrics and construct attacks that perform superior to the state-of-the-art for each of these distance metrics.
When reporting all numbers in this paper, we report using the distance metric as defined above, on the range [0, 1]. (That is, changing a pixel in a greyscale image from full-on to full-off will result in an L2 change of 1.0 and an L∞ change of 1.0, not 255.)
We briefly provide a high-level overview of defensive distillation. We provide a complete description later in Section VIII.
To defensively distill a neural network, begin by first training a network with identical architecture on the training data in a standard manner. When we compute the softmax while training this network, replace it with a more-smooth version of the softmax (by dividing the logits by some constant T). At the end of training, generate the soft training labels by evaluating this network on each of the training instances and taking the output labels of the network.
Then, throw out the first network and keep only the soft training labels. With those, train a second network: instead of training it on the original training labels, use the soft labels. This trains the second model to behave like the first model, and the soft labels convey the additional hidden knowledge learned by the first model.
The key insight here is that by training to match the first network, we will hopefully avoid over-fitting against any of the training data. If the reason that adversarial examples exist is that neural networks are highly non-linear and have "blind spots" [46] where adversarial examples lie, then preventing this type of over-fitting might remove those blind spots.
In fact, as we will see later, defensive distillation does not remove adversarial examples. One potential reason this may occur is that others [11] have argued the reason adversarial examples exist is not due to blind spots in a highly non-linear neural network, but due only to the locally-linear nature of neural networks. This so-called linearity hypothesis appears to be true [47], and under this explanation it is perhaps less surprising that distillation does not increase the robustness of neural networks.
The remainder of this paper is structured as follows. In the next section, we survey existing attacks that have been proposed in the literature for generating adversarial examples, for the L2, L∞, and L0 distance metrics. We then describe our attack algorithms that target the same three distance metrics and provide superior results to the prior work. Having developed these attacks, we review defensive distillation in more detail and discuss why the existing attacks fail to find adversarial examples on defensively distilled networks. Finally, we attack defensive distillation with our new algorithms and show that it provides only limited value.
\text{minimize } \lVert x-x^{\prime} \rVert \quad \text{such that } C(x^{\prime}) = l,\; x^{\prime} \in [0, 1]^{n} \tag{6}
This problem can be very difficult to solve, however, so Szegedy et al. instead solve the following problem:
\text{minimize } c \cdot \lVert x-x^{\prime} \rVert_{2}^{2} + \text{loss}_{F,l}(x^{\prime}) \quad \text{such that } x^{\prime} \in [0, 1]^{n} \tag{6}
The fast gradient sign [11] method has two key differences from the L-BFGS method: first, it is optimized for the L∞ distance metric, and second, it is designed primarily to be fast instead of producing very close adversarial examples. Given an image x the fast gradient sign method sets
x^{\prime} = x - \epsilon \cdot \text{sign}(\nabla \text{loss}_{F,t}(x))\tag{7}
where ε is chosen to be sufficiently small so as to be undetectable, and t is the target label. Intuitively, for each pixel, the fast gradient sign method uses the gradient of the loss function to determine in which direction the pixel's intensity should be changed (increased or decreased) to minimize the loss function; then, it moves all pixels simultaneously.
It is important to note that the fast gradient sign attack was designed to be fast, rather than optimal. It is not meant to produce the minimal adversarial perturbations.
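As a concrete reading of the update above, here is a minimal PyTorch sketch of a single targeted fast gradient sign step; it is our own illustration, where model is assumed to return logits, t is a tensor of target class indices, and eps is the step size.

```python
import torch
import torch.nn.functional as F_nn

def fgsm_targeted(model, x, t, eps):
    """One targeted fast gradient sign step: x' = x - eps * sign(grad of loss_{F,t}(x))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F_nn.cross_entropy(model(x), t)   # cross-entropy loss toward the target label t
    loss.backward()
    x_adv = x - eps * x.grad.sign()          # step that decreases the loss w.r.t. the target
    return x_adv.clamp(0.0, 1.0).detach()    # keep the result a valid image in [0, 1]
```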
x_{0}^{\prime} = 0\tag{8}
and then on each iteration
x_{i}^{\prime} = x_{i-1}^{\prime} - \text{clip}_{\epsilon}(\alpha \cdot \text{sign}(\nabla \text{loss}_{F,t}(x_{i-1}^{\prime})))\tag{9}
Iterative gradient sign was found to produce superior results to fast gradient sign [26].
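A corresponding hedged sketch of the iterative variant follows; here clip_ε is read as keeping the accumulated perturbation inside an ε-ball around the original image, and alpha and steps are assumed parameters.

```python
import torch
import torch.nn.functional as F_nn

def iterative_gradient_sign(model, x, t, eps, alpha, steps=10):
    """Iterative variant of Eq. (9): repeated small signed steps, kept within an eps-ball of x."""
    x = x.detach()
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F_nn.cross_entropy(model(x_adv), t)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() - alpha * grad.sign()
        # keep the accumulated perturbation within the L-infinity budget eps, and stay in [0, 1]
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0.0, 1.0)
    return x_adv
```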
Papernot et al. introduced an attack optimized under L0 distance [38] known as the Jacobian-based Saliency Map Attack (JSMA). We give a brief summary of their attack algorithm; for a complete description and motivation, we encourage the reader to read their original paper [38].
In more detail, we begin by defining the saliency map in terms of a pair of pixels p, q. Define
\alpha_{pq}=\sum\limits_{i\in\{p,q\}}\frac{\partial Z(x)_{t}}{\partial x_{i}}\tag{10}
\beta_{pq}=\Big(\sum\limits_{i\in\{p,q\}}\sum\limits_{j}\frac{\partial Z(x)_{j}}{\partial x_{i}}\Big)-\alpha_{pq}\tag{10}
(p^{∗}, q^{∗}) = argmax (−\alpha_{pq} · \beta_{pq}) · (\alpha_{pq} > 0) · (\beta_{pq} < 0)\tag{11}
so that αpq > 0 (the target class is more likely), βpq < 0 (the other classes become less likely), and −αpq · βpq is largest.
Notice that JSMA uses the output of the second-to-last layer Z, the logits, in the calculation of the gradient: the output of the softmax F is not used. We refer to this as the JSMA-Z attack.
However, when the authors apply this attack to their defensively distilled networks, they modify the attack so it uses F instead of Z. In other words, their computation uses the output of the softmax (F) instead of the logits (Z). We refer to this modification as the JSMA-F attack.
When an image has multiple color channels (e.g., RGB), this attack considers the L0 difference to be 1 for each color channel changed independently (so that if all three color channels of one pixel change, the L0 norm would be 3). While we do not believe this is a meaningful threat model, when comparing to this attack, we evaluate under both models.
Deepfool [34] is an untargeted attack technique optimized for the L2 distance metric. It is efficient and produces closer adversarial examples than the L-BFGS approach discussed earlier.
The authors construct Deepfool by imagining that the neural networks are totally linear, with a hyperplane separating each class from another. From this, they analytically derive the optimal solution to this simplified problem, and construct the adversarial example.
Then, since neural networks are not actually linear, they take a step towards that solution, and repeat the process a second time. The search terminates when a true adversarial example is found.
The exact formulation used is rather sophisticated; interested readers should refer to the original work [34].
Before we develop our attack algorithms to break distillation, we describe how we train the models on which we will evaluate our attacks.
| Layer Type | MNIST Model | CIFAR Model |
| --- | --- | --- |
| Convolution + ReLU | 3×3×32 | 3×3×64 |
| Convolution + ReLU | 3×3×32 | 3×3×64 |
| Max Pooling | 2×2 | 2×2 |
| Convolution + ReLU | 3×3×64 | 3×3×128 |
| Convolution + ReLU | 3×3×64 | 3×3×128 |
| Max Pooling | 2×2 | 2×2 |
| Fully Connected + ReLU | 200 | 256 |
| Fully Connected + ReLU | 200 | 256 |
| Softmax | 10 | 10 |

Table I: Model architectures for the MNIST and CIFAR models. This architecture is identical to that of the original defensive distillation work [39].
| Parameter | MNIST Model | CIFAR Model |
| --- | --- | --- |
| Learning Rate | 0.1 | 0.01 (decay 0.5) |
| Momentum | 0.9 | 0.9 (decay 0.5) |
| Delay Rate | - | 10 epochs |
| Dropout | 0.5 | 0.5 |
| Batch Size | 128 | 128 |
| Epochs | 50 | 50 |

Table II: Model parameters for the MNIST and CIFAR models. These parameters are identical to those of the original defensive distillation work [39].
We train two networks for the MNIST [28] and CIFAR-10 [24] classification tasks, and use one pre-trained network for the ImageNet classification task [41]. Our models and training approaches are identical to those presented in [39]. We achieve 99.5% accuracy on MNIST, comparable to the state of the art. On CIFAR-10, we achieve 80% accuracy, identical to the accuracy given in the distillation work.
MNIST and CIFAR-10. The model architecture is given in Table I and the hyperparameters selected in Table II. We use a momentum-based SGD optimizer during training.
The CIFAR-10 model significantly overfits the training data even with dropout: we obtain a final training cross-entropy loss of 0.05 with accuracy 98%, compared to a validation loss of 1.2 with validation accuracy 80%. We do not alter the network by performing image augmentation or adding additional dropout as that was not done in [39].
ImageNet. Along with considering MNIST and CIFAR, which are both relatively small datasets, we also consider the ImageNet dataset. Instead of training our own ImageNet model, we use the pre-trained Inception v3 network [45], which achieves 96% top-5 accuracy (that is, the probability that the correct class is one of the five most likely as reported by the network is 96%). Inception takes images as 299×299×3 dimensional vectors.
We now turn to our approach for constructing adversarial examples. To begin, we rely on the initial formulation of adversarial examples [46] and formally define the problem of finding an adversarial instance for an image x as follows:
\text{minimize } D(x, x + \delta) \quad \text{such that } C(x + \delta) = t,\; x + \delta \in [0, 1]^{n}\tag{12}
where x is fixed, and the goal is to find δ that minimizes D(x, x+δ). That is, we want to find some small change δ that we can make to an image x that will change its classification, but so that the result is still a valid image. Here D is some distance metric; for us, it will be either L0, L2, or L∞ as discussed earlier.
We solve this problem by formulating it as an appropriate optimization instance that can be solved by existing optimization algorithms. There are many possible ways to do this; we explore the space of formulations and empirically identify which ones lead to the most effective attacks.
The above formulation is difficult for existing algorithms to solve directly, as the constraint C(x + δ) = t is highly non-linear. Therefore, we express it in a different form that is better suited for optimization. We define an objective function f such that C(x + δ) = t if and only if f(x + δ) ≤ 0. There are many possible choices for f:
f_{1}(x^{\prime})=-\text{loss}_{F,t}(x^{\prime})+1\\f_{2}(x^{\prime})=(\max\limits_{i\neq t}(F(x^{\prime})_{i})-F(x^{\prime})_{t})^{+}\\f_{3}(x^{\prime})=\text{softplus}(\max\limits_{i\neq t}(F(x^{\prime})_{i})-F(x^{\prime})_{t})-\log(2)\\f_{4}(x^{\prime})=(0.5-F(x^{\prime})_{t})^{+}\\f_{5}(x^{\prime})=-\log(2F(x^{\prime})_{t}-2)\\f_{6}(x^{\prime})=(\max\limits_{i\neq t}(Z(x^{\prime})_{i})-Z(x^{\prime})_{t})^{+}\\f_{7}(x^{\prime})=\text{softplus}(\max\limits_{i\neq t}(Z(x^{\prime})_{i})-Z(x^{\prime})_{t})-\log(2)\tag{13}
where s is the correct classification, (e)+ is short-hand for max(e, 0), softplus(x) = log(1 + exp(x)), and lossF,s(x) is the cross entropy loss for x.
Notice that we have adjusted some of the above formula by adding a constant; we have done this only so that the function respects our definition. This does not impact the final result, as it just scales the minimization function.
Now, instead of formulating the problem as minimizing D(x, x + δ) subject to the constraint f(x + δ) ≤ 0 (together with the box constraint), we use the alternative formulation:

\text{minimize } D(x, x + \delta) + c \cdot f(x + \delta) \quad \text{such that } x + \delta \in [0, 1]^{n}\tag{14}
where c > 0 is a suitably chosen constant. These two are equivalent, in the sense that there exists c > 0 such that the optimal solution to the latter matches the optimal solution to the former. After instantiating the distance metric D with an lp norm, the problem becomes: given x, find δ that solves
\text{minimize } \lVert \delta \rVert_{p} + c \cdot f(x + \delta) \quad \text{such that } x + \delta \in [0, 1]^{n}\tag{15}
Choosing the constant c.
Empirically, we have found that often the best way to choose c is to use the smallest value of c for which the resulting solution x∗ has f(x∗) ≤ 0. This causes gradient descent to minimize both of the terms simultaneously instead of picking only one to optimize over first.
We verify this by running our f6 formulation (which we found most effective) for values of c spaced uniformly (on a log scale) from c = 0.01 to c = 100 on the MNIST dataset. We plot this line in Figure 2.
Further, we have found that if we choose the smallest c such that f(x∗) ≤ 0, the solution is within 5% of optimal 70% of the time, and within 30% of optimal 98% of the time, where "optimal" refers to the solution found using the best value of c. Therefore, in our implementations we use modified binary search to choose c.
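The following is a sketch of that modified binary search; attack_with_c is a hypothetical callable that runs the gradient descent attack at a fixed constant c and returns whether it succeeded (f(x*) ≤ 0), together with the example found and its distance.

```python
def search_constant_c(attack_with_c, c_start=1e-3, c_max=1e10, steps=20):
    """Find a small constant c for which the attack succeeds, then refine it by binary search."""
    best = None
    c = c_start
    # Grow c geometrically until the attack succeeds at all.
    while c <= c_max:
        ok, x_adv, dist = attack_with_c(c)
        if ok:
            best = (x_adv, dist)
            break
        c *= 10
    if best is None:
        return None                        # the attack failed for every constant tried
    lo, hi = c / 10, c                     # bracket: last failing and first succeeding constants
    for _ in range(steps):
        mid = (lo + hi) / 2
        ok, x_adv, dist = attack_with_c(mid)
        if ok:
            hi = mid
            if dist < best[1]:
                best = (x_adv, dist)       # keep the closest adversarial example seen
        else:
            lo = mid
    return best
```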
To ensure the modification yields a valid image, we have a constraint on δ: we must have 0 ≤ xi + δi ≤ 1 for all i. In the optimization literature, this is known as a "box constraint." Previous work uses a particular optimization algorithm, L-BFGS-B, which supports box constraints natively.
We investigate three different methods of approaching this problem.
Projected gradient descent performs one step of standard gradient descent, and then clips all the coordinates to be within the box. This approach can work poorly for gradient descent approaches that have a complicated update step (for example, those with momentum): when we clip the actual xi, we unexpectedly change the input to the next iteration of the algorithm.
Clipped gradient descent does not clip xi on each iteration; rather, it incorporates the clipping into the objective function to be minimized. In other words, we replace f(x + δ) with f(min(max(x + δ, 0), 1)), with the min and max taken component-wise. While solving the main issue with projected gradient descent, clipping introduces a new problem: the algorithm can get stuck in a flat spot where it has increased some component xi to be substantially larger than the maximum allowed. When this happens, the partial derivative becomes zero, so even if some improvement is possible by later reducing xi, gradient descent has no way to detect this.
Change of variables: we introduce a new variable w and, instead of optimizing over the variable δ defined above, apply a change of variables and optimize over w, setting δi = (tanh(wi) + 1)/2 − xi. Since −1 ≤ tanh(wi) ≤ 1, it follows that 0 ≤ xi + δi ≤ 1, so the solution is automatically valid. We can think of this approach as a smoothed version of clipped gradient descent that eliminates the problem of getting stuck in extreme regions.
These methods allow us to use other optimization algorithms that don’t natively support box constraints. We use the Adam [23] optimizer almost exclusively, as we have found it to be the most effective at quickly finding adversarial examples. We tried three solvers — standard gradient descent, gradient descent with momentum, and Adam — and all three produced identical-quality solutions. However, Adam converges substantially more quickly than the others.
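The three ways of handling the box constraint can be summarized in a few lines; this is our own sketch, using PyTorch only for clamp and tanh.

```python
import torch

def project(x_adv):
    """Projected gradient descent: clip all coordinates back into the box after each step."""
    return x_adv.clamp(0.0, 1.0)

def clipped_objective(f, x, delta):
    """Clipped gradient descent: fold the clipping into the objective, f(min(max(x + delta, 0), 1))."""
    return f(torch.clamp(x + delta, 0.0, 1.0))

def change_of_variables(w, x):
    """Change of variables: optimize over w, with x + delta = 0.5 * (tanh(w) + 1) always in [0, 1]."""
    x_new = 0.5 * (torch.tanh(w) + 1.0)
    return x_new, x_new - x               # (valid image, implied perturbation delta)
```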
For each possible objective function f(·) and method to enforce the box constraint, we evaluate the quality of the adversarial examples found.
To choose the optimal c, we perform 20 iterations of binary search over c. For each selected value of c, we run 10,000 iterations of gradient descent with the Adam optimizer.
The results of this analysis are in Table III. We evaluate the quality of the adversarial examples found on the MNIST and CIFAR datasets. The relative ordering of each objective function is identical between the two datasets, so for brevity we report only results for MNIST.
There is a factor of three difference in quality between the best objective function and the worst. The choice of method for handling box constraints does not impact the quality of results as significantly for the best minimization functions.
In fact, the worst performing objective function, cross entropy loss, is the approach that was most suggested in the literature previously [46], [42].
Why are some loss functions better than others? When c = 0, gradient descent will not make any move away from the initial image. However, a large c often causes the initial steps of gradient descent to perform in an overly-greedy manner, only traveling in the direction which can most easily reduce f and ignoring the D loss — thus causing gradient descent to find sub-optimal solutions.
This means that for loss function f1 and f4, there is no good constant c that is useful throughout the duration of the gradient descent search. Since the constant c weights the relative importance of the distance term and the loss term, in order for a fixed constant c to be useful, the relative value of these two terms should remain approximately equal. This is not the case for these two loss functions.
To explain why this is the case, we take a brief digression to analyze how adversarial examples exist. Consider a valid input x and an adversarial example x′ on a network.
What does it look like as we linearly interpolate from x to x′? That is, let y = αx + (1 − α)x′ for α ∈ [0, 1]. It turns out the value of Z(·)t is mostly linear from the input to the adversarial example, and therefore F(·)t is a logistic. We verify this fact empirically by constructing adversarial examples on the first 1,000 test images on both the MNIST and CIFAR datasets with our approach, and find the Pearson correlation coefficient r > .9.
Given this, consider loss function f4 (the argument for f1 is similar). In order for the gradient descent attack to make any change initially, the constant c will have to be large enough that
\epsilon < c\,(f_{1}(x + \epsilon) - f_{1}(x))\tag{17}
or, equivalently,
1/c < \lvert \nabla f_{1}(x) \rvert\tag{18}
implying that c must be larger than the inverse of the gradient to make progress, but the gradient of f1 is identical to F(·)t so will be tiny around the initial image, meaning c will have to be extremely large.
However, as soon as we leave the immediate vicinity of the initial image, the gradient ∇f1(x + δ) increases at an exponential rate, making the large constant c cause gradient descent to perform in an overly greedy manner.
We verify all of this theory empirically. When we run our attack trying constants chosen from 10^-10 to 10^10, the average constant for loss function f4 was 10^6.
The average gradient of the loss function f1 around the valid image is 2^-20, but 2^-1 at the closest adversarial example. This means c is a million times larger than it has to be, causing the loss functions f4 and f1 to perform worse than any of the others.
We model pixel intensities as a (continuous) real number in the range [0, 1]. However, in a valid image, each pixel intensity must be a (discrete) integer in the range {0, 1, ..., 255}. This additional requirement is not captured in our formulation. In practice, we ignore the integrality constraints, solve the continuous optimization problem, and then round 255(xi + δi) to the nearest integer to obtain the intensity of the i-th pixel.
This rounding will slightly degrade the quality of the adversarial example. If we need to restore the attack quality, we perform greedy search on the lattice defined by the discrete solutions by changing one pixel value at a time. This greedy search never failed for any of our attacks.
\text{minimize } \lVert \tfrac{1}{2}(\tanh(w)+1)-x \rVert_{2}^{2}+c\cdot f(\tfrac{1}{2}(\tanh(w)+1))\tag{18}
with f defined as
f(x^{\prime}) = max(max{Z(x^{\prime})_{i} : i \neq t} − Z(x^{\prime})_{t}, −κ).\tag{19}
This f is based on the best objective function found earlier, modified slightly so that we can control the confidence with which the misclassification occurs by adjusting κ. The parameter κ encourages the solver to find an adversarial instance x that will be classified as class t with high confidence. We set κ = 0 for our attacks but we note here that a side benefit of this formulation is it allows one to control for the desired confidence. This is discussed further in Section VIII-D.
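Putting the pieces together, here is a hedged sketch of this L2 formulation (change of variables, Adam, and the objective f with confidence κ); model is assumed to return the logits Z(x′), t is an integer target class index, and the constant c is fixed here rather than chosen by binary search.

```python
import torch

def cw_l2_attack(model, x, t, c=1.0, kappa=0.0, steps=1000, lr=0.01):
    """Sketch: minimize ||0.5(tanh(w)+1) - x||_2^2 + c * f(0.5(tanh(w)+1)) with Adam over w."""
    x = x.detach()
    # Initialize w so that 0.5 * (tanh(w) + 1) equals the original image x.
    w = torch.atanh((x * 2 - 1).clamp(-0.999999, 0.999999)).requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_new = 0.5 * (torch.tanh(w) + 1.0)                      # always a valid image in [0, 1]
        logits = model(x_new)                                    # Z(x')
        target_logit = logits[:, t]
        others = logits.clone()
        others[:, t] = float("-inf")
        other_max = others.max(dim=1).values                     # max_{i != t} Z(x')_i
        f = torch.clamp(other_max - target_logit, min=-kappa)    # the objective f above
        dist = ((x_new - x) ** 2).flatten(1).sum(dim=1)          # squared L2 distance
        loss = (dist + c * f).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (0.5 * (torch.tanh(w) + 1.0)).detach()
```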
Figure 3 shows this attack applied to our MNIST model for each source digit and target digit. Almost all attacks are visually indistinguishable from the original digit.
A comparable figure (Figure 12) for CIFAR is in the appendix. No attack is visually distinguishable from the baseline image.
Multiple starting-point gradient descent. The main problem with gradient descent is that its greedy search is not guaranteed to find the optimal solution and can become stuck in a local minimum. To remedy this, we pick multiple random starting points close to the original image and run gradient descent from each of those points for a fixed number of iterations. We randomly sample points uniformly from the ball of radius r, where r is the closest adversarial example found so far. Starting from multiple starting points reduces the likelihood that gradient descent gets stuck in a bad local minimum.
The L0 distance metric is non-differentiable and therefore is ill-suited for standard gradient descent. Instead, we use an iterative algorithm that, in each iteration, identifies some pixels that don’t have much effect on the classifier output and then fixes those pixels, so their value will never be changed. The set of fixed pixels grows in each iteration until we have, by process of elimination, identified a minimal (but possibly not minimum) subset of pixels that can be modified to generate an adversarial example. In each iteration, we use our L2 attack to identify which pixels are unimportant.
In more detail, on each iteration, we call the L2 adversary, restricted to only modify the pixels in the allowed set. Let δ be the solution returned from the L2 adversary on input image x, so that x + δ is an adversarial example. We compute g = ∇f(x + δ) (the gradient of the objective function, evaluated at the adversarial instance). We then select the pixel i = arg min_i gi · δi and fix i, i.e., remove i from the allowed set. The intuition is that gi · δi tells us how much reduction to f(·) we obtain from the i-th pixel of the image when moving from x to x + δ: gi tells us how much reduction in f we obtain per unit change to the i-th pixel, and we multiply this by how much the i-th pixel has changed. This process repeats until the L2 adversary fails to find an adversarial example.
There is one final detail required to achieve strong results: choosing a constant c to use for the L2 adversary. To do this, we initially set c to a very low value (e.g., 10−4). We then run our L2 adversary at this c-value. If it fails, we double c and try again, until it is successful. We abort the search if c exceeds a fixed threshold (e.g., 1010).
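A hedged sketch of this shrinking-set procedure is below; l2_attack and grad_f are hypothetical helpers standing in for the L2 adversary restricted to an allowed-pixel mask and for the gradient of the objective f, respectively.

```python
import torch

def cw_l0_attack(x, l2_attack, grad_f):
    """Sketch of the L0 attack: repeatedly run the L2 attack, then freeze the least useful pixel.

    l2_attack(x, mask) is assumed to return a perturbation delta that is zero wherever mask == 0,
    or None if no adversarial example is found; grad_f(x_adv) returns the gradient of f at x_adv.
    """
    mask = torch.ones_like(x)                 # 1 means the pixel may still be modified
    best_delta = None
    while True:
        delta = l2_attack(x, mask)
        if delta is None:                     # the L2 adversary failed: return the last success
            return best_delta
        best_delta = delta
        g = grad_f(x + delta)
        score = (g * delta).detach().clone()  # g_i * delta_i: how much pixel i helps reduce f
        score[mask == 0] = float("inf")       # never re-select pixels that are already frozen
        i = torch.argmin(score)               # the least important pixel
        mask.view(-1)[i] = 0                  # freeze it: it may no longer be changed
```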
JSMA grows a set — initially empty — of pixels that are allowed to be changed and sets the pixels to maximize the total loss. In contrast, our attack shrinks the set of pixels — initially containing every pixel — that are allowed to be changed.
Our algorithm is significantly more effective than JSMA (see Section VII for an evaluation). It is also efficient: we introduce optimizations that make it about as fast as our L2 attack with a single starting point on MNIST and CIFAR; it is substantially slower on ImageNet. Instead of starting gradient descent in each iteration from the initial image, we start the gradient descent from the solution found on the previous iteration (“warm-start”). This dramatically reduces the number of rounds of gradient descent needed during each iteration, as the solution with k pixels held constant is often very similar to the solution with k + 1 pixels held constant.
Figure 4 shows the L0 attack applied to one digit of each source class, targeting each target class, on the MNIST dataset. The attacks are visually noticeable, implying the L0 attack is more difficult than L2. Perhaps the worst case is that of a 7 being made to classify as a 6; interestingly, this attack for L2 is one of the only visually distinguishable attacks.
A comparable figure (Figure 11) for CIFAR is in the appendix.
The L∞ distance metric is not fully differentiable and standard gradient descent does not perform well for it. We experimented with naively optimizing
minimize ~c · f(x + δ) + \lVert δ \rVert_{\infty}\tag{19}
However, we found that gradient descent produces poor results for this objective: the ‖δ‖∞ penalty term penalizes only the largest entry of δ, so gradient descent tends to oscillate between coordinates rather than make steady progress.

We resolve this issue using an iterative attack. We replace the ‖δ‖∞ term in the objective function with a penalty for any terms that exceed τ (initially 1, decreasing in each iteration). This prevents oscillation, as this loss term penalizes all large values simultaneously. Specifically, in each iteration we solve

\text{minimize } c \cdot f(x + \delta) + \sum\limits_{i}\left[(\delta_{i} - \tau)^{+}\right]\tag{20}

After each iteration, if δi < τ for all i, we reduce τ by a factor of 0.9 and repeat; otherwise, we terminate the search.
Again we must choose a good constant c to use for the L∞ adversary. We take the same approach as we do for the L0 attack: initially set c to a very low value and run the L∞ adversary at this c-value. If it fails, we double c and try again, until it is successful. We abort the search if c exceeds a fixed threshold.
Using “warm-start” for gradient descent in each iteration, this algorithm is about as fast as our L2 algorithm (with a single starting point).
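A sketch of the outer τ-reduction loop is below; solve_inner is a hypothetical warm-started solver for the penalized objective above (returning δ or None on failure), and tau_min is an assumed stopping threshold.

```python
def cw_linf_attack(x, solve_inner, c, tau=1.0, tau_decay=0.9, tau_min=1.0 / 256):
    """Sketch of the L_inf attack: repeatedly solve the tau-penalized objective, shrinking tau."""
    best_delta = None
    while tau > tau_min:
        delta = solve_inner(x, tau, c)       # minimize c*f(x+delta) + sum_i max(delta_i - tau, 0)
        if delta is None:
            break                            # the inner solve failed: stop shrinking tau
        best_delta = delta
        if (delta.abs() < tau).all():        # every coordinate is under the current budget
            tau *= tau_decay                 # tighten tau by a factor of 0.9 and repeat
        else:
            break                            # some coordinate hit the budget: terminate the search
    return best_delta
```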
Figure 5 shows the L∞ attack applied to one digit of each source class, targeting each target class, on the MNIST dataset. While most differences are not visually noticeable, a few are. Again, the worst case is that of a 7 being made to classify as a 6.
A comparable figure (Figure 13) for CIFAR is in the appendix. No attack is visually distinguishable from the baseline image.
We compare our targeted attacks to the best results previously reported in prior publications, for each of the three distance metrics.
For JSMA we use the implementation in CleverHans [35] with only slight modification (we improve performance by 50× with no impact on accuracy).
JSMA is unable to run on ImageNet due to an inherent significant computational cost: recall that JSMA performs a search for a pair of pixels p, q that can be changed together to make the target class more likely and other classes less likely. ImageNet represents images as 299 × 299 × 3 vectors, so searching over all pairs of pixels would require 2^36 work on each step of the calculation. If we remove the search over pairs of pixels, the success of JSMA falls off dramatically. We therefore report it as failing always on ImageNet.
We report success if the attack produced an adversarial example with the correct target label, no matter how much change was required. Failure indicates the case where the attack was entirely unable to succeed.
We evaluate on the first 1,000 images in the test set on CIFAR and MNIST. On ImageNet, we report on 1,000 images that were initially classified correctly by Inception v3. On ImageNet we approximate the best-case and worst-case results by choosing 100 target classes (10%) at random.
The results are found in Table IV for MNIST and CIFAR, and Table V for ImageNet.
For each distance metric, across all three datasets, our attacks find closer adversarial examples than the previous state-of-the-art attacks, and our attacks never fail to find an adversarial example. Our L0 and L2 attacks find adversarial examples with 2× to 10× lower distortion than the best previously published attacks, and succeed with 100% probability. Our L∞ attacks are comparable in quality to prior work, but their success rate is higher. Our L∞ attacks on ImageNet are so successful that we can change the classification of an image to any desired label by only flipping the lowest bit of each pixel, a change that would be impossible to detect visually.
As the learning task becomes increasingly more difficult, the previous attacks produce worse results, due to the complexity of the model. In contrast, our attacks perform even better as the task complexity increases. We have found JSMA is unable to find targeted L0 adversarial examples on ImageNet, whereas ours is able to with 100% success.
It is important to realize that the results between models are not directly comparable. For example, even though a L0 adversary must change 10 times as many pixels to switch an ImageNet classification compared to a MNIST classification, ImageNet has 114× as many pixels and so the fraction of pixels that must change is significantly smaller.
Generating synthetic digits. With our targeted adversary, we can start from any image we want and find adversarial examples of each given target. Using this, in Figure 6 we show the minimum perturbation to an entirely-black image required to make it classify as each digit, for each of the distance metrics.
This experiment was performed for the L0 task previously [38]; however, when mounting their attack, "for classes 0, 2, 3 and 5 one can clearly recognize the target digit." With our more powerful attacks, none of the digits are recognizable. Figure 7 performs the same analysis starting from an all-white image.
Notice that the all-black image requires no change to become a digit 1 because it is initially classified as a 1, and the all-white image requires no change to become an 8 because the initial image is already an 8.
Runtime Analysis. We believe there are two reasons why one may consider the runtime performance of adversarial example generation algorithms important: first, to understand if the performance would be prohibitive for an adversary to actually mount the attacks, and second, to be used as an inner loop in adversarial re-training [11].
Comparing the exact runtime of attacks can be misleading. For example, we have parallelized the implementation of our L2 adversary allowing it to run hundreds of attacks simultaneously on a GPU, increasing performance from 10× to 100×. However, we did not parallelize our L0 or L∞ attacks. Similarly, our implementation of fast gradient sign is parallelized, but JSMA is not. We therefore refrain from giving exact performance numbers because we believe an unfair comparison is worse than no comparison.
All of our attacks, and all previous attacks, are efficient enough to be used by an adversary. No attack takes longer than a few minutes to run on any given instance. On L0, our attacks are 2× to 10× slower than our optimized JSMA algorithm (and significantly faster than the un-optimized version). Our attacks are typically 10× to 100× slower than previous attacks for L2 and L∞, with the exception of iterative gradient sign, for which we are 10× slower.
Distillation was initially proposed as an approach to reduce a large model (the teacher) down to a smaller distilled model [19]. At a high level, distillation works by first training the teacher model on the training set in a standard manner. Then, we use the teacher to label each instance in the training set with soft labels (the output vector from the teacher network). For example, while the hard label for an image of a hand-written digit 7 will say it is classified as a seven, the soft labels might say it has a 80% chance of being a seven and a 20% chance of being a one. Then, we train the distilled model on the soft labels from the teacher, rather than on the hard labels from the training set. Distillation can potentially increase accuracy on the test set as well as the rate at which the smaller model learns to predict the hard labels [19], [30].
Defensive distillation uses distillation in order to increase the robustness of a neural network, but with two significant changes. First, both the teacher model and the distilled model are identical in size — defensive distillation does not result in smaller models. Second, and more importantly, defensive distillation uses a large distillation temperature (described below) to force the distilled model to become more confident in its predictions.
Recall that the softmax function is the last layer of a neural network. Defensive distillation modifies the softmax function to also include a temperature constant T:
softmax(x, T)_{i}=\frac{e^{x_{i}/T}}{\sum_{j}e^{x_{j}/T}}\tag{20}
It is easy to see that softmax(x, T) = softmax(x/T, 1). Intuitively, increasing the temperature causes a “softer” maximum, and decreasing it causes a “harder” maximum. As the limit of the temperature goes to 0, softmax approaches max; as the limit goes to infinity, softmax(x) approaches a uniform distribution.
Defensive distillation proceeds in four steps (a small sketch follows the four steps below):
Train a network, the teacher network, by setting the temperature of the softmax to T during the training phase.
Compute soft labels by applying the teacher network to each instance in the training set, again evaluating the softmax at temperature T.
Train the distilled network (a network with the same shape as the teacher network) on the soft labels, using softmax at temperature T.
Finally, when running the distilled network at test time (to classify new inputs), use temperature 1.
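The sketch below walks through these four steps with a temperature-T softmax; make_model, train_loader, and the optimizer settings are placeholders rather than the original training setup, and step 4 simply amounts to evaluating the returned student at temperature 1.

```python
import torch
import torch.nn.functional as F_nn

def softmax_T(logits, T):
    """Softmax at temperature T: softmax(x, T) = softmax(x / T, 1)."""
    return torch.softmax(logits / T, dim=-1)

def defensive_distillation(make_model, train_loader, T=100, epochs=50, lr=0.1):
    # Step 1: train the teacher with the softmax evaluated at temperature T.
    teacher = make_model()
    opt = torch.optim.SGD(teacher.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            loss = F_nn.cross_entropy(teacher(x) / T, y)       # logits divided by T
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Steps 2 and 3: compute soft labels from the teacher at temperature T, and train the
    # distilled model (same architecture) against those soft labels, also at temperature T.
    student = make_model()
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                soft = softmax_T(teacher(x), T)
            log_probs = torch.log_softmax(student(x) / T, dim=-1)
            loss = -(soft * log_probs).sum(dim=-1).mean()      # cross-entropy against soft labels
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Step 4: at test time, run the distilled network at temperature 1, i.e. plain student(x).
    return student
```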
We briefly investigate the reason that existing attacks fail on distilled networks, and find that existing attacks are very fragile and can easily fail to find adversarial examples even when they exist.
L-BFGS and Deepfool fail because the gradient of F(·) is almost always zero, which prohibits the use of the standard objective function.
When we train a distilled network at temperature T and then test it at temperature 1, we effectively cause the inputs to the softmax to become larger by a factor of T. By minimizing the cross entropy during training, the output of the softmax is forced to be close to 1.0 for the correct class and 0.0 for all others. Since Z(·) is divided by T, the distilled network will learn to make the Z(·) values T times larger than they otherwise would be. (Positive values are forced to become about T times larger; negative values are multiplied by a factor of about T and thus become even more negative.) Experimentally, we verified this fact: the mean value of the L1 norm of Z(·) (the logits) on the undistilled network is 5.8 with standard deviation 6.4; on the distilled network (with T = 100), the mean is 482 with standard deviation 457.
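This saturation is easy to reproduce. With logits of the magnitude reported above, a 32-bit softmax is exactly one-hot and its Jacobian is exactly zero; dividing the logits by T first, as in equation (22) below, restores usable gradients. The logit values here are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Made-up logits with the magnitude observed on the distilled network
# (individual logits in the hundreds).
z = np.float32([300.0, -150.0, -80.0, -70.0])

F = softmax(z)
print(F)                      # [1. 0. 0. 0.] -- exactly one-hot in float32

# Jacobian of the softmax with respect to the logits: diag(F) - F F^T.
J = np.diag(F) - np.outer(F, F)
print(np.abs(J).max())        # 0.0 -- every gradient taken through F vanishes

# Dividing the logits by the distillation temperature undoes the saturation.
T = 100.0
F_fixed = softmax(z / T)
J_fixed = np.diag(F_fixed) - np.outer(F_fixed, F_fixed)
print(F_fixed)                # a smooth distribution again
print(np.abs(J_fixed).max())  # non-zero, so gradient descent can make progress
```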
This causes the L-BFGS minimization procedure to fail to make progress and terminate. If instead we run L-BFGS with our stable objective function identified earlier, rather than the objective function loss_{F,l}(·) suggested by Szegedy et al. [46], L-BFGS does not fail. An alternate approach to fixing the attack would be to set
F^{\prime} (x) = softmax(Z(x)/T)\tag{22}
where T is the chosen distillation temperature. Then minimizing loss_{F,l}(·) will not fail, as the gradients no longer vanish due to floating-point rounding. This clearly demonstrates the fragility of using the loss function as the objective to minimize.
JSMA-F (whereby we mean the attack uses the output of the final layer F(·)) fails for the same reason that L-BFGS fails: the output of the Z(·) layer is very large and so softmax becomes essentially a hard maximum. This is the version of the attack that Papernot et al. use to attack defensive distillation in their paper [39].
JSMA-Z (the attack that uses the logits) fails for a completely different reason. Recall that in the Z(·) version of the attack, we use the input to the softmax (rather than the final output of the network) to compute the gradient. This removes any potential issues with the gradient vanishing; however, it introduces new issues. This version of the attack was introduced by Papernot et al. [38], but it was not used to attack distillation; we provide here an analysis of why it fails.
Since this attack uses the Z values, it is important to realize the differences in relative impact. If the smallest input to the softmax layer is −100, then, after the softmax layer, the corresponding output becomes practically zero. If this input changes from −100 to −90, the output will still be practically zero. However, if the largest input to the softmax layer is 10, and it changes to 0, this will have a massive impact on the softmax output.
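A quick numerical check of this point, with illustrative logit values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([10.0, 2.0, -100.0])))  # ~[0.99966, 0.00034, 0.0]
print(softmax(np.array([10.0, 2.0,  -90.0])))  # indistinguishable: -100 -> -90 changes nothing
print(softmax(np.array([ 0.0, 2.0, -100.0])))  # ~[0.12, 0.88, 0.0]: 10 -> 0 flips the prediction
```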
Relating this to the parameters used in their attack, α and β represent the size of the change at the input to the softmax layer. It is perhaps surprising that JSMA-Z works on undistilled networks, as it treats all changes as being of equal importance, regardless of how much they change the softmax output. If changing a single pixel would increase the target class by 10, but also increase the least likely class by 15, the attack will not modify that pixel.
Recall that distillation at temperature T causes the values of the logits to be T times larger. In effect, this magnifies the sub-optimality noted above: logits of classes that are extremely unlikely, but that exhibit slight variation, can cause the attack to refuse to make any changes at all.
Fast Gradient Sign fails at first for the same reason LBFGS fails: the gradients are almost always zero. However, something interesting happens if we attempt the same division trick and divide the logits by T before feeding them to the softmax function: distillation still remains effective [36]. We are unable to explain this phenomenon.
When we apply our attacks to defensively distilled networks, we find that distillation provides only marginal value. We re-implement defensive distillation on MNIST and CIFAR-10 as described in [39], using the same models we used for our evaluation above. We train our distilled models with temperature T = 100, the value found to be most effective [39].
Table VI shows our attacks when applied to distillation. All of the previous attacks fail to find adversarial examples. In contrast, our attack succeeds with 100% success probability for each of the three distance metrics.
When compared to Table IV, distillation has added almost no value: our L0 and L2 attacks perform slightly worse, and our L∞ attack performs approximately equally. All of our attacks succeed with 100% success.
In the original work, increasing the temperature was found to consistently reduce attack success rate. On MNIST, this goes from a 91% success rate at T = 1 to a 24% success rate for T = 5 and finally 0.5% success at T = 100.
[Figure: Distillation Temperature]
We re-implement this experiment with our improved attacks to understand how the choice of temperature impacts robustness. We train models with the temperature varied from T = 1 to T = 100.
When we re-run our implementation of JSMA, we observe the same effect: attack success rapidly decreases. However, with our improved L2 attack, we see no effect of temperature on the mean distance to adversarial examples: the correlation coefficient is ρ = −0.05. This clearly demonstrates the fact that increasing the distillation temperature does not increase the robustness of the neural network, it only causes existing attacks to fail more often.
Recent work has shown that an adversarial example for one model will often also be adversarial on a different model, even if the two are trained on different sets of training data [46], [11], and even if they use entirely different algorithms (i.e., adversarial examples on neural networks transfer to random forests [37]).
[Figure: Value of κ]
Therefore, any defense that is able to provide robustness against adversarial examples must somehow break this transferability property; otherwise, we could run our attack algorithm on an easy-to-attack model, and then transfer those adversarial examples to the hard-to-attack model.
Even though defensive distillation is not robust to our stronger attacks, we demonstrate a second break of distillation by transferring attacks from a standard model to a defensively distilled model.
We accomplish this by finding high-confidence adversarial examples, which we define as adversarial examples that are strongly misclassified by the original model. Instead of looking for an adversarial example that just barely changes the classification from the source to the target, we want one where the target is much more likely than any other label.
Recall the loss function defined earlier for L2 attacks:
f(x^{\prime}) = \max(\max\{Z(x^{\prime})_{i} : i \neq t\} - Z(x^{\prime})_{t}, -\kappa). \tag{23}
The purpose of the parameter κ is to control the strength of adversarial examples: the larger κ, the stronger the classification of the adversarial example. This allows us to generate high-confidence adversarial examples by increasing κ.
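Equation (23) is straightforward to implement; the sketch below uses made-up logits. With κ = 0 the loss bottoms out as soon as the target logit is the largest, while a larger κ keeps pushing until the target logit exceeds every other logit by at least κ:

```python
import numpy as np

def f(logits, target, kappa=0.0):
    """Margin loss of equation (23): max(max_{i != t} Z_i - Z_t, -kappa)."""
    other = np.max(np.delete(logits, target))
    return max(other - logits[target], -kappa)

Z = np.array([2.0, 7.5, 1.0])        # made-up logits Z(x'); class 1 is currently predicted

print(f(Z, target=1, kappa=0.0))     #  0.0 -> already adversarial for target 1, no margin required
print(f(Z, target=1, kappa=10.0))    # -5.5 -> margin is only 5.5, so the optimizer keeps pushing
print(f(Z, target=0, kappa=0.0))     #  5.5 -> not yet classified as target 0
```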
We first investigate if our hypothesis is true that the stronger the classification on the first model, the more likely it will transfer. We do this by varying κ from 0 to 40.
Our baseline experiment uses two models trained on MNIST as described in Section IV, with each model trained on half of the training data. We find that the transferability success rate increases linearly from κ = 0 to κ = 20 and then plateaus at near-100% success for κ ≈ 20, so clearly increasing κ increases the probability of a successful transferable attack.
We then run the same experiment, except that we train the second model with defensive distillation, and find that adversarial examples still transfer. This gives us another attack technique for finding adversarial examples on distilled networks.
However, interestingly, the transferability success rate between the unsecured model and the distilled model only reaches 100% success at κ = 40, in comparison to the previous approach that only required κ = 20.
We believe that this approach can be used in general to evaluate the robustness of defenses, even if a defense is able to completely block the flow of gradients and thereby prevent our gradient-descent based approaches from succeeding.
The existence of adversarial examples limits the areas in which deep learning can be applied. It is an open problem to construct defenses that are robust to adversarial examples. In an attempt to solve this problem, defensive distillation was proposed as a general-purpose procedure to increase the robustness of an arbitrary neural network.
In this paper, we propose powerful attacks that defeat defensive distillation, demonstrating that our attacks more generally can be used to evaluate the efficacy of potential defenses. By systematically evaluating many possible attack approaches, we settle on one that can consistently find better adversarial examples than all existing approaches. We use this evaluation as the basis of our three L0, L2, and L∞ attacks.
We encourage those who create defenses to perform the two evaluation approaches we use in this paper:
Use a powerful attack (such as the ones proposed in this paper) to evaluate the robustness of the secured model directly. Since a defense that prevents our L2 attack will prevent our other attacks, defenders should make sure to establish robustness against the L2 distance metric.
Demonstrate that transferability fails by constructing high-confidence adversarial examples on an unsecured model and showing that they fail to transfer to the secured model.
We would like to thank Nicolas Papernot for discussing our defensive distillation implementation, and the anonymous reviewers for their helpful feedback. This work was supported by Intel through the ISTC for Secure Computing, Qualcomm, Cisco, the AFOSR under MURI award FA9550-12-1-0040, and the Hewlett Foundation through the Center for Long-Term Cybersecurity.
[1] ANDOR, D., ALBERTI, C., WEISS, D., SEVERYN, A., PRESTA, A., GANCHEV, K., PETROV, S., AND COLLINS, M. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042 (2016).
[2] BASTANI, O., IOANNOU, Y., LAMPROPOULOS, L., VYTINIOTIS, D., NORI, A., AND CRIMINISI, A. Measuring neural net robustness with constraints. arXiv preprint arXiv:1605.07262 (2016).
[3] BOJARSKI, M., DEL TESTA, D., DWORAKOWSKI, D., FIRNER, B., FLEPP, B., GOYAL, P., JACKEL, L. D., MONFORT, M., MULLER, U., ZHANG, J., ET AL. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016).
[4] BOURZAC, K. Bringing big neural networks to self-driving cars, smartphones, and drones. http: //spectrum.ieee.org/computing/embedded-systems/ bringing-big-neural-networks-to-selfdriving-cars-smartphones-and-drones, 2016.
[5] CARLINI, N., MISHRA, P., VAIDYA, T., ZHANG, Y., SHERR, M., SHIELDS, C., WAGNER, D., AND ZHOU, W. Hidden voice commands. In 25th USENIX Security Symposium (USENIX Security 16), Austin, TX (2016).
[6] CHANDOLA, V., BANERJEE, A., AND KUMAR, V. Anomaly detection: A survey. ACM computing surveys (CSUR) 41, 3 (2009), 15.
[7] CLEVERT, D.-A., UNTERTHINER, T., AND HOCHREITER, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015).
[8] DAHL, G. E., STOKES, J. W., DENG, L., AND YU, D. Large-scale malware classification using random projections and neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), IEEE, pp. 3422–3426.
[9] DENG, J., DONG, W., SOCHER, R., LI, L.-J., LI, K., AND FEI-FEI, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 248–255.
[10] GIUSTI, A., GUZZI, J., CIREŞAN, D. C., HE, F.-L., RODRÍGUEZ, J. P., FONTANA, F., FAESSLER, M., FORSTER, C., SCHMIDHUBER, J., DI CARO, G., ET AL. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters 1, 2 (2016), 661–667.
[11] GOODFELLOW, I. J., SHLENS, J., AND SZEGEDY, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
[12] GRAHAM, B. Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014).
[13] GRAVES, A., MOHAMED, A.-R., AND HINTON, G. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing (2013), IEEE, pp. 6645–6649.
[14] GROSSE, K., PAPERNOT, N., MANOHARAN, P., BACKES, M., AND MCDANIEL, P. Adversarial perturbations against deep neural networks for malware classification. arXiv preprint arXiv:1606.04435 (2016).
[15] GU, S., AND RIGAZIO, L. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068 (2014).
[16] HE, K., ZHANG, X., REN, S., AND SUN, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
[17] HINTON, G., DENG, L., YU, D., DAHL, G., RAHMAN MOHAMED, A., JAITLY, N., SENIOR, A., VANHOUCKE, V., NGUYEN, P., SAINATH, T., AND KINGSBURY, B. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine (2012).
[18] HINTON, G., DENG, L., YU, D., DAHL, G. E., MOHAMED, A.-R., JAITLY, N., SENIOR, A., VANHOUCKE, V., NGUYEN, P., SAINATH, T. N., ET AL. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
[19] HINTON, G., VINYALS, O., AND DEAN, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[20] HUANG, R., XU, B., SCHUURMANS, D., AND SZEPESVÁRI, C. Learning with a strong adversary. CoRR, abs/1511.03034 (2015).
[21] HUANG, X., KWIATKOWSKA, M., WANG, S., AND WU, M. Safety verification of deep neural networks. arXiv preprint arXiv:1610.06940 (2016).
[22] JANGLOVÁ, D. Neural networks in mobile robot motion. Cutting Edge Robotics 1, 1 (2005), 243.
[23] KINGMA, D., AND BA, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[24] KRIZHEVSKY, A., AND HINTON, G. Learning multiple layers of features from tiny images.
[25] KRIZHEVSKY, A., SUTSKEVER, I., AND HINTON, G. E. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (2012), pp. 1097–1105.
[26] KURAKIN, A., GOODFELLOW, I., AND BENGIO, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016).
[27] LECUN, Y., BOTTOU, L., BENGIO, Y., AND HAFFNER, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[28] LECUN, Y., CORTES, C., AND BURGES, C. J. The mnist database of handwritten digits, 1998.
[29] MAAS, A. L., HANNUN, A. Y., AND NG, A. Y. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML (2013), vol. 30.
[30] MELICHER, W., UR, B., SEGRETI, S. M., KOMANDURI, S., BAUER, L., CHRISTIN, N., AND CRANOR, L. F. Fast, lean and accurate: Modeling password guessability using neural networks. In Proceedings of USENIX Security (2016).
[31] MISHKIN, D., AND MATAS, J. All you need is a good init. arXiv preprint arXiv:1511.06422 (2015).
[32] MNIH, V., KAVUKCUOGLU, K., SILVER, D., GRAVES, A., ANTONOGLOU, I., WIERSTRA, D., AND RIEDMILLER, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[33] MNIH, V., KAVUKCUOGLU, K., SILVER, D., RUSU, A. A., VENESS, J., BELLEMARE, M. G., GRAVES, A., RIEDMILLER, M., FIDJELAND, A. K., OSTROVSKI, G., ET AL. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[34] MOOSAVI-DEZFOOLI, S.-M., FAWZI, A., AND FROSSARD, P. Deepfool: a simple and accurate method to fool deep neural networks. arXiv preprint arXiv:1511.04599 (2015).
[35] PAPERNOT, N., GOODFELLOW, I., SHEATSLEY, R., FEINMAN, R., AND MCDANIEL, P. cleverhans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768 (2016).
[36] PAPERNOT, N., AND MCDANIEL, P. On the effectiveness of defensive distillation. arXiv preprint arXiv:1607.05113 (2016).
[37] PAPERNOT, N., MCDANIEL, P., AND GOODFELLOW, I. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277 (2016).
[38] PAPERNOT, N., MCDANIEL, P., JHA, S., FREDRIKSON, M., CELIK, Z. B., AND SWAMI, A. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P) (2016), IEEE, pp. 372–387.
[39] PAPERNOT, N., MCDANIEL, P., WU, X., JHA, S., AND SWAMI, A. Distillation as a defense to adversarial perturbations against deep neural networks. IEEE Symposium on Security and Privacy (2016).
[40] PASCANU, R., STOKES, J. W., SANOSSIAN, H., MARINESCU, M., AND THOMAS, A. Malware classification with recurrent networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), IEEE, pp. 1916–1920.
[41] RUSSAKOVSKY, O., DENG, J., SU, H., KRAUSE, J., SATHEESH, S., MA, S., HUANG, Z., KARPATHY, A., KHOSLA, A., BERNSTEIN, M., BERG, A. C., AND FEI-FEI, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.
[42] SHAHAM, U., YAMADA, Y., AND NEGAHBAN, S. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432 (2015).
[43] SILVER, D., HUANG, A., MADDISON, C. J., GUEZ, A., SIFRE, L., VAN DEN DRIESSCHE, G., SCHRITTWIESER, J., ANTONOGLOU, I., PANNEERSHELVAM, V., LANCTOT, M., ET AL. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
[44] SPRINGENBERG, J. T., DOSOVITSKIY, A., BROX, T., AND RIEDMILLER, M. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806 (2014).
[45] SZEGEDY, C., VANHOUCKE, V., IOFFE, S., SHLENS, J., AND WOJNA, Z. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567 (2015).
[46] SZEGEDY, C., ZAREMBA, W., SUTSKEVER, I., BRUNA, J., ERHAN, D., GOODFELLOW, I., AND FERGUS, R. Intriguing properties of neural networks. ICLR (2013).
[47] WARDE-FARLEY, D., AND GOODFELLOW, I. Adversarial perturbations of deep neural networks. Advanced Structured Prediction, T. Hazan, G. Papandreou, and D. Tarlow, Eds (2016).
[48] YUAN, Z., LU, Y., WANG, Z., AND XUE, Y. Droid-sec: Deep learning in android malware detection. In ACM SIGCOMM Computer Communication Review (2014), vol. 44, ACM, pp. 371–372.
In this paper we create a set of attacks that can be used to construct an upper bound on the robustness of neural networks. As a case study, we use these attacks to demonstrate that defensive distillation does not actually eliminate adversarial examples. We construct three new attacks (under three previously used distance metrics: L0, L2, and L∞) that succeed in finding adversarial examples for 100% of images on defensively distilled networks. While defensive distillation stops previously published attacks, it cannot resist the more powerful attack techniques we introduce in this paper.
The key question then becomes exactly how much distortion we must add to cause the classification to change. In each domain, the distance metric that we must use is different. In the space of images, which we focus on in this paper, we rely on previous work suggesting that various Lp norms are reasonable approximations of human perceptual distance (see Section II-D for more information).
A neural network is a function F(x) = y that accepts an input x ∈ R^n and produces an output y ∈ R^m. The model F also implicitly depends on some model parameters θ; in our work the model is fixed, so for convenience we do not show the dependence on θ.
In this paper we focus on neural networks used as m-class classifiers. The output of the network is computed using the softmax function, which ensures that the output vector y satisfies 0 ≤ y_i ≤ 1 and y_1 + · · · + y_m = 1. The output vector y is thus treated as a probability distribution, i.e., y_i is treated as the probability that input x has class i. The classifier assigns the label C(x) = arg max_i F(x)_i to the input x. Let C*(x) be the correct label of x. The inputs to the softmax function are called logits.
for some non-linear activation function σ, some matrix θ of model weights, and some vector θ̂ of model biases. Together, θ and θ̂ make up the model parameters. Common choices of σ are tanh [31], sigmoid, ReLU [29], or ELU [7]. In this paper we focus primarily on networks that use a ReLU activation function, as it is currently the most widely used activation function [45], [44], [31], [39].
We use image classification as our primary evaluation domain. An h×w-pixel grey-scale image is a two-dimensional vector x ∈ R^{hw}, where x_i denotes the intensity of pixel i, scaled to be in the range [0, 1]. A color RGB image is a three-dimensional vector x ∈ R^{3hw}. We do not convert RGB images to HSV, HSL, or other cylindrical coordinate representations of color images: the neural networks act on raw pixel values.
Szegedy et al. [46] first pointed out the existence of adversarial examples: given a valid input x and a target t ≠ C*(x), it is often possible to find a similar input x′ such that C(x′) = t, yet x and x′ are close according to some distance metric. An example x′ with this property is known as a targeted adversarial example.
A less powerful attack also discussed in the literature instead asks for untargeted adversarial examples: instead of classifying x as a given target class, we only search for an input x′ so that C(x′) ≠ C*(x) and x, x′ are close. Untargeted attacks are strictly less powerful than targeted attacks, and we do not consider them in this paper.
In our definition of adversarial examples, we require a distance metric to quantify similarity. There are three widely used distance metrics in the literature for generating adversarial examples, all of which are Lp norms.
The Lp distance is written ‖x − x′‖_p, where the p-norm ‖·‖_p is defined as
\|v\|_{p} = \Big(\sum_{i=1}^{n} |v_{i}|^{p}\Big)^{1/p}
The L0 distance measures the number of coordinates i such that x_i ≠ x′_i. Thus, the L0 distance corresponds to the number of pixels that have been altered in an image. Papernot et al. argue for the use of the L0 distance metric, and it is the primary distance metric under which defensive distillation's security is argued [39].
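The three metrics are simple to compute for a candidate adversarial image; the sketch below uses a random array in place of a real image and perturbs two pixels:

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.random((28, 28))                  # stand-in for an original grey-scale image
delta = np.zeros_like(x)
delta[5, 5], delta[10, 20] = 0.30, -0.05  # change exactly two pixels
x_adv = np.clip(x + delta, 0.0, 1.0)      # keep pixel values in [0, 1]

d = (x_adv - x).ravel()
L0   = np.count_nonzero(d)                # number of altered pixels
L2   = np.linalg.norm(d)                  # Euclidean length of the perturbation
Linf = np.max(np.abs(d))                  # largest change to any single pixel

print(L0, L2, Linf)                       # e.g. 2, ~0.30, ~0.30
```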
Szegedy et al. [46] generated adversarial examples using box-constrained L-BFGS. Given an image x, their method finds a different image x′ that is similar to x under the L2 distance, yet is labeled differently by the classifier. They model the problem as a constrained minimization problem:
\text{minimize } c\,\|x - x^{\prime}\|_{2}^{2} + \text{loss}_{F,l}(x^{\prime}) \quad \text{such that } x^{\prime} \in [0,1]^{n}
where loss_{F,l} is a function mapping an image to a positive real number. One common loss function to use is cross-entropy. Line search is performed to find the constant c > 0 that yields an adversarial example of minimum distance: in other words, we repeatedly solve this optimization problem for multiple values of c, adaptively updating c using bisection search or any other method for one-dimensional optimization.
where ε is chosen to be sufficiently small so as to be undetectable, and t is the target label. Intuitively, for each pixel, the fast gradient sign method uses the gradient of the loss function to determine in which direction the pixel's intensity should be changed (whether it should be increased or decreased) to minimize the loss function; then, it shifts all pixels simultaneously.
Iterative Gradient Sign: Kurakin et al. introduce a simple refinement of the fast gradient sign method [26]: instead of taking a single step of size ε in the direction of the gradient sign, multiple smaller steps of size α are taken, and after every step the accumulated perturbation is clipped so that it remains within ε of the original image.
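A sketch of this iterative update on a toy linear model (so the gradient has a closed form) is shown below. The model, step sizes, and loss are illustrative assumptions rather than the networks or hyper-parameters evaluated in the paper, and on this toy model the attack may or may not succeed; the point is the clipped, multi-step update rule:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy linear "network": logits Z(x) = W x, three classes, a 16-pixel image.
W = rng.normal(size=(3, 16))
x = rng.random(16)
target = 2

eps, alpha, steps = 0.25, 0.025, 30
x_adv = x.copy()
for _ in range(steps):
    p = softmax(W @ x_adv)
    grad = W.T @ (p - np.eye(3)[target])        # gradient of cross-entropy toward the target class
    x_adv = x_adv - alpha * np.sign(grad)       # small signed step toward the target
    x_adv = x + np.clip(x_adv - x, -eps, eps)   # clip the total perturbation to the eps-ball
    x_adv = np.clip(x_adv, 0.0, 1.0)            # keep a valid image

print("original class:", np.argmax(W @ x), " perturbed class:", np.argmax(W @ x_adv))
print("L_inf distortion:", np.abs(x_adv - x).max())
```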
Papernot et al. introduced an attack optimized under the L0 distance [38], known as the Jacobian-based Saliency Map Attack (JSMA). We give a brief description of their attack algorithm; for a complete description and motivation, we encourage the reader to consult their original paper [38].
At a high level, the attack is a greedy algorithm that picks pixels to modify one at a time, increasing the target classification on each iteration. It uses the gradient of Z(x) to compute a saliency map, which models the impact each pixel has on the resulting classification. A large value indicates that changing the pixel will significantly increase the likelihood of the model labeling the image as the target class l. Given the saliency map, the attack picks the most important pixel and modifies it to increase the likelihood of class l. This is repeated until either more than a set threshold of pixels have been modified, which makes the attack detectable, or it succeeds in changing the classification.
so that α_pq represents how much changing both pixels p and q will change the target classification, and β_pq represents how much changing p and q will change all other outputs. Then the algorithm picks
the pair (p, q) for which α_pq > 0 (the target class becomes more likely), β_pq < 0 (the other classes become less likely), and −α_pq · β_pq is largest.
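A sketch of this pair-selection rule is shown below. The per-pixel quantities a and b would come from the gradient of Z(x) in the real attack; here they are random stand-ins, so the chosen pair only illustrates the search:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-pixel stand-ins for the gradient information used by JSMA:
# a[i] ~ effect of pixel i on the target logit, b[i] ~ effect on all other logits.
n = 8
a = rng.normal(size=n)
b = rng.normal(size=n)

best, best_pair = -np.inf, None
for p in range(n):
    for q in range(p + 1, n):
        alpha = a[p] + a[q]          # alpha_pq: how much (p, q) helps the target class
        beta = b[p] + b[q]           # beta_pq: how much (p, q) changes every other class
        if alpha > 0 and beta < 0 and -alpha * beta > best:
            best, best_pair = -alpha * beta, (p, q)

print("chosen pixel pair:", best_pair)   # None if no pair satisfies both sign conditions
```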
Here x is fixed, and the goal is to find a δ that minimizes D(x, x + δ). That is, we want to make some small change δ to an image x that changes its classification, while ensuring that the result is still a valid image. Here D is a distance metric; for us, it will be either L0, L2, or L∞, as discussed earlier.
The above formulation is difficult for existing algorithms to solve directly, because the constraint C(x + δ) = t is highly non-linear. We therefore express it in a different form that is better suited for optimization: we define an objective function f such that C(x + δ) = t if and only if f(x + δ) ≤ 0. There are many possible choices for f:
Change of variables: we introduce a new variable w and, instead of optimizing over the variable δ defined above, apply a change of variables and optimize over w, setting δ_i = ½(tanh(w_i) + 1) − x_i. Since −1 ≤ tanh(w_i) ≤ 1, it follows that 0 ≤ x_i + δ_i ≤ 1, so the solution will automatically be valid. We can think of this approach as a smoothing of clipped gradient descent that eliminates the problem of getting stuck in extreme regions.
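The box constraint really does hold by construction under this change of variables, which a few lines of numpy confirm (the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(6)                       # stand-in image pixels in [0, 1]
w = rng.normal(scale=10.0, size=6)      # unconstrained variable; even extreme values are fine

x_new = 0.5 * (np.tanh(w) + 1.0)        # the perturbed image x + delta
delta = x_new - x                       # the implied perturbation

print(bool(x_new.min() >= 0.0 and x_new.max() <= 1.0))  # True: no clipping or projection needed
print(delta)
```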
Prior work has largely ignored the integrality constraints. For instance, when using the fast gradient sign attack with ε = 0.1 (i.e., changing pixel values by 10%), discretization rarely affects the success rate of the attack. In contrast, in our work, we are able to find attacks that make much smaller changes to the images, so discretization effects cannot be ignored. We take care to always generate valid images; when reporting the success rate of our attacks, the results always include the discretization post-processing.
Putting these ideas together, we obtain a method for finding adversarial examples that have low distortion in the L2 metric. Given x, we choose a target class t (such that t ≠ C*(x)) and then search for the w that solves
\text{minimize } \big\|\tfrac{1}{2}(\tanh(w)+1) - x\big\|_{2}^{2} + c \cdot f\big(\tfrac{1}{2}(\tanh(w)+1)\big)
with f defined as in equation (23).
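Combining the change of variables with the margin loss of equation (23) gives the quantity minimized over w. The sketch below only evaluates this objective for a toy linear logits function Z; in practice it would be minimized with a gradient-based optimizer, and the toy model and constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(3, 8))        # toy linear logits function Z(x) = W x
x = rng.random(8)                  # original image with 8 "pixels"
target, c, kappa = 1, 1.0, 0.0

def objective(w):
    x_new = 0.5 * (np.tanh(w) + 1.0)               # change of variables: x + delta
    Z = W @ x_new
    other = np.max(np.delete(Z, target))
    f_val = max(other - Z[target], -kappa)         # equation (23)
    return np.sum((x_new - x) ** 2) + c * f_val    # ||delta||_2^2 + c * f(x + delta)

# Start from the w that encodes the original image (so delta = 0 initially).
w0 = np.arctanh(2.0 * np.clip(x, 1e-6, 1.0 - 1e-6) - 1.0)
print("objective at the original image:", objective(w0))
```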
In more detail, on each iteration we call the L2 adversary, restricted to modify only the currently allowed set of pixels. Let δ be the solution returned by the L2 adversary on input image x, so that x + δ is an adversarial example. We compute g = ∇f(x + δ) (the gradient of the objective function, evaluated at the adversarial instance). We then select the pixel i = arg min_i g_i · δ_i and fix it, i.e., remove i from the allowed set. The intuition is that g_i · δ_i tells us how much reduction in f(·) we obtain from the i-th pixel of the image when moving from x to x + δ: g_i tells us how much f is reduced per unit change to the i-th pixel, and we multiply this by how much the i-th pixel actually changed. This process repeats until the L2 adversary fails to find an adversarial example.
However, we found that gradient descent produces very poor results: the ‖δ‖∞ term only penalizes the largest entry of δ (in absolute value) and has no impact on any of the others. As such, gradient descent very quickly becomes stuck oscillating between two suboptimal solutions. Consider a case where δ_i = 0.5 + ε and δ_j = 0.5 − ε. The L∞ norm only penalizes δ_i, not δ_j, and ∂‖δ‖∞/∂δ_j will be zero at this point. Thus, the gradient imposes no penalty for increasing δ_j, even though it is already large. On the next iteration we might move to a position where δ_j is slightly larger than δ_i, say δ_i = 0.5 − ε′ and δ_j = 0.5 + ε′, a mirror image of where we started. In other words, gradient descent may oscillate back and forth across the line δ_i = δ_j = 0.5, making it nearly impossible to make progress.
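The problem is visible even in two dimensions: the (sub)gradient of ‖δ‖∞ puts all of its weight on whichever coordinate is currently largest, so the other coordinate is never penalized. A tiny illustration with made-up values:

```python
import numpy as np

def linf_subgradient(delta):
    # Subgradient of ||delta||_inf: all mass on the entry of largest magnitude.
    g = np.zeros_like(delta)
    i = np.argmax(np.abs(delta))
    g[i] = np.sign(delta[i])
    return g

eps = 1e-3
delta = np.array([0.5 + eps, 0.5 - eps])

print(linf_subgradient(delta))   # [1. 0.]: only delta_i is penalized, delta_j is not
# A gradient step therefore shrinks delta_i but leaves delta_j untouched; after the step
# the roles swap, and the iterates oscillate around delta_i = delta_j = 0.5.
```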
We re-implement Deepfool, fast gradient sign, and iterative gradient sign. For fast gradient sign, we search over ε to find the smallest value that generates an adversarial example; failure is returned if no ε produces the target class. Our iterative gradient sign method is similar: we search over ε and return the smallest one that succeeds.
Because the values of Z(·) are 100 times larger, when we test at temperature 1, the output of F becomes ε in all components except for the output class, which has confidence 1 − 9ε, for some very small ε (for tasks with 10 classes). In fact, in most cases, ε is so small that the 32-bit floating-point value is rounded to 0. For similar reasons, the gradient is so small that it becomes 0 when expressed as a 32-bit floating-point value.