Towards Deep Learning Models Resistant to Adversarial Attacks

Original paper link

GB/T 7714: Madry A, Makelov A, Schmidt L, et al. Towards deep learning models resistant to adversarial attacks[J]. arXiv preprint arXiv:1706.06083, 2017.

MLA: Madry, Aleksander, et al. "Towards deep learning models resistant to adversarial attacks." arXiv preprint arXiv:1706.06083 (2017).

APA: Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

Abstract

Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples—inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. 1

1 Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.

1 Introduction

Recent breakthroughs in computer vision [17, 12] and natural language processing [7] are bringing trained classifiers into the center of security-critical systems. Important examples include vision for autonomous cars, face recognition, and malware detection. These developments make security aspects of machine learning increasingly important. In particular, resistance to adversarially chosen inputs is becoming a crucial design goal. While trained models tend to be very effective in classifying benign inputs, recent work [2, 28, 22] shows that an adversary is often able to manipulate the input so that the model produces an incorrect output.

This phenomenon has received particular attention in the context of deep neural networks, and there is now a quickly growing body of work on this topic [11, 9, 27, 18, 23, 29]. Computer vision presents a particularly striking challenge: very small changes to the input image can fool state-of-the-art neural networks with high confidence [28, 21]. This holds even when the benign example was classified correctly, and the change is imperceptible to a human. Apart from the security implications, this phenomenon also demonstrates that our current models are not learning the underlying concepts in a robust manner. All these findings raise a fundamental question:

How can we train deep neural networks that are robust to adversarial inputs?

There is now a sizable body of work proposing various attack and defense mechanisms for the adversarial setting. Examples include defensive distillation [24, 6], feature squeezing [31, 14], and several other adversarial example detection approaches [5]. These works constitute important first steps in exploring the realm of possibilities here. They, however, do not offer a good understanding of the guarantees they provide. We can never be certain that a given attack finds the “most adversarial” example in the context, or that a particular defense mechanism prevents the existence of some well-defined class of adversarial attacks. This makes it difficult to navigate the landscape of adversarial robustness or to fully evaluate the possible security implications.

In this paper, we study the adversarial robustness of neural networks through the lens of robust optimization. We use a natural saddle point (min-max) formulation to capture the notion of security against adversarial attacks in a principled manner. This formulation allows us to be precise about the type of security guarantee we would like to achieve, i.e., the broad class of attacks we want to be resistant to (in contrast to defending only against specific known attacks). The formulation also enables us to cast both attacks and defenses into a common theoretical framework, naturally encapsulating most prior work on adversarial examples. In particular, adversarial training directly corresponds to optimizing this saddle point problem. Similarly, prior methods for attacking neural networks correspond to specific algorithms for solving the underlying constrained optimization problem.

Equipped with this perspective, we make the following contributions.

  1. We conduct a careful experimental study of the optimization landscape corresponding to this saddle point formulation. Despite the non-convexity and non-concavity of its constituent parts, we find that the underlying optimization problem is tractable after all. In particular, we provide strong evidence that first-order methods can reliably solve this problem. We supplement these insights with ideas from real analysis to further motivate projected gradient descent (PGD) as a universal “first-order adversary”, i.e., the strongest attack utilizing the local first order information about the network.

  2. We explore the impact of network architecture on adversarial robustness and find that model capacity plays an important role here. To reliably withstand strong adversarial attacks, networks require a larger capacity than for correctly classifying benign examples only. This shows that a robust decision boundary of the saddle point problem can be significantly more complicated than a decision boundary that simply separates the benign data points.

  3. Building on the above insights, we train networks on MNIST [19] and CIFAR10 [16] that are robust to a wide range of adversarial attacks. Our approach is based on optimizing the aforementioned saddle point formulation and uses PGD as a reliable first-order adversary. Our best MNIST model achieves an accuracy of more than 89% against the strongest adversaries in our test suite. In particular, our MNIST network is even robust against white box attacks of an iterative adversary. Our CIFAR10 model achieves an accuracy of 46% against the same adversary. Furthermore, in case of the weaker black box/transfer attacks, our MNIST and CIFAR10 networks achieve the accuracy of more than 95% and 64%, respectively. (A more detailed overview can be found in Tables 1 and 2.) To the best of our knowledge, we are the first to achieve these levels of robustness on MNIST and CIFAR10 against such a broad set of attacks.

Overall, these findings suggest that secure neural networks are within reach. In order to further support this claim, we invite the community to attempt attacks against our MNIST and CIFAR10 networks in the form of a challenge. This will let us evaluate its robustness more accurately, and potentially lead to novel attack methods in the process. The complete code, along with the description of the challenge, is available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.

2 An Optimization View on Adversarial Robustness

Much of our discussion will revolve around an optimization view of adversarial robustness. This perspective not only captures the phenomena we want to study in a precise manner, but will also inform our investigations. To this end, let us consider a standard classification task with an underlying data distribution D over pairs of examples x ∈ R^{d} and corresponding labels y ∈ [k]. We also assume that we are given a suitable loss function L(θ, x, y), for instance the cross-entropy loss for a neural network. As usual, θ ∈ R^{p} is the set of model parameters. Our goal then is to find model parameters θ that minimize the risk E_{(x,y)∼D}[L(x, y, θ)].

Empirical risk minimization (ERM) has been tremendously successful as a recipe for finding classifiers with small population risk. Unfortunately, ERM often does not yield models that are robust to adversarially crafted examples [2, 28]. Formally, there are efficient algorithms (“adversaries”) that take an example x belonging to class c_1 as input and find examples x_adv such that x_adv is very close to x but the model incorrectly classifies x_adv as belonging to class c_2 ≠ c_1.

In order to reliably train models that are robust to adversarial attacks, it is necessary to augment the ERM paradigm appropriately. Instead of resorting to methods that directly focus on improving the robustness to specific attacks, our approach is to first propose a concrete guarantee that an adversarially robust model should satisfy. We then adapt our training methods towards achieving this guarantee.

The first step towards such a guarantee is to specify an attack model, i.e., a precise definition of the attacks our models should be resistant to. For each data point x, we introduce a set of allowed perturbations S ⊆ R^{d} that formalizes the manipulative power of the adversary. In image classification, we choose S so that it captures perceptual similarity between images. For instance, the ℓ∞-ball around x has recently been studied as a natural notion for adversarial perturbations [11]. While we focus on robustness against ℓ∞-bounded attacks in this paper, we remark that more comprehensive notions of perceptual similarity are an important direction for future research.

Next, we modify the definition of population risk E_D[L] by incorporating the above adversary. Instead of feeding samples from the distribution D directly into the loss L, we allow the adversary to perturb the input first. This gives rise to the following saddle point problem, which is our central object of study:

min_{θ} ρ(θ),  where  ρ(θ) = E_{(x,y)∼D}[max_{δ∈S} L(θ, x + δ, y)]        (2.1)


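For concreteness, the following is a minimal sketch (not the authors' implementation) of how the finite-sample version of this objective can be estimated on a batch in PyTorch. The inner maximization is delegated to a hypothetical `inner_max` routine standing in for an attack such as the PGD adversary discussed below; the outer expectation is replaced by a batch average.

```python
import torch
import torch.nn.functional as F

def adversarial_risk(model, x, y, inner_max):
    """Estimate rho(theta) = E[max_{delta in S} L(theta, x + delta, y)] on one batch.

    `inner_max` is a placeholder for an attack that approximately solves the inner
    maximization, e.g. projected gradient descent; it maps (model, x, y) to a
    perturbed batch x + delta with delta constrained to the allowed set S.
    """
    x_adv = inner_max(model, x, y)                  # approximate inner maximizer
    with torch.no_grad():
        loss = F.cross_entropy(model(x_adv), y)     # batch average of the per-example loss
    return loss.item()
```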
Formulations of this type (and their finite-sample counterparts) have a long history in robust optimization, going back to Wald [30]. It turns out that this formulation is also particularly useful in our context.

First, this formulation gives us a unifying perspective that encompasses much prior work on adversarial robustness. Our perspective stems from viewing the saddle point problem as the composition of an inner maximization problem and an outer minimization problem. Both of these problems have a natural interpretation in our context. The inner maximization problem aims to find an adversarial version of a given data point x that achieves a high loss. This is precisely the problem of attacking a given neural network. On the other hand, the goal of the outer minimization problem is to find model parameters so that the “adversarial loss” given by the inner attack problem is minimized. This is precisely the problem of training a robust classifier using adversarial training techniques.

Second, the saddle point problem specifies a clear goal that an ideal robust classifier should achieve, as well as a quantitative measure of its robustness. In particular, when the parameters θ yield a (nearly) vanishing risk, the corresponding model is perfectly robust to attacks specified by our attack model.

Our paper investigates the structure of this saddle point problem in the context of deep neural networks. These investigations then lead us to training techniques that produce models with high resistance to a wide range of adversarial attacks. Before turning to our contributions, we briefly review prior work on adversarial examples and describe in more detail how it fits into the above formulation.

2.1 A Unified View on Attacks and Defenses

Prior work on adversarial examples has focused on two main questions:

  1. How can we produce strong adversarial examples, i.e., adversarial examples that fool a model with high confidence while requiring only a small perturbation?

  2. How can we train a model so that there are no adversarial examples, or at least so that an adversary cannot find them easily?

Our perspective on the saddle point problem (2.1) gives answers to both these questions. On the attack side, prior work has proposed methods such as the Fast Gradient Sign Method (FGSM) [11] and multiple variations of it [18]. FGSM is an attack for an ℓ∞-bounded adversary and computes an adversarial example as

x + ε sgn(∇_{x} L(θ, x, y)).


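A minimal PyTorch sketch of this one-step attack, assuming a classifier `model`, the cross-entropy loss, and inputs normalized to [0, 1]; the final clamp to the valid pixel range is an implementation detail not spelled out in the formula.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """FGSM: return x + eps * sign(grad_x L(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    x_adv = x + eps * grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep the perturbed image in the valid range
```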
One can interpret this attack as a simple one-step scheme for maximizing the inner part of the saddle point formulation. A more powerful adversary is the multi-step variant, which is essentially projected gradient descent (PGD) on the negative loss function

x^{t+1} = Π_{x+S}(x^{t} + α sgn(∇_{x} L(θ, x^{t}, y)))


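A corresponding multi-step sketch in the same style, again only an illustration: it takes signed ℓ∞ gradient steps and after every step projects back onto the ε-ball around the natural example (the projection Π_{x+S}); the optional random start inside the ball matches the randomized variant used later in the paper.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps, alpha, steps, random_start=True):
    """Projected gradient descent on the negative loss within the l_inf ball of radius eps."""
    x_nat = x.clone().detach()
    x_adv = x_nat.clone()
    if random_start:
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # projection onto the eps-ball around x_nat, then onto the valid pixel range
        x_adv = torch.min(torch.max(x_adv, x_nat - eps), x_nat + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```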
Other methods like FGSM with random perturbation have also been proposed [29]. Clearly, all of these approaches can be viewed as specific attempts to solve the inner maximization problem in (2.1).

On the defense side, the training dataset is often augmented with adversarial examples produced by FGSM. This approach also directly follows from (2.1) when linearizing the inner maximization problem. To solve the simplified robust optimization problem, we replace every training example with its FGSM-perturbed counterpart. More sophisticated defense mechanisms such as training against multiple adversaries can be seen as better, more exhaustive approximations of the inner maximization problem.

3 Towards Universally Robust Networks

Current work on adversarial examples usually focuses on specific defensive mechanisms, or on attacks against such defenses. An important feature of formulation (2.1) is that attaining small adversarial loss gives a guarantee that no allowed attack will fool the network. By definition, no adversarial perturbations are possible because the loss is small for all perturbations allowed by our attack model. Hence, we now focus our attention on obtaining a good solution to (2.1).

Unfortunately, while the overall guarantee provided by the saddle point problem is evidently useful, it is not clear whether we can actually find a good solution in reasonable time. Solving the saddle point problem (2.1) involves tackling both a non-convex outer minimization problem and a non-concave inner maximization problem. One of our key contributions is demonstrating that, in practice, one can solve the saddle point problem after all. In particular, we now discuss an experimental exploration of the structure given by the non-concave inner problem. We argue that the loss landscape corresponding to this problem has a surprisingly tractable structure of local maxima. This structure also points towards projected gradient descent as the “ultimate” first-order adversary. Sections 4 and 5 then show that the resulting trained networks are indeed robust against a wide range of attacks, provided the networks are sufficiently large.

3.1 The Landscape of Adversarial Examples

Recall that the inner problem corresponds to finding an adversarial example for a given network and data point (subject to our attack model). As this problem requires us to maximize a highly nonconcave function, one would expect it to be intractable. Indeed, this is the conclusion reached by prior work which then resorted to linearizing the inner maximization problem [15, 26]. As pointed out above, this linearization approach yields well-known methods such as FGSM. While training against FGSM adversaries has shown some successes, recent work also highlights important shortcomings of this one-step approach [29]—slightly more sophisticated adversaries can still find points of high loss.

To understand the inner problem in more detail, we investigate the landscape of local maxima for multiple models on MNIST and CIFAR10. The main tool in our experiments is projected gradient descent (PGD), since it is the standard method for large-scale constrained optimization. In order to explore a large part of the loss landscape, we re-start PGD from many points in the ℓ∞-balls around data points from the respective evaluation sets.


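A sketch of this restart experiment, assuming a randomized attack such as the `pgd_linf` routine sketched earlier: the attack is re-run from many uniformly random starting points inside the ℓ∞-ball around a single example, and the final loss of each run is recorded; the spread of these values is what the following observations and Figures 1 and 2 are about. The number of restarts here is only a placeholder.

```python
import torch
import torch.nn.functional as F

def restart_losses(model, x, y, attack, num_restarts=100):
    """Re-run a randomized attack on the same example and record the final loss of each run.

    `attack` is any routine with signature attack(model, x, y) -> x_adv that draws a fresh
    uniformly random starting point in the allowed perturbation set on every call.
    """
    losses = []
    for _ in range(num_restarts):
        x_adv = attack(model, x, y)
        with torch.no_grad():
            losses.append(F.cross_entropy(model(x_adv), y).item())
    return losses
```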
Surprisingly, our experiments show that the inner problem is tractable after all, at least from the perspective of first-order methods. While there are many local maxima spread widely apart within x_i + S, they tend to have very well-concentrated loss values. This echoes the folklore belief that training neural networks is possible because the loss (as a function of model parameters) typically has many local minima with very similar values.

Specifically, in our experiments we found the following phenomena:

We observe that the loss achieved by the adversary increases in a fairly consistent way and plateaus rapidly when performing projected ℓ∞ gradient descent for randomly chosen starting points inside x + S (see Figure 1).

Figure 1: Cross-entropy loss values while creating an adversarial example from the MNIST and CIFAR10 evaluation datasets. The plots show how the loss evolves during 20 runs of projected gradient descent (PGD). Each run starts at a uniformly random point in the ℓ∞-ball around the same natural example (additional plots for different examples appear in Figure 11). The adversarial loss plateaus after a small number of iterations. The optimization trajectories and final loss values are also fairly clustered, especially on CIFAR10. Moreover, the final loss values on adversarially trained networks are significantly smaller than on their standard counterparts.

Investigating the concentration of maxima further, we observe that over a large number of random restarts, the loss of the final iterate follows a well-concentrated distribution without extreme outliers (see Figure 2; we verified this concentration based on 10^5 restarts).

To demonstrate that maxima are noticeably distinct, we also measured the ℓ2 distance and angles between all pairs of them and observed that distances are distributed close to the expected distance between two random points in the ℓ∞-ball, and angles are close to 90°. Along the line segment between local maxima, the loss is convex, attaining its maximum at the endpoints and is reduced by a constant factor in the middle. Nevertheless, for the entire segment, the loss is considerably higher than that of a random point.


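A small PyTorch sketch of these geometric measurements: given the perturbations δ_i = x_adv,i − x found by different restarts, it computes pairwise ℓ2 distances and angles. The reference value ε·sqrt(2d/3) returned alongside them is the root-mean-square ℓ2 distance between two points drawn independently and uniformly from the ℓ∞-ball of radius ε; this closed form is our own addition for the comparison and is not stated in the text.

```python
import math
import torch

def pairwise_geometry(deltas, eps):
    """Pairwise l2 distances and angles (degrees) between perturbations from different restarts.

    `deltas` is an (n, d) tensor whose rows are flattened perturbations x_adv - x of the
    local maxima found by PGD.
    """
    n, d = deltas.shape
    dists, angles = [], []
    for i in range(n):
        for j in range(i + 1, n):
            a, b = deltas[i], deltas[j]
            dists.append(torch.norm(a - b).item())
            cos = torch.dot(a, b) / (torch.norm(a) * torch.norm(b) + 1e-12)
            angles.append(math.degrees(math.acos(cos.clamp(-1.0, 1.0).item())))
    reference = eps * math.sqrt(2.0 * d / 3.0)   # RMS distance of two uniform points in the l_inf ball
    return dists, angles, reference
```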
Finally, we observe that the distribution of maxima suggests that the recently developed subspace view of adversarial examples is not fully capturing the richness of attacks [29]. In particular, we observe adversarial perturbations with negative inner product with the gradient of the example, and deteriorating overall correlation with the gradient direction as the scale of perturbation increases.

All of this evidence points towards PGD being a “universal” adversary among first-order approaches, as we will see next.

Figure 2: Values of the local maxima given by the cross-entropy loss for five examples from the MNIST and CIFAR10 evaluation datasets. For each example, we start projected gradient descent (PGD) from 10^5 uniformly random points in the ℓ∞-ball around the example and iterate PGD until the loss plateaus. The blue histogram corresponds to the loss on a standard network, while the red histogram corresponds to the adversarially trained counterpart. The loss is significantly smaller for the adversarially trained networks, and the final loss values are very concentrated without any outliers.

3.2 First-Order Adversaries

Our experiments show that the local maxima found by PGD all have similar loss values, both for normally trained networks and adversarially trained networks. This concentration phenomenon suggests an intriguing view on the problem in which robustness against the PGD adversary yields robustness against all first-order adversaries, i.e., attacks that rely only on first-order information. As long as the adversary only uses gradients of the loss function with respect to the input, we conjecture that it will not find significantly better local maxima than PGD. We give more experimental evidence for this hypothesis in Section 5: if we train a network to be robust against PGD adversaries, it becomes robust against a wide range of other attacks as well.

Of course, our exploration with PGD does not preclude the existence of some isolated maxima with much larger function value. However, our experiments suggest that such better local maxima are hard to find with first order methods: even a large number of random restarts did not find function values with significantly different loss values. Incorporating the computational power of the adversary into the attack model should be reminiscent of the notion of polynomially bounded adversary that is a cornerstone of modern cryptography. There, this classic attack model allows the adversary to only solve problems that require at most polynomial computation time. Here, we employ an optimization-based view on the power of the adversary as it is more suitable in the context of machine learning. After all, we have not yet developed a thorough understanding of the computational complexity of many recent machine learning problems. However, the vast majority of optimization problems in ML is solved with first-order methods, and variants of SGD are the most effective way of training deep learning models in particular. Hence we believe that the class of attacks relying on first-order information is, in some sense, universal for the current practice of deep learning.

Put together, these two ideas chart the way towards machine learning models with guaranteed robustness. If we train the network to be robust against PGD adversaries, it will be robust against a wide range of attacks that encompasses all current approaches.

In fact, this robustness guarantee would become even stronger in the context of black-box attacks, i.e., attacks in which the adversary does not have a direct access to the target network. Instead, the adversary only has less specific information such as the (rough) model architecture and the training data set. One can view this attack model as an example of “zero order” attacks, i.e., attacks in which the adversary has no direct access to the classifier and is only able to evaluate it on chosen examples without gradient feedback.

We discuss transferability in Section B of the appendix. We observe that increasing network capacity and strengthening the adversary we train against (FGSM or PGD training, rather than standard training) improves resistance against transfer attacks. Also, as expected, the resistance of our best models to such attacks tends to be significantly larger than to the (strongest) first order attacks.

3.3 Descent Directions for Adversarial Training

The preceding discussion suggests that the inner optimization problem can be successfully solved by applying PGD. In order to train adversarially robust networks, we also need to solve the outer optimization problem of the saddle point formulation (2.1), that is, find model parameters that minimize the “adversarial loss”, the value of the inner maximization problem.

In the context of training neural networks, the main method for minimizing the loss function is Stochastic Gradient Descent (SGD). A natural way of computing the gradient of the outer problem, ∇_{θ} ρ(θ), is computing the gradient of the loss function at a maximizer of the inner problem. This corresponds to replacing the input points by their corresponding adversarial perturbations and normally training the network on the perturbed input. A priori, it is not clear that this is a valid descent direction for the saddle point problem. However, for the case of continuously differentiable functions, Danskin’s theorem, a classic theorem in optimization, states that this is indeed true and gradients at inner maximizers correspond to descent directions for the saddle point problem.


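A sketch of the resulting training procedure in PyTorch: each batch is first replaced by approximate inner maximizers, and SGD then takes a step on the loss evaluated at these adversarial points, which is exactly the descent direction suggested by the Danskin-style argument above. The attack routine, optimizer, and data loader are assumed to be supplied by the caller; this is an illustration of the scheme, not the authors' training code.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, attack):
    """One epoch of adversarial training: SGD on the loss at approximate inner maximizers.

    `attack(model, x, y)` should return perturbed inputs inside the allowed set S,
    e.g. the PGD sketch above with a random start.
    """
    model.train()
    for x, y in loader:
        x_adv = attack(model, x, y)              # approximately solve the inner maximization
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # outer objective evaluated at the maximizer
        loss.backward()                          # gradient w.r.t. theta, as in Danskin's theorem
        optimizer.step()
```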
Despite the fact that the exact assumptions of Danskin’s theorem do not hold for our problem (the function is not continuously differentiable due to ReLU and max-pooling units, and we are only computing approximate maximizers of the inner problem), our experiments suggest that we can still use these gradients to optimize our problem. By applying SGD using the gradient of the loss at adversarial examples we can consistently reduce the loss of the saddle point problem during training, as can be seen in Figure 5. These observations suggest that we reliably optimize the saddle point formulation (2.1) and thus train robust classifiers. We formally state Danskin’s theorem and describe how it applies to our problem in Section A of the Appendix.

4 Network Capacity and Adversarial Robustness

Solving the problem from Equation (2.1) successfully is not sufficient to guarantee robust and accurate classification. We need to also argue that the value of the problem (i.e. the final loss we achieve against adversarial examples) is small, thus providing guarantees for the performance of our classifier. In particular, achieving a very small value corresponds to a perfect classifier, which is robust to adversarial inputs.

For a fixed set S of possible perturbations, the value of the problem is entirely dependent on the architecture of the classifier we are learning. Consequently, the architectural capacity of the model becomes a major factor affecting its overall performance. At a high level, classifying examples in a robust way requires a stronger classifier, since the presence of adversarial examples changes the decision boundary of the problem to a more complicated one (see Figure 3 for an illustration).

Figure 3: A conceptual illustration of standard vs. adversarial decision boundaries. Left: A set of points that can be easily separated with a simple (in this case, linear) decision boundary. Middle: The simple decision boundary does not separate the ℓ∞-balls (here, squares) around the data points. Hence there are adversarial examples (the red stars) that will be misclassified. Right: Separating the ℓ∞-balls requires a significantly more complicated decision boundary. The resulting classifier is robust to adversarial examples with bounded ℓ∞-norm perturbations.

Our experiments verify that capacity is crucial for robustness, as well as for the ability to successfully train against strong adversaries. For the MNIST dataset, we consider a simple convolutional network and study how its behavior changes against different adversaries as we keep doubling the size of network (i.e. double the number of convolutional filters and the size of the fully connected layer). The initial network has a convolutional layer with 2 filters, followed by another convolutional layer with 4 filters, and a fully connected hidden layer with 64 units. Convolutional layers are followed by 2 × 2 max-pooling layers and adversarial examples are constructed with ε = 0.3. The results are in Figure 4.


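A sketch of a convolutional network parameterized by such a capacity scale, matching the description above (2 and 4 convolutional filters and a 64-unit fully connected layer at scale 1, with each doubling of `scale` doubling these widths). The 5×5 kernels and padding are assumptions chosen so that the spatial dimensions work out for 28×28 MNIST inputs; this is an illustration of the capacity sweep rather than the exact architecture used in the paper.

```python
import torch.nn as nn

class ScaledMnistNet(nn.Module):
    """Simple MNIST convnet whose width is multiplied by `scale` (1, 2, 4, ...)."""

    def __init__(self, scale=1):
        super().__init__()
        c1, c2, fc = 2 * scale, 4 * scale, 64 * scale
        self.features = nn.Sequential(
            nn.Conv2d(1, c1, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(c1, c2, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(c2 * 7 * 7, fc), nn.ReLU(),   # 28x28 -> 14x14 -> 7x7 after two poolings
            nn.Linear(fc, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```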
For the CIFAR10 dataset, we used a ResNet model [13]. We performed data augmentation using random crops and flips, as well as per-image standardization. To increase the capacity, we modified the network, incorporating wider layers by a factor of 10. This results in a network with 5 residual units with (16, 160, 320, 640) filters each. This network can achieve an accuracy of 95.2% when trained with natural examples. Adversarial examples were constructed with ε = 8. Results on capacity experiments appear in Figure 4.

We observe the following phenomena:

Capacity alone helps. We observe that increasing the capacity of the network when training using only natural examples (apart from increasing accuracy on these examples) increases the robustness against one-step perturbations. This effect is greater when considering adversarial examples with smaller ε.

FGSM adversaries don’t increase robustness (for large ε). When training the network using adversarial examples generated with the FGSM, we observe that the network overfits to these adversarial examples. This behavior is known as label leaking [18] and stems from the fact that the adversary produces a very restricted set of adversarial examples that the network can overfit to. These networks have poor performance on natural examples and don’t exhibit any kind of robustness against PGD adversaries. For the case of smaller ε, the loss is often linear enough in the ℓ∞-ball around natural examples that FGSM finds adversarial examples close to those found by PGD, thus making it a reasonable adversary to train against.

Weak models may fail to learn non-trivial classifiers. In the case of small capacity networks, attempting to train against a strong adversary (PGD) prevents the network from learning anything meaningful. The network converges to always predicting a fixed class, even though it could converge to an accurate classifier through standard training. The small capacity of the network forces the training procedure to sacrifice performance on natural examples in order to provide any kind of robustness against adversarial inputs.

The value of the saddle point problem decreases as we increase the capacity. Fixing an adversary model, and training against it, the value of (2.1) drops as capacity increases, indicating that the model can fit the adversarial examples increasingly well.

More capacity and stronger adversaries decrease transferability. Either increasing the capacity of the network, or using a stronger method for the inner optimization problem reduces the effectiveness of transferred adversarial inputs. We validate this experimentally by observing that the correlation between gradients from the source and the transfer network, becomes less significant as capacity increases. We describe our experiments in Section B of the appendix.

5 Experiments: Adversarially Robust Deep Learning Models

Following the understanding of the problem we developed in previous sections, we can now apply our proposed approach to train robust classifiers. As our experiments so far demonstrated, we need to focus on two key elements: a) train a sufficiently high capacity network, b) use the strongest possible adversary.

For both MNIST and CIFAR10, the adversary of choice will be projected gradient descent (PGD) starting from a random perturbation around the natural example. This corresponds to our notion of a "complete" first-order adversary, an algorithm that can efficiently maximize the loss of an example using only first order information. Since we are training the model for multiple epochs, there is no benefit from restarting PGD multiple times per batch—a new start will be chosen the next time each example is encountered.

When training against that adversary, we observe a steady decrease in the training loss of adversarial examples, illustrated in Figure 5. This behavior indicates that we are indeed successfully solving our original optimization problem during training.

Figure 4: The effect of network capacity on the performance of the network. We trained MNIST and CIFAR10 networks of varying capacity on: (a) natural examples, (b) with FGSM-made adversarial examples, (c) with PGD-made adversarial examples. In the first three plots/tables of each dataset, we show how the standard and adversarial accuracy changes with respect to capacity for each training regime. In the final plot/table, we show the value of the cross-entropy loss on the adversarial examples the networks were trained on. This corresponds to the value of our saddle point formulation (2.1) for different sets of allowed perturbations.

We evaluate the trained models against a range of adversaries. We illustrate our results in Table 1 for MNIST and Table 2 for CIFAR10. The adversaries we consider are:

  • White-box attacks with PGD for different numbers of iterations and restarts, denoted by source A.

  • White-box attacks with PGD using the Carlini-Wagner (CW) loss function [6] (directly optimizing the difference between correct and incorrect logits; a sketch of this margin loss appears after this list). This is denoted as CW, where the corresponding attack with a high confidence parameter (κ = 50) is denoted as CW+.

  • Black-box attacks from an independently trained copy of the network, denoted A’.

  • Black-box attacks from a version of the same network trained only on natural examples, denoted Anat.

  • Black-box attacks from a different convolution architecture, denoted B, described in Tramer et al. 2017 [29].


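As referenced in the list above, the following is a sketch of a CW-style margin loss that PGD can maximize in place of the cross-entropy, following the standard formulation of this loss [6]; the exact variant used in the paper's evaluation may differ in details. The confidence parameter κ caps the objective once the best incorrect logit leads the correct one by κ, so κ = 50 corresponds to the high-confidence CW+ attack.

```python
import torch

def cw_margin_loss(logits, y, kappa=0.0):
    """CW-style margin: mean over the batch of min(max_{i != y} Z_i - Z_y, kappa)."""
    correct = logits.gather(1, y.unsqueeze(1)).squeeze(1)     # Z_y
    masked = logits.clone()
    masked.scatter_(1, y.unsqueeze(1), float("-inf"))         # exclude the true class
    best_wrong = masked.max(dim=1).values                     # max_{i != y} Z_i
    return torch.clamp(best_wrong - correct, max=kappa).mean()
```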
MNIST. We run 40 iterations of projected gradient descent as our adversary, with a step size of 0.01 (we choose to take gradient steps in the ℓ∞-norm, i.e. adding the sign of the gradient, since this makes the choice of the step size simpler). We train and evaluate against perturbations of size ε = 0.3. We use a network consisting of two convolutional layers with 32 and 64 filters respectively, each followed by 2 × 2 max-pooling, and a fully connected layer of size 1024. When trained with natural examples, this network reaches 99.2% accuracy on the evaluation set. However, when evaluating on examples perturbed with FGSM the accuracy drops to 6.4%. The resulting adversarial accuracies are reported in Table 1. Given that the resulting MNIST model is very robust to ℓ∞-bounded adversaries, we investigated the learned parameters in order to understand how they affect adversarial robustness. The results of the investigation are presented in Appendix C. In particular, we found that the first convolutional layer of the network is learning to threshold input pixels while other weights tend to be sparser.


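For reference, a small sketch of how adversarial accuracy figures such as those reported in Tables 1 and 2 can be computed, given any attack routine; the data loader and the attack (for example the earlier `pgd_linf` sketch with eps=0.3, alpha=0.01, steps=40 for MNIST) are assumptions supplied by the caller.

```python
import torch

def adversarial_accuracy(model, loader, attack):
    """Fraction of examples still classified correctly after the given attack."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x_adv = attack(model, x, y)     # e.g. lambda m, x, y: pgd_linf(m, x, y, 0.3, 0.01, 40)
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```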
CIFAR10. For the CIFAR10 dataset, we use the two architectures described in Section 4 (the original ResNet and its 10× wider variant). We trained the network against a PGD adversary with ℓ∞ projected gradient descent again, this time using 7 steps of size 2, and a total ε = 8. For our hardest adversary we chose 20 steps with the same settings, since other hyperparameter choices didn’t offer a significant decrease in accuracy. The results of our experiments appear in Table 2.

The adversarial robustness of our network is significant, given the power of iterative adversaries, but still far from satisfactory. We believe that these results can be improved by further pushing along these directions and training networks of larger capacity.

Resistance for different values of ε and ℓ2-bounded attacks. In order to perform a broader evaluation of the adversarial robustness of our models, we run two additional experiments. On one hand, we investigate the resistance to ℓ∞-bounded attacks for different values of ε. On the other hand, we examine the resistance of our model to attacks that are bounded in ℓ2-norm as opposed to ℓ∞-norm. In the case of ℓ2-bounded PGD we take steps in the gradient direction (not the sign of it) and normalize the steps to be of fixed norm to facilitate step size tuning. For all PGD attacks, we use 100 steps and set the step size to be 2.5 · ε/100 to ensure that we can reach the boundary of the ε-ball from any starting point within it (and still allow for movement on the boundary). Note that the models were trained against ℓ∞-bounded attacks with the original value of ε = 0.3 for MNIST, and ε = 8 for CIFAR10. The results appear in Figure 6.


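A sketch of this ℓ2 variant: the step moves along the re-normalized gradient direction rather than its sign, and after every step the perturbation is projected back onto the ℓ2-ball of radius ε; the step size 2.5·ε/100 over 100 steps follows the setup described above. As before, this is an illustrative PyTorch sketch (assuming NCHW image batches), not the authors' code.

```python
import torch
import torch.nn.functional as F

def pgd_l2(model, x, y, eps, steps=100):
    """PGD with fixed-norm gradient steps and projection onto the l2 ball of radius eps."""
    alpha = 2.5 * eps / steps                   # step size chosen as in the evaluation setup
    x_nat = x.clone().detach()
    x_adv = x_nat.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        grad_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        x_adv = x_adv.detach() + alpha * grad / grad_norm       # step of fixed l2 length alpha
        delta = x_adv - x_nat
        delta_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        factor = torch.clamp(eps / delta_norm, max=1.0)         # project onto the eps l2 ball
        x_adv = (x_nat + delta * factor).clamp(0.0, 1.0)
    return x_adv.detach()
```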
We observe that for smaller ε than the one used during training, the models achieve equal or higher accuracy, as expected. For MNIST, we notice a large drop in robustness for slightly larger ε values, potentially due to the fact that the threshold operators learned are tuned to the exact value of ε used during training (Appendix C). In contrast, the decay for the case of CIFAR10 is smoother.

For the case of ℓ2-bounded attacks on MNIST, we observe that PGD is unable to find adversarial examples even for quite large values of ε, e.g., ε = 4.5. To put this value of ε into perspective, we provide a sample of corresponding adversarial examples in Figure 12 of Appendix D. We observe that these perturbations are significant enough that they would change the ground-truth label of the images and it is thus unlikely that our models are actually that robust. Indeed, subsequent work [20, 25] has found that PGD is in fact overestimating the ℓ2-robustness of this model. This behavior is potentially due to the fact that the learned threshold filters (Appendix C) mask the gradient, preventing PGD from maximizing the loss. Attacking the model with a decision-based attack [4] which does not rely on model gradients reveals that the model is significantly more brittle against ℓ2-bounded attacks. Nevertheless, the ℓ∞-trained model is still more robust to ℓ2 attacks compared to a standard model.

6 Related Work

Due to the large body of work on adversarial examples we focus only on the most related papers here. Before we compare our contributions, we remark that robust optimization has been studied outside deep learning for multiple decades (see [1] for an overview of this field). We also want to note that the study of adversarial ML predates the widespread use of deep neural networks [8, 10] (see [3] for an overview of earlier work).

Adversarial training was introduced in [11], however the adversary utilized was quite weak—it relied on linearizing the loss around the data points. As a result, while these models were robust against this particular adversary, they were completely vulnerable to slightly more sophisticated adversaries utilizing iterative attacks.

Recent work on adversarial training on ImageNet also observed that the model capacity is important for adversarial training [18]. In contrast to this paper, we find that training against multi-step methods (PGD) does lead to resistance against such adversaries.

In [15] and [26] a version of the min-max optimization problem is also considered for adversarial training. There are, however, three important differences between the formerly mentioned results and the present paper. Firstly, the authors claim that the inner maximization problem can be difficult to solve, whereas we explore the loss surface in more detail and find that randomly re-started projected gradient descent often converges to solutions with comparable quality. This shows that it is possible to obtain sufficiently good solutions to the inner maximization problem, which offers good evidence that deep neural networks can be immunized against adversarial examples. Secondly, they consider only one-step adversaries, while we work with multi-step methods. Additionally, while the experiments in [26] produce promising results, they are only evaluated against FGSM. However, FGSM-only evaluations are not fully reliable. One evidence for that is that [26] reports 70% accuracy for ε = 0.7, but any adversary that is allowed to perturb each pixel by more than 0.5 can construct a uniformly gray image, thus fooling any classifier.

A more recent paper [29] also explores the transferability phenomenon. This exploration focuses mostly on the region around natural examples where the loss is (close to) linear. When large perturbations are allowed, this region does not give a complete picture of the adversarial landscape. This is confirmed by our experiments, as well as pointed out by [29].

7 Conclusion

Our findings provide evidence that deep neural networks can be made resistant to adversarial attacks. As our theory and experiments indicate, we can design reliable adversarial training methods. One of the key insights behind this is the unexpectedly regular structure of the underlying optimization task: even though the relevant problem corresponds to the maximization of a highly non-concave function with many distinct local maxima, their values are highly concentrated. Overall, our findings give us hope that adversarially robust deep learning models may be within current reach.

For the MNIST dataset, our networks are very robust, achieving high accuracy for a wide range of powerful ℓ∞-bounded adversaries and large perturbations. Our experiments on CIFAR10 have not reached the same level of performance yet. However, our results already show that our techniques lead to significant increase in the robustness of the network. We believe that further exploring this direction will lead to adversarially robust networks for this dataset.

Acknowledgments

Aleksander Mądry, Aleksandar Makelov, and Dimitris Tsipras were supported by the NSF Grant No. 1553428, a Google Research Fellowship, and a Sloan Research Fellowship. Ludwig Schmidt was supported by a Google PhD Fellowship. Adrian Vladu was supported by the NSF Grants No. 1111109 and No. 1553428.

We thank Wojciech Matusik for kindly providing us with computing resources to perform this work.

References

[1] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton University Press, 2009.

[2] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases (ECML-KDD), 2013.

[3] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. 2018.

[4] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations (ICLR), 2017.

[5] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Workshop on Artificial Intelligence and Security (AISec), 2017.

[6] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Symposium on Security and Privacy (SP), 2017.

[7] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167, 2008.

[8] Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, and Deepak Verma. Adversarial classification. In international conference on Knowledge discovery and data mining, 2004.

[9] Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning, 107(3):481–508, 2018.

[10] Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd international conference on Machine learning, 2006.

[11] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In international conference on computer vision (ICCV), 2015.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[14] Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defense: Ensembles of weak defenses are not strong. In USENIX Workshop on Offensive Technologies (WOOT), 2017.

[15] Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvari. Learning with a strong adversary. arXiv preprint arXiv:1511.03034, 2015.

[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. In Technical report, 2009.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.

[18] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations (ICLR), 2017.

[19] Yann LeCun. The mnist database of handwritten digits. In Technical report, 1998.

[20] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113, 2018.

[21] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016.

[22] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on computer vision and pattern recognition (CVPR), 2015.

[23] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.

[24] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Symposium on Security and Privacy (SP), 2016.

[25] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations (ICLR), 2019.

[26] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 307:195–204, 2018.

[27] Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. In Transactions on Signal Processing, 2017.

[28] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

[29] Florian Tramer, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.

[30] Abraham Wald. Statistical decision functions which minimize the maximum risk. In Annals of Mathematics, 1945.

[31] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In Network and Distributed Systems Security Symposium (NDSS), 2018.
