Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models
Original link:
https://arxiv.org/pdf/1805.06605.pdf
GB/T 7714 Samangouei P, Kabkab M, Chellappa R. Defense-gan: Protecting classifiers against adversarial attacks using generative models[J]. arXiv preprint arXiv:1805.06605, 2018.
MLA Samangouei, Pouya, Maya Kabkab, and Rama Chellappa. "Defense-gan: Protecting classifiers against adversarial attacks using generative models." arXiv preprint arXiv:1805.06605 (2018).
APA Samangouei, P., Kabkab, M., & Chellappa, R. (2018). Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605.
ABSTRACT
In recent years, deep neural network approaches have been widely adopted for machine learning tasks, including classification. However, they were shown to be vulnerable to adversarial perturbations: carefully crafted small perturbations can cause misclassification of legitimate images. We propose Defense-GAN, a new framework leveraging the expressive capability of generative models to defend deep neural networks against such attacks. Defense-GAN is trained to model the distribution of unperturbed images. At inference time, it finds a close output to a given image which does not contain the adversarial changes. This output is then fed to the classifier. Our proposed method can be used with any classification model and does not modify the classifier structure or training procedure. It can also be used as a defense against any attack as it does not assume knowledge of the process for generating the adversarial examples. We empirically show that Defense-GAN is consistently effective against different attack methods and improves on existing defense strategies. Our code has been made publicly available at https://github.com/kabkabm/defensegan.
1 INTRODUCTION
Despite their outstanding performance on several machine learning tasks, deep neural networks have been shown to be susceptible to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015). These attacks come in the form of adversarial examples: carefully crafted perturbations added to a legitimate input sample. In the context of classification, these perturbations cause the legitimate sample to be misclassified at inference time (Szegedy et al., 2014; Goodfellow et al., 2015; Papernot et al., 2016b; Liu et al., 2017). Such perturbations are often small in magnitude and do not affect human recognition but can drastically change the output of the classifier.
Recent literature has considered two types of threat models: black-box and white-box attacks. Under the black-box attack model, the attacker does not have access to the classification model parameters; whereas in the white-box attack model, the attacker has complete access to the model architecture and parameters, including potential defense mechanisms (Papernot et al., 2017; Tramèr et al., 2017; Carlini & Wagner, 2017).
Various defenses have been proposed to mitigate the effect of adversarial attacks. These defenses can be grouped under three different approaches: (1) modifying the training data to make the classifier more robust against attacks, e.g., adversarial training which augments the training data of the classifier with adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), (2) modifying the training procedure of the classifier to reduce the magnitude of gradients, e.g., defensive distillation (Papernot et al., 2016d), and (3) attempting to remove the adversarial noise from the input samples (Hendrycks & Gimpel, 2017; Meng & Chen, 2017). All of these approaches have limitations in the sense that they are effective against either white-box attacks or black-box attacks, but not both (Tramèr et al., 2017; Meng & Chen, 2017). Furthermore, some of these defenses are devised with specific attack models in mind and are not effective against new attacks.
In this paper, we propose a novel defense mechanism which is effective against both white-box and black-box attacks. We propose to leverage the representative power of Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) to diminish the effect of the adversarial perturbation, by “projecting” input images onto the range of the GAN’s generator prior to feeding them to the classifier. In the GAN framework, two models are trained simultaneously in an adversarial setting: a generative model that emulates the data distribution, and a discriminative model that predicts whether a certain input came from real data or was artificially created. The generative model learns a mapping G from a low-dimensional vector z ∈ R^k to the high-dimensional input sample space R^n. During training of the GAN, G is encouraged to generate samples which resemble the training data. It is, therefore, expected that legitimate samples will be close to some point in the range of G, whereas adversarial samples will be further away from the range of G. Furthermore, “projecting” the adversarial examples onto the range of the generator G can have the desirable effect of reducing the adversarial perturbation. The projected output, computed using Gradient Descent (GD), is fed into the classifier instead of the original (potentially adversarially modified) image. We empirically demonstrate that this is an effective defense against both black-box and white-box attacks on two benchmark image datasets.
The rest of the paper is organized as follows. We introduce the necessary background regarding known attack models, defense mechanisms, and GANs in Section 2. Our defense mechanism, which we call Defense-GAN, is formally motivated and introduced in Section 3. Finally, experimental results, under different threat models, as well as comparisons to other defenses are presented in Section 4.
2 RELATED WORK AND BACKGROUND INFORMATION
In this work, we propose to use GANs for the purpose of defending against adversarial attacks in classification problems. Before detailing our approach in the next section, we explain related work in three parts. First, we discuss different attack models employed in the literature. We, then, go over related defense mechanisms against these attacks and discuss their strengths and shortcomings. Lastly, we explain necessary background information regarding GANs.
2.1 ATTACK MODELS AND ALGORITHMS
Various attack models and algorithms have been used to target classifiers. All attack models we consider aim to find a perturbation δ to be added to a (legitimate) input x, resulting in the adversarial example x̃ = x + δ. The ℓ∞-norm of the perturbation is denoted by ε (Goodfellow et al., 2015) and is chosen to be small enough so as to remain undetectable. We consider two threat levels: black- and white-box attacks.
2.1.1 WHITE-BOX ATTACK MODELS
White-box models assume that the attacker has complete knowledge of all the classifier parameters, i.e., network architecture and weights, as well as the details of any defense mechanism. Given an input image x and its associated ground-truth label y, the attacker thus has access to the loss function J(x, y) used to train the network, and uses it to compute the adversarial perturbation δ. Attacks can be targeted, in that they attempt to cause the perturbed image to be misclassified to a specific target class, or untargeted when no target class is specified.
In this work, we focus on untargeted white-box attacks computed using the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), the Randomized Fast Gradient Sign Method (RAND+FGSM) (Tramèr et al., 2017), and the Carlini-Wagner (CW) attack (Carlini & Wagner, 2017). Although other attack models exist, such as the Iterative FGSM (Kurakin et al., 2017), the Jacobian-based Saliency Map Attack (JSMA) (Papernot et al., 2016b), and Deepfool (Moosavi-Dezfooli et al., 2016), we focus on these three models as they cover a good breadth of attack algorithms. FGSM is a very simple and fast attack algorithm which makes it extremely amenable to real-time attack deployment. On the other hand, RAND+FGSM, an equally simple attack, increases the power of FGSM for white-box attacks (Tramèr et al., 2017), and finally, the CW attack is one of the most powerful white-box attacks to date (Carlini & Wagner, 2017).
Fast Gradient Sign Method (FGSM) Given an image x and its corresponding true label y, the FGSM attack sets the perturbation δ to:
\delta=\epsilon\cdot \operatorname{sign}(\nabla_{x}J(x,y))\tag{1}
FGSM (Goodfellow et al., 2015) was designed to be extremely fast rather than optimal. It simply uses the sign of the gradient at every pixel to determine the direction with which to change the corresponding pixel value.
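The FGSM update is simple to implement in any automatic-differentiation framework. The following is a minimal illustrative sketch (the paper's released code is TensorFlow-based; this PyTorch version, the `classifier` handle, and the [0, 1] pixel range are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(classifier, x, y, eps):
    """Untargeted FGSM, Eq. (1): delta = eps * sign(grad_x J(x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(classifier(x), y)   # J(x, y)
    loss.backward()
    delta = eps * x.grad.sign()                # Eq. (1)
    # Keep the adversarial example in the valid pixel range (assumed to be [0, 1]).
    return (x + delta).clamp(0.0, 1.0).detach()
```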
Randomized Fast Gradient Sign Method (RAND+FGSM) The RAND+FGSM (Tramèr et al., 2017) attack is a simple yet effective method to increase the power of FGSM against models which were adversarially trained. The idea is to first apply a small random perturbation before using FGSM. More explicitly, for α < ε, random noise is first added to the legitimate image x:
x^{\prime}=x+\alpha\cdot \operatorname{sign}(\mathcal{N}(\mathbf{0}^{n},\mathbf{I}^{n})).\tag{2}
Then, the FGSM attack is computed on x′, resulting in
\widetilde{x}=x^{\prime}+(\epsilon-\alpha)\cdot \operatorname{sign}(\nabla_{x^{\prime}}J(x^{\prime},y)).\tag{3}
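Equations (2) and (3) compose directly: a random sign step of size α followed by an FGSM step of size ε − α. A minimal sketch, reusing the hypothetical `fgsm_attack` above:

```python
import torch

def rand_fgsm_attack(classifier, x, y, eps, alpha):
    """RAND+FGSM, Eqs. (2)-(3). Requires alpha < eps."""
    # Eq. (2): x' = x + alpha * sign(N(0^n, I^n)), clipped to the assumed pixel range.
    x_prime = (x + alpha * torch.randn_like(x).sign()).clamp(0.0, 1.0)
    # Eq. (3): FGSM step of size (eps - alpha), with the gradient taken at x'.
    return fgsm_attack(classifier, x_prime, y, eps - alpha)
```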
The Carlini-Wagner (CW) attack The CW attack is an effective optimization-based attack model (Carlini & Wagner, 2017). In many cases, it can reduce the classifier accuracy to almost 0% (Carlini & Wagner, 2017; Meng & Chen, 2017). The perturbation δ is found by solving an optimization problem of the form:
\min_{\delta\in \mathbb{R}^{n}} \lVert \delta \rVert_{p}+c\cdot f(x+\delta)\quad s.t.~x+\delta\in [0,1]^{n},\tag{4}
where f is an objective function that drives the example x to be misclassified, and c > 0 is a suitably chosen constant. The ℓ2, ℓ0, and ℓ∞ norms are considered. We refer the reader to (Carlini & Wagner, 2017) for details regarding the approach to solving (4) and setting the constant c.
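For intuition, the sketch below minimizes a simplified untargeted ℓ2 version of (4) with projected gradient descent; the original attack additionally uses a change of variables and a binary search over c (Carlini & Wagner, 2017), which are omitted here. The margin function f and all hyper-parameter defaults are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cw_l2_attack(classifier, x, y, c=100.0, lr=10.0, steps=100, kappa=0.0):
    """Simplified untargeted L2 CW sketch of Eq. (4): min ||delta||_2^2 + c * f(x + delta)."""
    x = x.detach()
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    num_classes = classifier(x).size(1)
    onehot = F.one_hot(y, num_classes).float()
    for _ in range(steps):
        logits = classifier(x + delta)
        true_logit = (logits * onehot).sum(dim=1)
        best_other = (logits - 1e9 * onehot).max(dim=1).values
        # f <= 0 once another class outscores the true class by margin kappa.
        f = torch.clamp(true_logit - best_other, min=-kappa)
        loss = (delta.flatten(1).pow(2).sum(dim=1) + c * f).sum()
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                        # project so x + delta stays in [0, 1]^n
            delta.clamp_(min=-x, max=1.0 - x)
    return (x + delta).detach()
```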
2.1.2 BLACK-BOX ATTACK MODELS
For black-box attacks we consider untargeted FGSM attacks computed on a substitute model (Papernot et al., 2017). As previously mentioned, black-box adversaries have no access to the classifier or defense parameters. It is further assumed that they do not have access to a large training dataset but can query the targeted DNN as a black-box, i.e., access labels produced by the classifier for specific query images. The adversary trains a model, called substitute, which has a (potentially) different architecture than the targeted classifier, using a very small dataset augmented by synthetic images labeled by querying the classifier. Adversarial examples are then found by applying any attack method on the substitute network. It was found that such examples designed to fool the substitute often end up being misclassified by the targeted classifier (Szegedy et al., 2014; Papernot et al., 2017). In other words, black-box attacks are easily transferable from one model to the other.
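The black-box pipeline can be summarized as: query the target for labels, fit a substitute on those labels, then craft FGSM examples on the substitute and transfer them. The sketch below is a bare-bones illustration that reuses the hypothetical `fgsm_attack` above; Jacobian-based dataset augmentation from Papernot et al. (2017) is omitted, and the model handles and training hyper-parameters are assumptions.

```python
import torch
import torch.nn.functional as F

def black_box_fgsm(target, substitute, x_small, eps, epochs=50, lr=1e-3):
    """Train a substitute on labels queried from the black-box target, then attack it."""
    with torch.no_grad():
        y_queried = target(x_small).argmax(dim=1)          # only label access to the target
    opt = torch.optim.Adam(substitute.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(substitute(x_small), y_queried).backward()
        opt.step()
    # Examples crafted on the substitute are then transferred to the target classifier.
    return fgsm_attack(substitute, x_small, y_queried, eps)
```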
2.2 DEFENSE MECHANISMS
Various defense mechanisms have been employed to combat the threat from adversarial attacks. In what follows, we describe one representative defense strategy from each of the three general groups of defenses.
2.2.1 ADVERSARIAL TRAINING
A popular approach to defend against adversarial noise is to augment the training dataset with adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016). Adversarial examples are generated using one or more chosen attack models and added to the training set. This often results in increased robustness when the attack model used to generate the augmented training set is the same as that used by the attacker. However, adversarial training does not perform as well when a different attack strategy is used by the attacker. Additionally, it tends to make the model more robust to white-box attacks than to black-box attacks due to gradient masking (Papernot et al., 2016c; 2017; Tramèr et al., 2017).
2.2.2 DEFENSIVE DISTILLATION
Defensive distillation (Papernot et al., 2016d) trains the classifier in two rounds using a variant of the distillation (Hinton et al., 2014) method. This has the desirable effect of learning a smoother network and reducing the amplitude of gradients around input points, making it difficult for attackers to generate adversarial examples (Papernot et al., 2016d). It was, however, shown that, while defensive distillation is effective against white-box attacks, it fails to adequately protect against black-box attacks transferred from other networks (Carlini & Wagner, 2017).
2.2.3 MAGNET
Recently, Meng & Chen (2017) introduced MagNet as an effective defense strategy. It trains a reformer network (which is an auto-encoder or a collection of auto-encoders) to move adversarial examples closer to the manifold of legitimate, or natural, examples. When using a collection of auto-encoders, one reformer network is chosen at random at test time, thus strengthening the defense. It was shown to be an effective defense against gray-box attacks where the attacker knows everything about the network and defense, except the parameters. MagNet is the closest defense to our approach, as it attempts to reform an adversarial sample using a learnt auto-encoder. The main differences between MagNet and our approach are: (1) we use GANs instead of auto-encoders, and, most importantly, (2) we use GD minimization to find latent codes as opposed to a feedforward encoder network. This makes Defense-GAN more robust, especially against white-box attacks.
2.3 GENERATIVE ADVERSARIAL NETWORKS (GANS)
GANs, originally introduced by Goodfellow et al. (2014), consist of two neural networks, G and D. The generator G maps a low-dimensional latent space to the high-dimensional sample space of x. D is a binary neural network classifier. In the training phase, G and D are typically learned in an adversarial fashion using actual input data samples x and random vectors z. An isotropic Gaussian prior is usually assumed on z. While G learns to generate outputs G(z) that have a distribution similar to that of x, D learns to discriminate between “real” samples x and “fake” samples G(z). D and G are trained in an alternating fashion to minimize the following min-max loss (Goodfellow et al., 2014):
\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))].\tag{5}
It was shown that the optimal GAN is obtained when the resulting generator distribution pg = pdata (Goodfellow et al., 2014).
However, GANs turned out to be difficult to train in practice (Gulrajani et al., 2017), and alternative formulations have been proposed. Arjovsky et al. (2017) introduced Wasserstein GANs (WGANs) which are a variant of GANs that use the Wasserstein distance, resulting in a loss function with more desirable properties:
\min_{G}\max_{D}V_{W}(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[D(x)]-\mathbb{E}_{z\sim p_{z}(z)}[D(G(z))].\tag{6}
In this work, we use WGANs as our generative model due to the stability of their training methods, especially using the approach in (Gulrajani et al., 2017).
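For concreteness, one alternating update of the min-max objective in (6) might look like the sketch below; the gradient penalty of Gulrajani et al. (2017), which the paper actually uses for stable training, is omitted, and the optimizers and latent dimension are assumptions.

```python
import torch

def wgan_step(G, D, x_real, opt_g, opt_d, latent_dim):
    """One alternating WGAN update for Eq. (6): critic update, then generator update."""
    # Critic: maximize E[D(x)] - E[D(G(z))]  (minimize the negative).
    z = torch.randn(x_real.size(0), latent_dim, device=x_real.device)
    d_loss = -(D(x_real).mean() - D(G(z).detach()).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: minimize -E[D(G(z))].
    z = torch.randn(x_real.size(0), latent_dim, device=x_real.device)
    g_loss = -D(G(z)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```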
3 PROPOSED DEFENSE-GAN
We propose a new defense strategy which uses a WGAN trained on legitimate (un-perturbed) training samples to “denoise” adversarial examples. At test time, prior to feeding an image x to the classifier, we project it onto the range of the generator by minimizing the reconstruction error ||G(z) − x||_2^2, using L steps of GD. The resulting reconstruction G(z) is then given to the classifier. Since the generator was trained to model the unperturbed training data distribution, we expect this added step to result in a substantial reduction of any potential adversarial noise. We formally motivate this approach in the following section.
3.1 MOTIVATION
As mentioned in Section 2.3, the GAN min-max loss in (5) admits a global optimum when pg = pdata (Goodfellow et al., 2014). It can be similarly shown that WGAN admits an optimum to its own min-max loss in (6), when the set {x | pg(x) ≠ pdata(x)} has zero Lebesgue measure. Formally,
Lemma 1 A generator distribution pg is a global optimum for the WGAN min-max game defined in (6), if and only if pg = pdata, potentially except on a set of zero Lebesgue measure.
A sketch of the proof can be found in Appendix A.
Additionally, it was shown that, if G and D have enough capacity to represent the data, and if the training algorithm is such that pg converges to pdata, then
\mathbb{E}_{x\sim p_{data}}[\min_{z}\lVert G_{t}(z)-x \rVert_{2}]\to 0\tag{7}
where Gt is the generator of a GAN or WGAN after t steps of its training algorithm (Kabkab et al., 2018).
This serves to show that, under ideal conditions, the addition of the GAN reconstruction loss minimization step should not affect the performance of the classifier on natural, legitimate samples, as such samples should be almost exactly recovered. Furthermore, we hypothesize that this step will help reduce the adversarial noise which follows a different distribution than that of the GAN training examples.
3.2 DEFENSE-GAN ALGORITHM
Defense-GAN is a defense strategy to combat both white-box and black-box adversarial attacks against classification networks. At inference time, given a trained GAN generator G and an image x to be classified, z ∗ is first found so as to minimize
\min_{z}\lVert G(z)-x \rVert_{2}^{2}\tag{8}
G(z ∗ ) is then given as the input to the classifier. The algorithm is illustrated in Figure 1. As (8) is a highly non-convex minimization problem, we approximate it by doing a fixed number L of GD steps using R different random initializations of z (which we call random restarts), as shown in Figures 1 and 2.
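A minimal sketch of this projection step is given below; the released implementation is TensorFlow-based, so this PyTorch version, the SGD step size, and the per-batch (rather than per-sample) restart selection are simplifying assumptions.

```python
import torch

def defense_gan_reconstruct(G, x, latent_dim, L=200, R=10, lr=0.1):
    """Approximate z* = argmin_z ||G(z) - x||_2^2 (Eq. 8) with L GD steps and R restarts."""
    best_rec, best_err = None, float('inf')
    for _ in range(R):                                      # R random restarts of z
        z = torch.randn(x.size(0), latent_dim, device=x.device, requires_grad=True)
        opt = torch.optim.SGD([z], lr=lr)
        for _ in range(L):                                  # L steps of GD on the MSE
            loss = (G(z) - x).pow(2).flatten(1).sum(dim=1).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            rec = G(z)
            err = (rec - x).pow(2).flatten(1).sum(dim=1).mean().item()
        if err < best_err:                                  # keep the best restart
            best_err, best_rec = err, rec
    return best_rec                                         # fed to the classifier instead of x
```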
The GAN is trained on the available classifier training dataset in an unsupervised manner. The classifier can be trained on the original training images, their reconstructions using the generator G, or a combination of the two. As was discussed in Section 3.1, as long as the GAN is appropriately trained and has enough capacity to represent the data, original clean images and their reconstructions should not differ much. Therefore, these two classifier training strategies should, at least theoretically, not differ in performance.
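As a usage note, Defense-GAN-Rec simply substitutes reconstructions for the clean images when training the classifier; a short sketch using the hypothetical `defense_gan_reconstruct` above:

```python
def make_rec_training_batches(G, train_batches, latent_dim, L=200, R=10):
    """Defense-GAN-Rec: train the classifier on G's reconstructions of the clean images;
    Defense-GAN-Orig keeps the original training images instead."""
    return [defense_gan_reconstruct(G, x, latent_dim, L=L, R=R) for x in train_batches]
```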
Compared to existing defense mechanisms, our approach is different in the following aspects:
Defense-GAN can be used in conjunction with any classifier and does not modify the classifier structure itself. It can be seen as an add-on or pre-processing step prior to classification.
If the GAN is representative enough, re-training the classifier should not be necessary and any drop in performance due to the addition of Defense-GAN should not be significant.
Defense-GAN can be used as a defense to any attack: it does not assume an attack model, but simply leverages the generative power of GANs to reconstruct adversarial examples.
Defense-GAN is highly non-linear and white-box gradient-based attacks will be difficult to perform due to the GD loop. A detailed discussion about this can be found in Appendix B.
4 EXPERIMENTS
We assume three different attack threat levels:
Black-box attacks: the attacker does not have access to the details of the classifier and defense strategy. It therefore trains a substitute network to find adversarial examples.
White-box attacks: the attacker knows all the details of the classifier and defense strategy. It can compute gradients on the classifier and defense networks in order to find adversarial examples.
White-box attacks, revisited: in addition to the details of the architectures and parameters of the classifier and defense, the attacker has access to the random seed and random number generator. In the case of Defense-GAN, this means that the attacker knows all the random initializations of z.
We compare our method to adversarial training (Goodfellow et al., 2015) and MagNet (Meng & Chen, 2017) under the FGSM, RAND+FGSM, and CW (with ℓ2 norm) white-box attacks, as well as the FGSM black-box attack. Details of all network architectures used in this paper can be found in Appendix C. When the classifier is trained using the reconstructed images (G(z∗)), we refer to our method as Defense-GAN-Rec, and we use Defense-GAN-Orig when the original images (x) are used to train the classifier. Our GAN follows the WGAN training procedure in (Gulrajani et al., 2017), and details of the generator and discriminator network architectures are given in Table 6. The reformer network (encoder) for the MagNet baseline is provided in Table 7. Our implementation is based on TensorFlow (Abadi et al., 2015) and builds on open-source software: CleverHans by Papernot et al. (2016a) and improved WGAN training by Gulrajani et al. (2017). We use machines equipped with NVIDIA GeForce GTX TITAN X GPUs.
In our experiments, we use two different image datasets: the MNIST handwritten digits dataset (LeCun et al., 1998) and the Fashion-MNIST (F-MNIST) clothing articles dataset (Xiao et al., 2017). Both datasets consist of 60,000 training images and 10,000 testing images. We split the training images into a training set of 50,000 images and hold out a validation set containing 10,000 images. For white-box attacks, the testing set is kept the same (10,000 samples). For black-box attacks, the testing set is divided into a small hold-out set of 150 samples reserved for adversary substitute training, as was done in (Papernot et al., 2017), and the remaining 9,850 samples are used for testing the different methods.
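The data splits described above can be reproduced with a few lines; whether the indices are shuffled before splitting is not specified, so the shuffling and seed below are assumptions.

```python
import numpy as np

def split_datasets(x_train, x_test, seed=0):
    """50,000 train / 10,000 validation from the 60,000 training images; for black-box
    attacks, 150 held-out test images for substitute training and 9,850 for evaluation."""
    rng = np.random.default_rng(seed)
    tr = rng.permutation(len(x_train))
    te = rng.permutation(len(x_test))
    x_tr, x_val = x_train[tr[:50000]], x_train[tr[50000:]]
    x_sub, x_eval = x_test[te[:150]], x_test[te[150:]]
    return x_tr, x_val, x_sub, x_eval
```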
4.1 RESULTS ON BLACK-BOX ATTACKS
In this section, we present experimental results on FGSM black-box attacks. As previously mentioned, the attacker trains a substitute model, which could differ in architecture from the targeted model, using a limited dataset consisting of 150 legitimate images augmented with synthetic images labeled using the target classifier. The classifier and substitute model architectures used and referred to throughout this section are described in Table 5 in the Appendix.
In Tables 1 and 2, we present our classification accuracy results and compare to other defense methods. As can be seen, FGSM black-box attacks were successful at reducing the classifier accuracy by up to 70%. All considered defense mechanisms are relatively successful at diminishing the effect of the attacks. We note that, as expected, the performance of Defense-GAN-Rec and that of Defense-GAN-Orig are very close. In addition, they both perform consistently well across different classifier and substitute model combinations. MagNet also performs in a consistent manner, but achieves lower accuracy than Defense-GAN. Two adversarial training defenses are presented: the first one obtains the adversarial examples assuming the same attack ε = 0.3, and the second assumes a different ε = 0.15. With incorrect knowledge of ε, the performance of adversarial training generally decreases. In addition, the classification performance of this defense method has very large variance across the different architectures. It is worth noting that the adversarial training defense is only fit against FGSM attacks, because the adversarially augmented data, even with a different ε, is generated using the same method as the black-box attack (FGSM). In contrast, Defense-GAN and MagNet are general defense mechanisms which do not assume a specific attack model.
The performances of defenses on the F-MNIST dataset, shown in Table 2, are noticeably lower than on MNIST. This is due to the large ε used in the FGSM attack. Please see Appendix D for qualitative examples showing that ε = 0.3 represents very high noise, which makes F-MNIST images difficult to classify, even by a human.
In addition, the Defense-GAN parameters used in this experiment were kept the same for both Tables, in order to study the effect of dataset complexity, and can be further optimized as investigated in the next section.
Table 1: Classification accuracies of different classifier and substitute model combinations using various defense strategies on the MNIST dataset, under FGSM black-box attacks with ε = 0.3. Defense-GAN has L = 200 and R = 10.
4.1.1 EFFECT OF NUMBER OF GD ITERATIONS L AND RANDOM RESTARTS R
Figure 3 shows the effect of varying the number of GD iterations L as well as the random restarts R used to compute the GAN reconstructions of input images. Across different L and R values, Defense-GAN-Rec and Defense-GAN-Orig have comparable performance. Increasing L has the expected effect of improving performance when no attack is present. Interestingly, with an FGSM attack, the classification performance decreases after a certain L value. With too many GD iterations on the mean squared error (MSE) ||G(z) − (x + δ)||_2^2, some of the adversarial noise components are retained. In the right figure, the effect of varying R is shown to be extremely pronounced. This is due to the non-convex nature of the MSE, and increasing R enables us to sample different local minima.
Table 2: Classification accuracies of different classifier and substitute model combinations using various defense strategies on the F-MNIST dataset, under FGSM black-box attacks with ε = 0.3. Defense-GAN has L = 200 and R = 10.
Figure 3: Classification accuracy of Model F using Defense-GAN on the MNIST dataset, under FGSM black-box attacks with ε = 0.3 and substitute Model E. Left: various numbers of iterations L are used (R = 10). Right: various numbers of random restarts R are used (L = 100).
4.1.2 EFFECT OF ADVERSARIAL NOISE NORM
We now investigate the effect of changing the attack ε in Table 3. As expected, with higher ε, the FGSM attack is more successful, especially on the F-MNIST dataset where the noise norm seems to have a more pronounced effect, with nearly a 37% drop in performance between ε = 0.1 and ε = 0.3. Figure 7 in Appendix D shows adversarial samples as well as their reconstructions with Defense-GAN at different values of ε. We can see that for large ε, the class is difficult to discern, even for the human eye.
Even though it seems that increasing ε is a desirable strategy for the attacker, this increases the likelihood that the adversarial noise is discernible and therefore the attack is detected. It is trivial for the attacker to provide adversarial images at very high ε, and a good measure of an attack’s strength is its ability to affect performance at low ε. In fact, in the next section, we discuss how Defense-GAN can be used to not only diminish the effect of attacks, but to also detect them.
4.1.3 ATTACK DETECTION
Table 3: Classification accuracy of Model F using Defense-GAN (L = 400, R = 10), under FGSM black-box attacks for various noise norms ε and substitute Model E.
Figure 4: ROC curves when using the Defense-GAN MSE for FGSM attack detection on the MNIST dataset (Classifier Model F, Substitute Model E). Left: results for various numbers of GD iterations L are shown with R = 10, ε = 0.30. Middle: results for various numbers of random restarts R are shown with L = 100, ε = 0.30. Right: results for various ε are shown with L = 400, R = 10.
We intuitively expect that clean, unperturbed images will lie closer to the range of the Defense-GAN generator G than adversarial examples. This is due to the fact that G was trained to produce images which resemble the legitimate data. In light of this observation, we propose to use the MSE of an image with its reconstruction from (8) as a “metric” to decide whether or not the image was adversarially manipulated. In other words, for a given threshold θ > 0, the hypothesis test is:
\lVert G(z^{*})-x \rVert_{2}^{2}\ {\gtrless}_{no~attack}^{attack}\ \theta.\tag{9}
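Operationally, the test in (9) amounts to thresholding the reconstruction MSE returned by the projection step; a sketch, again reusing the hypothetical `defense_gan_reconstruct` above (how the threshold θ is chosen, e.g., on validation data, is an assumption):

```python
def detect_attack(G, x, latent_dim, theta, L=200, R=10):
    """Flag inputs whose Defense-GAN reconstruction error exceeds theta (Eq. 9)."""
    rec = defense_gan_reconstruct(G, x, latent_dim, L=L, R=R)
    mse = (rec - x).pow(2).flatten(1).sum(dim=1)   # per-image ||G(z*) - x||_2^2
    return mse > theta                             # True => declared "attack"
```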
We compute the reconstruction MSEs for every image from the test dataset, and its adversarially manipulated version using FGSM. We show the Receiver Operating Characteristic (ROC) curves as well as the Area Under the Curve (AUC) metric for different Defense-GAN parameters and ε values in Figures 4 and 5. The results show that this attack detection strategy is effective, especially when the number of GD iterations L and random restarts R are large. From the left and middle figures, we can conclude that the number of random restarts plays a very important role in the detection false positive and true positive rates, as was discussed in Section 4.1.1. Furthermore, when ε is very small, it becomes difficult to detect attacks at low false positive rates.
4.1.4 RESULTS ON WHITE-BOX ATTACKS
We now present results on white-box attacks using three different strategies: FGSM, RAND+FGSM, and CW. We perform the CW attack for 100 iterations of projected GD, with learning rate 10.0, and use c = 100 in equation (4). Table 4 shows the classification performance of different classifier models across different attack and defense strategies. We note that Defense-GAN significantly outperforms the two other baseline defenses. We even give the adversarial attacker access to the random initializations of z. However, we noticed that the performance does not change much when the attacker does not know the initialization. Adversarial training was done using FGSM to generate the adversarial samples. It is interesting to mention that when CW attack is used, adversarial training performs extremely poorly. As previously discussed, adversarial training does not generalize well against different attack methods.
Due to the loop of L steps of GD, Defense-GAN is resilient to GD-based white-box attacks, since the attacker needs to “un-roll” the GD loop and propagate the gradient of the loss all the way across L steps. In fact, from Table 4, the performance of classifier A with Defense-GAN on the MNIST dataset drops less than 1% from 0.997 to 0.988 under FGSM. In comparison, from Figure 8, when L = 25, the performance of the same network drops to 0.947 (more than 5% drop). This shows that using a larger L significantly increases the robustness of Defense-GAN against GD-based whitebox attacks. This comes at the expense of increased inference time complexity. We present a more detailed discussion about the difficulty of GD-based white-box attacks in Appendix B and time complexity in Appendix G. Additional white-box experimental results on higher-dimensional images are reported in Appendix F.
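To make the difficulty concrete, a gradient-based white-box attack on Defense-GAN has to differentiate through the entire unrolled reconstruction loop. The sketch below illustrates that unrolled gradient; the inner step size is an assumption, a single restart is used, and the memory cost grows with L, which is part of what makes such attacks impractical for large L.

```python
import torch
import torch.nn.functional as F

def unrolled_defense_gradient(G, classifier, x, y, latent_dim, L=25, lr=0.1):
    """Gradient of the classification loss w.r.t. x, back-propagated through L GD steps."""
    x = x.clone().detach().requires_grad_(True)
    z = torch.randn(x.size(0), latent_dim, device=x.device, requires_grad=True)
    for _ in range(L):
        rec_loss = (G(z) - x).pow(2).flatten(1).sum()
        grad_z, = torch.autograd.grad(rec_loss, z, create_graph=True)
        z = z - lr * grad_z                # keep the graph so dz*/dx remains available
    loss = F.cross_entropy(classifier(G(z)), y)
    return torch.autograd.grad(loss, x)[0]  # e.g., its sign gives an FGSM direction
```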
Figure 5: ROC curves when using the Defense-GAN MSE for FGSM attack detection on the F-MNIST dataset (Classifier Model F, Substitute Model E). Left: results for various numbers of GD iterations L are shown with R = 10, ε = 0.30. Middle: results for various numbers of random restarts R are shown with L = 100, ε = 0.30. Right: results for various ε are shown with L = 200, R = 10.
Table 4: Classification accuracies of different classifier models using various defense strategies on the MNIST (top) and F-MNIST (bottom) datasets, under FGSM, RAND+FGSM, and CW white-box attacks. Defense-GAN has L = 200 and R = 10.
5 CONCLUSION
In this paper, we proposed Defense-GAN, a novel defense strategy utilizing GANs to enhance the robustness of classification models against black-box and white-box adversarial attacks. Our method does not assume a particular attack model and was shown to be effective against most commonly considered attack strategies. We empirically show that Defense-GAN consistently provides adequate defense on two benchmark computer vision datasets, whereas other methods had many shortcomings on at least one type of attack.
It is worth mentioning that, although Defense-GAN was shown to be a feasible defense mechanism against adversarial attacks, one might come across practical difficulties while implementing and deploying this method. The success of Defense-GAN relies on the expressiveness and generative power of the GAN. However, training GANs is still a challenging task and an active area of research, and if the GAN is not properly trained and tuned, the performance of Defense-GAN will suffer on both original and adversarial examples. Moreover, the choice of hyper-parameters L and R is also critical to the effectiveness of the defense and it may be challenging to tune them without knowledge of the attack.
ACKNOWLEDGMENT
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
REFERENCES
Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
Dan Hendrycks and Kevin Gimpel. Early methods for detecting adversarial images. International Conference on Learning Representations, Workshop Track, 2017.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 2014.
Maya Kabkab, Pouya Samangouei, and Rama Chellappa. Task-aware compressed sensing with generative models. In AAAI Conference on Artificial Intelligence, 2018.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. International Conference on Learning Representations, Workshop Track, 2017.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. International Conference on Learning Representations, 2017.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision, 2015.
Dongyu Meng and Hao Chen. MagNet: a two-pronged defense against adversarial examples. arXiv preprint arXiv:1705.09064, 2017.
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Nicolas Papernot, Ian Goodfellow, Ryan Sheatsley, Reuben Feinman, and Patrick McDaniel. Cleverhans v1. 0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016a.
Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE Symposium on Security and Privacy, 2016b.
Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016c.
Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, 2016d.
Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, 2017.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. International Conference on Learning Representations, Workshop Track, 2014.
Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Appendices