Generating Adversarial Examples with Adversarial Networks

Original paper link:

https://arxiv.org/pdf/1801.02610.pdf

GB/T 7714 Xiao C, Li B, Zhu J Y, et al. Generating adversarial examples with adversarial networks[J]. arXiv preprint arXiv:1801.02610, 2018.

MLA Xiao, Chaowei, et al. "Generating adversarial examples with adversarial networks." arXiv preprint arXiv:1801.02610 (2018).

APA Xiao, C., Li, B., Zhu, J. Y., He, W., Liu, M., & Song, D. (2018). Generating adversarial examples with adversarial networks. arXiv preprint arXiv:1801.02610.

Abstract

Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples resulting from adding small-magnitude perturbations to inputs. Such adversarial examples can mislead DNNs to produce adversary-selected results. Different attack strategies have been proposed to generate adversarial examples, but how to produce them with high perceptual quality and more efficiently requires more research effort. In this paper, we propose AdvGAN to generate adversarial examples with generative adversarial networks (GANs), which can learn and approximate the distribution of original instances. For AdvGAN, once the generator is trained, it can generate perturbations efficiently for any instance, so as to potentially accelerate adversarial training as a defense. We apply AdvGAN in both semi-whitebox and black-box attack settings. In semi-whitebox attacks, there is no need to access the original target model after the generator is trained, in contrast to traditional white-box attacks. In black-box attacks, we dynamically train a distilled model for the black-box model and optimize the generator accordingly. Adversarial examples generated by AdvGAN on different target models have a high attack success rate under state-of-the-art defenses compared to other attacks. Our attack placed first, with 92.76% accuracy, on a public MNIST black-box attack challenge.

1 Introduction

Deep Neural Networks (DNNs) have achieved great successes in a variety of applications. However, recent work has demonstrated that DNNs are vulnerable to adversarial perturbations [Szegedy et al., 2014; Goodfellow et al., 2015; Hu and Tan, 2017]. An adversary can add small-magnitude perturbations to inputs and generate adversarial examples to mislead DNNs. Such maliciously perturbed instances can cause the learning system to misclassify them into either a maliciously-chosen target class (in a targeted attack) or classes that are different from the ground truth (in an untargeted attack). Different algorithms have been proposed for generating such adversarial examples, such as the fast gradient sign method (FGSM) [Goodfellow et al., 2015] and optimization-based methods (Opt.) [Carlini and Wagner, 2017b; Liu et al., 2017; Xiao et al., 2018; Evtimov et al., 2017].

Most of the current attack algorithms [Carlini and Wagner, 2017b; Liu et al., 2017] rely on optimization schemes with simple pixel-space metrics, such as the L∞ distance from a benign image, to encourage visual realism. To generate perceptually realistic adversarial examples more efficiently, in this paper we propose to train (i) a feed-forward network that generates perturbations to create diverse adversarial examples and (ii) a discriminator network to ensure that the generated examples are realistic. We apply generative adversarial networks (GANs) [Goodfellow et al., 2014] to produce adversarial examples in both the semi-whitebox and black-box settings. As conditional GANs are capable of producing high-quality images [Isola et al., 2017], we apply a similar paradigm to produce perceptually realistic adversarial instances. We name our method AdvGAN.

Note that in previous white-box attacks, such as FGSM and optimization methods, the adversary needs white-box access to the architecture and parameters of the model at all times. However, by deploying AdvGAN, once the feed-forward network is trained, it can instantly produce adversarial perturbations for any input instance without requiring access to the model itself anymore. We name this attack setting semi-whitebox.

To evaluate the effectiveness of our attack strategy AdvGAN, we first generate adversarial instances based on AdvGAN and other attack strategies on different target models. We then apply the state-of-the-art defenses to defend against these generated adversarial examples [Goodfellow et al., 2015; Mądry et al., 2017]. We evaluate these attack strategies in both semi-whitebox and black-box settings. We show that adversarial examples generated by AdvGAN can achieve a high attack success rate, potentially due to the fact that these adversarial instances appear closer to real instances compared to other recent attack strategies.

Our contributions are listed as follows.

  • Different from previous optimization-based methods, we train a conditional adversarial network to directly produce adversarial examples, which are perceptually realistic and achieve state-of-the-art attack success rates against different target models, while the generation process is also more efficient.

  • We show that AdvGAN can attack black-box models by training a distilled model. We propose to dynamically train the distilled model with query information, achieving a high black-box attack success rate and targeted black-box attacks, which are difficult to achieve for transferability-based black-box attacks.

  • We use the state-of-the-art defense methods to defend against adversarial examples and show that AdvGAN achieves much higher attack success rate under current defenses.

  • We apply AdvGAN to Mądry et al.'s MNIST challenge (2017) and achieve 88.93% accuracy on the published robust model in the semi-whitebox setting and 92.76% in the black-box setting, which wins the top position in the challenge.

2 Related Work

Here we review recent work on adversarial examples and generative adversarial networks.

Adversarial Examples A number of attack strategies to generate adversarial examples have been proposed in the white-box setting, where the adversary has full access to the classifier [Szegedy et al., 2014; Goodfellow et al., 2015; Carlini and Wagner, 2017b; Xiao et al., 2018; Hu and Tan, 2017]. Goodfellow et al. propose the fast gradient sign method (FGSM), which applies a first-order approximation of the loss function to construct adversarial samples. Formally, given an instance x, an adversary generates the adversarial example x_{A} = x + η under an L∞ constraint in the untargeted attack setting as η = \epsilon · sign(∇_{x}\ell_{f}(x, y)), where \ell_{f}(·) is the cross-entropy loss used to train the neural network f, and y represents the ground truth of x. Optimization-based methods (Opt.) have also been proposed to optimize adversarial perturbations for targeted attacks while satisfying certain constraints [Carlini and Wagner, 2017b; Liu et al., 2017]. Their goal is to minimize the objective \lVert η \rVert + λ\ell_{f}(x_{A}, y), where \lVert · \rVert is an appropriately chosen norm function. However, the optimization process is slow and can only optimize the perturbation for one specific instance at a time. In contrast, our method uses a feed-forward network to generate an adversarial image, rather than an optimization procedure. Our method achieves a higher attack success rate against different defenses and performs much faster than the current attack algorithms.
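
As a concrete reference for the FGSM formulation above, here is a minimal, PyTorch-style sketch of the one-step untargeted attack. This is an illustration rather than the authors' code; `model`, `x`, and `y` are assumed to be a differentiable classifier, an input batch, and its ground-truth labels, and pixels are assumed to lie in [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm_untargeted(model, x, y, eps=0.3):
    """One-step FGSM: x_A = x + eps * sign(grad_x of the cross-entropy loss)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep the perturbed image in a valid pixel range
```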

Independently from our work, feed-forward networks have been applied to generate adversarial perturbations [Baluja and Fischer, 2017]. However, Baluja and Fischer combine a re-ranking loss and an L2 norm loss, aiming to constrain the generated adversarial instance to be close to the original one in terms of L2; in contrast, we apply a deep neural network as a discriminator to help distinguish the perturbed instance from other real images, encouraging the perceptual quality of the generated adversarial examples. Hu and Tan [Hu and Tan, 2017] also proposed to use GANs to generate adversarial examples. However, they aim to generate adversarial examples for malware, while our work focuses on generating perceptually realistic adversarial examples for images.

Black-box Attacks Current learning systems usually do not allow white-box access to the model for security reasons. Therefore, there is a great need for black-box attack analysis.

Most black-box attack strategies are based on the transferability phenomenon [Papernot et al., 2016], where an adversary first trains a local model and generates adversarial examples against it, hoping the same adversarial examples will also be able to attack the other models. Many learning systems allow query access to the model. However, there is little work that can leverage query-based access to target models to construct adversarial samples and move beyond transferability. Hu and Tan proposed to leverage GANs to construct evasion instances for malware. Papernot et al. proposed to train a local substitute model with queries to the target model to generate adversarial samples, but this strategy still relies on transferability. In contrast, we show that the proposed AdvGAN can perform black-box attacks without depending on transferability.

Generative Adversarial Networks (GANs) GANs [Goodfellow et al., 2014] have achieved visually appealing results in both image generation and image manipulation [Zhu et al., 2016] settings. Recently, image-to-image conditional GANs have further improved the quality of synthesis results [Isola et al., 2017]. We adopt a similar adversarial loss and image-to-image network architecture to learn the mapping from an original image to a perturbed output such that the perturbed image cannot be distinguished from real images in the original class. Different from prior work, we aim to produce output results that are not only visually realistic but also able to mislead target learning models.

3 Generating Adversarial Examples with Adversarial Networks

3.1 Problem Definition

Let X ⊆ \mathbb{R}^{n} be the feature space, with n the number of features. Suppose that (x_{i}, y_{i}) is the ith instance within the training set, which is comprised of feature vectors x_{i} ∈ X generated according to some unknown distribution x_{i} \sim P_{data}, and y_{i} ∈ Y the corresponding true class labels. The learning system aims to learn a classifier f : X → Y from the domain X to the set of classification outputs Y, where |Y| denotes the number of possible classification outputs. Given an instance x, the goal of an adversary is to generate an adversarial example x_{A}, which is classified as f(x_{A}) \neq y (untargeted attack), where y denotes the true label, or as f(x_{A}) = t (targeted attack), where t is the target class. x_{A} should also be close to the original instance x in terms of L_{2} or another distance metric.
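
To make the two attack goals concrete, the following small helper (a hypothetical sketch, assuming the classifier returns logits) checks the untargeted condition f(x_{A}) ≠ y and the targeted condition f(x_{A}) = t.

```python
import torch

def attack_succeeded(logits_adv, y_true, target=None):
    """logits_adv: classifier outputs f(x_A) with shape (batch, |Y|); y_true: ground-truth labels.
    Untargeted: success when f(x_A) != y. Targeted: success when f(x_A) == t."""
    pred = logits_adv.argmax(dim=1)
    if target is None:
        return pred != y_true                       # untargeted attack
    return pred == torch.full_like(y_true, target)  # targeted attack
```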

3.2 AdvGAN Framework

Figure 1 illustrates the overall architecture of AdvGAN, which mainly consists of three parts: a generator G, a discriminator D, and the target neural network f. Here the generator G takes the original instance x as its input and generates a perturbation G(x). Then x + G(x) is sent to the discriminator D, which is used to distinguish the generated data from the original instance x. The goal of D is to encourage that the generated instance is indistinguishable from the data of its original class. To fulfill the goal of fooling a learning model, we first perform the white-box attack, where the target model is f in this case. f takes x + G(x) as its input and outputs its loss L_{adv}^{f}, which represents the distance between the prediction and the target class t (targeted attack), or the opposite of the distance between the prediction and the ground-truth class (untargeted attack).

The adversarial loss [Goodfellow et al., 2014] can be written as:

L_{GAN}=\mathbb{E}_{x}\log D(x)+\mathbb{E}_{x}\log(1-D(x+G(x)))\tag{1}

Here, the discriminator D aims to distinguish the perturbed data x+G(x) from the original data x. Note that the real data is sampled from the true class, so as to encourage that the generated instances are close to data from the original class.

The loss for fooling the target model f in a targeted attack is:

L_{adv}^{f}=\mathbb{E}_{x}\ell_{f}(x+G(x),t),\tag{2}

where t is the target class and \ell_{f} denotes the loss function (e.g., cross-entropy loss) used to train the original model f. The L_{adv}^{f} loss encourages the perturbed image to be misclassified as the target class t. Here we can also perform the untargeted attack by maximizing the distance between the prediction and the ground truth, but we will focus on the targeted attack in the rest of the paper.

To bound the magnitude of the perturbation, which is a common practice in prior work [Carlini and Wagner, 2017b; Liu et al., 2017], we add a soft hinge loss on the L2 norm as

L_{hinge}=\mathbb{E}_{x}\max(0,\lVert G(x) \rVert_{2}-c),\tag{3}

where c denotes a user-specified bound. This can also stabilize the GAN’s training, as shown in Isola et al. (2017). Finally, our full objective can be expressed as

L=L_{adv}^{f}+\alpha L_{GAN}+\beta L_{hinge},\tag{4}

where α and β control the relative importance of each objective. Note that L_{GAN} here is used to encourage the perturbed data to appear similar to the original data x, while L_{adv}^{f} is leveraged to generate adversarial examples, optimizing for a high attack success rate. We obtain our G and D by solving the min-max game \arg\min_{G}\max_{D} L. Once G is trained on the training data and the target model, it can produce perturbations for any input instance to perform a semi-whitebox attack.
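
The objective in Equation (4) can be illustrated with a short PyTorch-style training step. This is only a sketch under common GAN-training assumptions (D outputs a probability in (0, 1), the targeted variant of L_{adv}^{f} uses cross-entropy as in Equation (2)); the network definitions, optimizers, weights α and β, and the bound c are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def advgan_step(G, D, f, x, target_class, opt_G, opt_D, alpha=1.0, beta=1.0, c=0.3):
    """One AdvGAN training step for the objective L = L_adv^f + alpha*L_GAN + beta*L_hinge."""
    perturbation = G(x)
    x_adv = x + perturbation            # pixel clipping omitted for clarity

    # Discriminator step: distinguish real x from perturbed x + G(x) (Eq. (1)).
    d_real, d_fake = D(x), D(x_adv.detach())
    loss_D = -(torch.log(d_real + 1e-8).mean() + torch.log(1.0 - d_fake + 1e-8).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: L_adv^f (Eq. (2)) + alpha * L_GAN + beta * L_hinge (Eq. (3)).
    t = torch.full((x.size(0),), target_class, dtype=torch.long, device=x.device)
    loss_adv = F.cross_entropy(f(x_adv), t)                       # targeted attack loss
    loss_gan = torch.log(1.0 - D(x_adv) + 1e-8).mean()            # G minimizes the GAN loss
    loss_hinge = torch.clamp(perturbation.flatten(1).norm(p=2, dim=1) - c, min=0.0).mean()
    loss_G = loss_adv + alpha * loss_gan + beta * loss_hinge
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```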

3.3 Black-box Attacks with Adversarial Networks

Static Distillation For black-box attacks, we assume adversaries have no prior knowledge of the training data or the model itself. In our experiments in Section 4, we therefore randomly draw data that is disjoint from the training data of the black-box model to distill it. To achieve black-box attacks, we first build a distilled network f based on the output of the black-box model b [Hinton et al., 2015]. Once we obtain the distilled network f, we carry out the same attack strategy as described in the white-box setting (see Equation (4)). Here, we minimize the following network distillation objective:

\arg\min_{f}\mathbb{E}_{x} H(f(x),b(x)),\tag{5}

where f(x) and b(x) denote the output from the distilled model and black-box model respectively for the given training image x, and H denotes the commonly used cross-entropy loss. By optimizing the objective over all the training images, we can obtain a model f which behaves very close to the black-box model b. We then carry out the attack on the distilled network.
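
A minimal sketch of one static-distillation update for Equation (5) is shown below, assuming a hypothetical `blackbox_query` function that returns the black-box model's class probabilities b(x) and a distilled model `f_distilled` that returns logits; this is an illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def distill_step(f_distilled, blackbox_query, x, opt):
    """Minimize H(f(x), b(x)): cross-entropy between distilled-model and black-box outputs."""
    with torch.no_grad():
        b_probs = blackbox_query(x)                  # black-box class probabilities b(x)
    log_p = F.log_softmax(f_distilled(x), dim=1)     # distilled-model predictions f(x)
    loss = -(b_probs * log_p).sum(dim=1).mean()      # soft-label cross-entropy H
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```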

Note that unlike training the discriminator D, where we only use the real data from the original class to encourage that the generated instance is close to its original class, here we train the distilled model with data from all classes.

Dynamic Distillation Training the distilled model only with the pristine training data is not enough, since it is unclear how closely the black-box and distilled models behave on the generated adversarial examples, which have not appeared in the training set before. Here we propose an alternative minimization approach to dynamically make queries and train the distilled model f and our generator G jointly. We perform the following two steps in each iteration i:

  1. Update G_{i} given a fixed network f_{i−1}: We follow the white-box setting (see Equation (4)) and train the generator and discriminator based on the previously distilled model f_{i−1}. We initialize the weights of G_{i} as G_{i-1}. G_{i}, D_{i} = \arg\min_{G}\max_{D} L_{adv}^{f_{i−1}} + αL_{GAN} + βL_{hinge}

  2. Update f_{i} given a fixed generator G_{i}: First, we use f_{i−1} to initialize f_{i}. Then, given the adversarial examples x + G_{i}(x) generated by G_{i}, the distilled model f_{i} will be updated based on the set of new query results for the generated adversarial examples against the black-box model, as well as the original training images: f_{i} = \arg\min_{f} \mathbb{E}_{x}H(f(x), b(x)) + \mathbb{E}_{x}H(f(x + G_{i}(x)), b(x + G_{i}(x))), where we use both the original images x and the newly generated adversarial examples x + G_{i}(x) to update f.

In the experiment section, we compare the performance of both the static and dynamic distillation approaches and observe that simultaneously updating G and f produces higher attack performance. See Table 2 for more details.
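
Putting the two steps together, a hypothetical outer loop for dynamic distillation could look like the following, reusing the `advgan_step` and `distill_step` sketches above; all names, the data loader, and the number of iterations are illustrative assumptions rather than the authors' code.

```python
import torch

def dynamic_distillation(G, D, f, blackbox_query, loader, opt_G, opt_D, opt_f,
                         target_class, n_iters=10):
    """Alternate between (1) training G_i, D_i against the fixed distilled model f_{i-1}
    and (2) updating f_i with black-box queries on clean and adversarial images."""
    for _ in range(n_iters):
        for x, _ in loader:                                  # step 1: update G_i, D_i
            advgan_step(G, D, f, x, target_class, opt_G, opt_D)
        for x, _ in loader:                                  # step 2: update f_i
            with torch.no_grad():
                x_adv = x + G(x)                             # x + G_i(x)
            distill_step(f, blackbox_query, x, opt_f)        # H(f(x), b(x))
            distill_step(f, blackbox_query, x_adv, opt_f)    # H(f(x+G_i(x)), b(x+G_i(x)))
```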

Table 1: Comparison with the state-of-the-art attack methods. Run time is measured for generating 1,000 adversarial instances during test time. Opt. represents the optimization-based method, and Trans. denotes black-box attacks based on transferability.

Table 2: Accuracy of different models on pristine data, and the attack success rate of adversarial examples generated against different models by AdvGAN on MNIST and CIFAR-10. p: pristine test data; w: semi-whitebox attack; b-D: black-box attack with dynamic distillation strategy; b-S: black-box attack with static distillation strategy.

4 Experimental Results

In this section, we first evaluate AdvGAN in both the semi-whitebox and black-box settings on MNIST [LeCun and Cortes, 1998] and CIFAR-10 [Krizhevsky and Hinton, 2009]. We also perform a semi-whitebox attack on the ImageNet dataset [Deng et al., 2009]. We then apply AdvGAN to generate adversarial examples on different target models and test their attack success rate under the state-of-the-art defenses, showing that our method can achieve higher attack success rates compared to other existing attack strategies. For a fair comparison, we generate all adversarial examples for the different attack methods under an L_{∞} bound of 0.3 on MNIST and 8 on CIFAR-10. In general, as shown in Table 1, AdvGAN has several advantages over other white-box and black-box attacks. For instance, regarding computation efficiency, AdvGAN performs much faster than the others, even including the efficient FGSM, although AdvGAN needs extra training time to train the generator. All of these strategies can perform targeted attacks except the transferability-based attack, although an ensemble strategy can help to improve it. Besides, FGSM and optimization methods can only perform white-box attacks, while AdvGAN is able to attack in the semi-whitebox setting.

Implementation Details We adopt similar architectures for the generator and discriminator as the image-to-image translation literature [Isola et al., 2017; Zhu et al., 2017]. We apply the loss in Carlini and Wagner (2017b) as our loss L_{adv}^{f} = \max(\max_{i\neq t} f(x_{A})_{i} − f(x_{A})_{t}, κ), where t is the target class, and f represents the target network in the semi-whitebox setting and the distilled model in the black-box setting. We set the confidence κ = 0 for both Opt. and AdvGAN. We use a batch size of 128 and a learning rate of 0.001. For GAN training, we use the least-squares objective proposed by LSGAN [Mao et al., 2017], as it has been shown to produce better results with more stable training.
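
For reference, the margin loss above (used as L_{adv}^{f} in the experiments) can be written as a short function. This is a generic sketch assuming `logits` are the classifier outputs f(x_{A}) and `target` holds the target labels t.

```python
import torch

def cw_margin_loss(logits, target, kappa=0.0):
    """max(max_{i != t} f(x_A)_i - f(x_A)_t, kappa); minimizing it pushes the
    target-class logit above all other logits."""
    t_logit = logits.gather(1, target.view(-1, 1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, target.view(-1, 1), float('-inf'))  # exclude the target class
    max_other = others.max(dim=1).values
    return torch.clamp(max_other - t_logit, min=kappa).mean()
```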

Figure 2: Adversarial examples generated from the same original image to different targets by AdvGAN on MNIST. Row 1: semi-whitebox attack; Row 2: black-box attack. Left to right: models A, B, and C. On the diagonal, the original images are shown, and the numbers on the top denote the targets.

Models Used in the Experiments For MNIST, we generate adversarial examples for three models, where models A and B are used in Tramèr et al. (2017). Model C is the target network architecture used in Carlini and Wagner (2017b). For CIFAR-10, we select ResNet-32 and Wide ResNet-34 [He et al., 2016; Zagoruyko and Komodakis, 2016]. Specifically, we use a 32-layer ResNet implemented in TensorFlow and a Wide ResNet derived from the "w32-10 wide" variant. We show the classification accuracy on pristine MNIST and CIFAR-10 test data (p) and the attack success rate of adversarial examples generated by AdvGAN on different models in Table 2.

4.1 AdvGAN in Semi-whitebox Setting

We evaluate AdvGAN on f with different architectures for MNIST and CIFAR-10. We first apply AdvGAN to perform semi-whitebox attacks against different models on the MNIST dataset. From the performance of the semi-whitebox attack (Attack Rate (w)) in Table 2, we can see that AdvGAN is able to generate adversarial instances that attack all models with a high attack success rate.

We also generate adversarial examples from the same original instance x, targeting other different classes, as shown in Figure 2. In the semi-whitebox setting on MNIST (a)-(c), we can see that the generated adversarial examples for different models appear close to the ground-truth/pristine images (lying on the diagonal of the matrix).

In addition, we analyze the attack success rate based on different loss functions on MNIST. Under the same bounded perturbations (0.3), if we replace the full loss function in (4) with L = \lVert G(x) \rVert_{2} + L_{adv}^{f}, which is similar to the objective used in Baluja and Fischer, the attack success rate becomes 86.2%. If we replace the loss function with L = L_{hinge} + L_{adv}^{f}, the attack success rate is 91.1%, compared to 98.3% for AdvGAN.

Figure 3: Adversarial examples generated by AdvGAN on CIFAR-10 for (a) semi-whitebox attack and (b) black-box attack. Images from each class are perturbed to target other different classes. On the diagonal, the original images are shown.

Similarly, on CIFAR-10, we apply the same AdvGAN-based semi-whitebox attack to ResNet and Wide ResNet, and Figure 3 (a) shows some adversarial examples, which are perceptually realistic.

We show adversarial examples for the same original instance targeting different other classes. It is clear that with different targets, the adversarial examples keep similar visual quality compared to the pristine instances on the diagonal.

4.2 AdvGAN in Black-box Setting

Our black-box attack here is based on the dynamic distillation strategy. We construct a local model to distill the black-box model, and we select the architecture of Model C as our local model. Note that we randomly select a subset of instances disjoint from the training data of AdvGAN to train the local model; that is, we assume the adversaries do not have any prior knowledge of the training data or the model itself. With the dynamic distillation strategy, the adversarial examples generated by AdvGAN achieve attack success rates above 90% on MNIST and 80% on CIFAR-10, compared to 30% and 10% with the static distillation approach, as shown in Table 2.

We apply AdvGAN to generate adversarial examples for the same instance targeting different classes on MNIST and randomly select some instances to show in Figure 2 (d)-(f). By comparing with the pristine instances on the diagonal, we can see that these adversarial instances achieve perceptual quality comparable to the original digits. Specifically, the original digit is somewhat highlighted by the adversarial perturbations, which implies a type of perceptually realistic manipulation. Figure 3 (b) shows similar results for adversarial examples generated on CIFAR-10. These adversarial instances appear photo-realistic compared with the original ones on the diagonal.

4.3 Attack Effectiveness Under Defenses

Various defenses have been proposed against different types of attack strategies. Among them, different types of adversarial training methods are the most effective. Other categories of defenses, such as those that pre-process an input, have mostly been defeated by adaptive attacks [He et al., 2017; Carlini and Wagner, 2017a]. Goodfellow et al. first propose adversarial training as an effective way to improve the robustness of DNNs, and Tramèr et al. extend it to ensemble adversarial learning. Mądry et al. have also proposed robust networks against adversarial examples based on well-defined adversaries. Given that AdvGAN strives to generate adversarial instances from the underlying true data distribution, it can essentially produce more photo-realistic adversarial perturbations than other attack strategies. Thus, AdvGAN could have a higher chance of producing adversarial examples that are resilient under different defense methods. In this section, we quantitatively evaluate this property of AdvGAN compared with other attack strategies.

Threat Model As shown in the literature, most of the current defense strategies are not robust when attacks are constructed against them [Carlini and Wagner, 2017b; He et al., 2017]. Here we consider a weaker threat model, where the adversary is not aware of the defenses and directly tries to attack the original learning model; this is also the first threat model analyzed in Carlini and Wagner. In this case, if an adversary can still successfully attack the model, it implies the robustness of the attack strategy. Under this setting, we first apply different attack methods to generate adversarial examples based on the original model without being aware of any defense. Then we apply different defenses to directly defend against these adversarial instances.

Semi-whitebox Attack First, we consider the semi-whitebox attack setting, where the adversary has white-box access to the model architecture as well as the parameters. Here, we replace f in Figure 1 with our models A, B, and C, respectively. As a result, adversarial examples are generated against different models. We use three adversarial training defenses to train different models for each model architecture: standard FGSM adversarial training (Adv.) [Goodfellow et al., 2015], ensemble adversarial training (Ens.) [Tramèr et al., 2017], and iterative training (Iter. Adv.) [Mądry et al., 2017]. We evaluate the effectiveness of these attacks against the defended models. In Table 3, we show that the attack success rate of adversarial examples generated by AdvGAN on different models is higher than those of FGSM and Opt. [Carlini and Wagner, 2017b].

Black-box Attack For AdvGAN, we use model B as the black-box model and train a distilled model to perform a black-box attack against model B; the attack success rate is reported in Table 4. For the black-box attack comparison, the transferability-based attack is applied for FGSM and Opt.: we use FGSM and Opt. to attack model A on MNIST, then use these adversarial examples to test model B and report the corresponding classification accuracy. We can see that the adversarial examples generated by the black-box AdvGAN consistently achieve a much higher attack success rate than the other attack methods.

For CIFAR-10, we use a ResNet as the black-box model and train a distilled model to perform black-box attack against the ResNet. To evaluate black-box attack for optimization method and FGSM, we use adversarial examples generated by attacking Wide ResNet and test them on ResNet to report black-box attack results for these two methods.

Table 3: Attack success rate of adversarial examples generated by AdvGAN in semi-whitebox setting, and other white-box attacks under defenses on MNIST and CIFAR-10.

Table 4: Attack success rate of adversarial examples generated by different black-box adversarial strategies under defenses on MNIST and CIFAR-10

Table 5: Accuracy of the MadryLab public model under different attacks in white-box setting. AdvGAN here achieved the best performance.

In addition, we apply AdvGAN to the MNIST challenge. Among all the standard attacks shown in Table 5, AdvGAN achieves 88.93% in the white-box setting.

Among reported black-box attacks, AdvGAN achieved an accuracy of 92.76%, outperforming all other state-of-the-art attack strategies submitted to the challenge.

4.4 High Resolution Adversarial Examples

To evaluate AdvGAN's ability to generate high-resolution adversarial examples, we attack Inception_v3 and quantify the attack success rate and perceptual realism of the generated adversarial examples.

Experiment settings. In the following experiments, we select 100 benign images from the DEV dataset of the NIPS 2017 adversarial attack competition [Kurakin et al., 2018]. This competition provided a dataset compatible with ImageNet. We generate adversarial examples (299×299 pixels), each targeting a random incorrect class, with L∞ bounded within 0.01 for Inception_v3. The attack success rate is 100%.
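
For completeness, a small helper that enforces such an L∞ bound (0.3 on MNIST, 8 on CIFAR-10, 0.01 here) on a generated example is sketched below; the [0, 1] pixel range is an assumption, and this is a generic utility rather than the authors' code.

```python
import torch

def project_linf(x_adv, x, eps):
    """Project x_adv into the L-infinity ball of radius eps around x, then into [0, 1]."""
    x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)
    return x_adv.clamp(0.0, 1.0)
```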

In Figure 4, we show some randomly selected examples of original and adversarial examples generated by AdvGAN.

Human Perceptual Study. We validate the realism of AdvGAN’s adversarial examples with a user study on Amazon Mechanical Turk (AMT). We use 100 pairs of original images and adversarial examples (generated as described above) and ask workers to choose which image of a pair is more visually realistic.

Our study follows a protocol from Isola et al., where a worker is shown a pair of images for 2 seconds, then the worker has unlimited time to decide. We limit each worker to at most 20 of these tasks. We collected 500 choices, about 5 per pair of images, from 50 workers on AMT. The AdvGAN examples were chosen as more realistic than the original image in 49.4% ± 1.96% of the tasks (random guessing would result in about 50%). This result shows that these high-resolution AdvGAN adversarial examples are about as realistic as benign images.

5 Conclusion

In this paper, we propose AdvGAN to generate adversarial examples using generative adversarial networks (GANs). In our AdvGAN framework, once trained, the feed-forward generator can produce adversarial perturbations efficiently. It can also perform both semi-whitebox and black-box attacks with high attack success rate. In addition, when we apply AdvGAN to generate adversarial instances on different models without knowledge of the defenses in place, the generated adversarial examples can preserve high perceptual quality and attack the state-of-the-art defenses with higher attack success rate than examples generated by the competing methods. This property makes AdvGAN a promising candidate for improving adversarial training defense methods.

Acknowledgments

We thank Weiwei Hu for his valuable discussions on this work. This work was supported in part by Berkeley Deep Drive, JD.COM, the Center for Long-Term Cybersecurity, and FORCES (Foundations Of Resilient CybEr-Physical Systems), which receives support from the National Science Foundation (NSF award numbers CNS-1238959, CNS1238962, CNS-1239054, CNS-1239166, CNS-1422211 and CNS-1616575).

References

[Baluja and Fischer, 2017] Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.

[Carlini and Wagner, 2017a] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.

[Carlini and Wagner, 2017b] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.

[Evtimov et al., 2017] Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on machine learning models. arXiv preprint arXiv:1707.08945, 2017.

[Goodfellow et al., 2014] Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[Goodfellow et al., 2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[He et al., 2017] Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defenses: Ensembles of weak defenses are not strong. arXiv preprint arXiv:1706.04701, 2017.

[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[Hu and Tan, 2017] Weiwei Hu and Ying Tan. Generating adversarial malware examples for black-box attacks based on GAN. arXiv preprint arXiv:1702.05983, 2017.

[Isola et al., 2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.

[Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[Kurakin et al., 2018] Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, et al. Adversarial attacks and defences competition. arXiv preprint arXiv:1804.00097, 2018.

[LeCun and Cortes, 1998] Yann LeCun and Corrina Cortes. The MNIST database of handwritten digits. 1998.

[Liu et al., 2017] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In ICLR, 2017.

[Mao et al., 2017] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813–2821. IEEE, 2017.

[Mądry et al., 2017] Aleksander Mądry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083 [cs, stat], June 2017.

[Papernot et al., 2016] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint, 2016.

[Szegedy et al., 2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

[Tramèr et al., 2017] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.

[Xiao et al., 2018] Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song. Spatially transformed adversarial examples. arXiv preprint arXiv:1801.02612, 2018.

[Zagoruyko and Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[Zhu et al., 2016] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In ECCV, pages 597–613. Springer, 2016.

[Zhu et al., 2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV, pages 2242–2251, 2017.
