用GAN消除对抗扰动

APE-GAN: Adversarial Perturbation Elimination with GAN

原文连接：

https://arxiv.org/pdf/1707.05474.pdf

GB/T 7714 Shen S, Jin G, Gao K, et al. Ape-gan: Adversarial perturbation elimination with gan[J]. arXiv preprint arXiv:1707.05474, 2017.

MLA Shen, Shiwei, et al. "Ape-gan: Adversarial perturbation elimination with gan." arXiv preprint arXiv:1707.05474 (2017).

APA Shen, S., Jin, G., Gao, K., & Zhang, Y. (2017). Ape-gan: Adversarial perturbation elimination with gan. arXiv preprint arXiv:1707.05474.

Abstract

摘要

Although neural networks could achieve state-of-the-art performance while recongnizing images, they often suffer a tremendous defeat from adversarial examples–inputs generated by utilizing imperceptible but intentional perturbation to clean samples from the datasets. How to defense against adversarial examples is an important problem which is well worth researching. So far, very few methods have provided a significant defense to adversarial examples. In this paper, a novel idea is proposed and an effective framework based Generative Adversarial Nets named APE-GAN is implemented to defense against the adversarial examples. The experimental results on three benchmark datasets including MNIST, CIFAR10 and ImageNet indicate that APE-GAN is effective to resist adversarial examples generated from five attacks.

尽管神经网络在识别图像时能够达到最先进的性能，但它们经常会在使用难以察觉但有意的扰动从数据集中清除样本而产生的敌对示例输入中遭受巨大的失败。如何防范对抗性实例是一个非常值得研究的重要问题。到目前为止，很少有方法能对敌对的例子提供有效的辩护。本文提出了一种新的生成式对抗网络的思想，并实现了一种有效的框架——APE-GAN来对对抗实例进行防御。在MNIST、CIFAR10和ImageNet三个基准数据集上的实验结果表明，APE-GAN能够有效地抵抗5种攻击产生的敌对示例。

1 Introduction

1介绍

Deep neural networks have recently achieved excellent performance on a variety of visual and speech recognition tasks. However, they have non-intuitive characterisitics and intrinsic blind spots that are easy to attack using obscure manipulation of their input[6, 11, 19, 26]. In many cases, the structure of the neural networks is strongly related to the training data distribution, which is in contradiction with the network’s ability to achieve high generalization performance.

深度神经网络最近在各种视觉和语音识别任务上取得了优异的表现。然而，它们具有非直观的特性和固有的盲点，易于通过对其输入的模糊操作进行攻击[6,11,19,26]。在很多情况下，神经网络的结构与训练数据的分布密切相关，这与网络实现高泛化性能的能力存在矛盾。

Szegedy et al. [26] first noticed that imperceptible perturbation of test samples can be misclassified by neural networks. They term this kind of subtly perturbed samples “adversarial examples”. In contrast to noise samples, adversarial examples are imperceptible, designed intentionally, and more likely to cause false predictions in the image classification domain. What is more serious is that adversarial examples transfer across models named transferability, which can be leveraged to perform black-box attacks[14, 19]. In other words, an adversary can find the adversarial examples generated from substitute model trained by the adversary and apply it to attack the target model. Howerver, so far, transferability is mostly appears on small datasets, such as MNIST and CIFAR10. Transferability over large scale datasets, such as ImageNet, has yet to be better understood. Therefore, in this paper, black-box attacks is not taken into consideration, and resisting the white-box attacks (the adversary has complete access to the target model including the architecture and all paramaters) is the core work of the paper.

Szegedy等人[26]首先注意到，测试样本的微小扰动可能被神经网络误分类。他们把这种微妙的扰动样本称为对抗例子。与噪声样本相比，对抗性样本不易察觉、有意设计，更容易在图像分类领域造成错误的预测。更严重的是，被称为可转移性的模型之间的对抗性例子转移，可以被用来执行黑箱攻击[14,19]。也就是说，敌手可以找到敌手训练的替代模型生成的对抗性实例，并将其用于攻击目标模型。然而，到目前为止，可转移性大多出现在小型数据集上，如MNIST和CIFAR10。大规模数据集(如ImageNet)上的可转让性还有待更好地理解。因此，本文不考虑黑盒攻击，抵抗白盒攻击(对手可以完全访问目标模型，包括架构和所有参数)是本文的核心工作。

Adversarial examples pose potential security threats for practical machine learning applications. Recent research has shown that a large fraction of adversarial examples are classified incorrectly even when obtained from the cellphone camera[11]. This makes it possible that an adversary crafts adversarial images of traffic signs to cause the self-driving cars to take unwanted actions[20]. Therefore, the research of resisting adversarial examples is very significant and urgent.

相反的例子对实际的机器学习应用程序构成了潜在的安全威胁。最近的研究表明，即使是从手机摄像头[11]获取的对抗性例子，也有很大一部分分类错误。这使得对手可能制造出敌对的交通标志图像，从而导致自动驾驶汽车采取不必要的动作[20]。因此，对对抗案例的研究具有十分重要和迫切的意义。

Until now, there are two classes of approaches to defend against adversarial examples. The straightforward way is to make the model inherently more robust with enhanced training data as show in Figure 2 or adjusted learning strategies. Adversarial training [6, 27] or defensive distillation [18, 22] belongs to this class. It is noteworthy that the original defensive distillation is broken by Carlini and Wagner’s attack [2], however, which can be resisted by the extending defensive distillation. In addition, the faultiness original adversarial training remains highly vulnerable to transferred adversarial examples crafted on other models which is discussed in the ensemble adversarial training, and the model trained using ensemble adversarial training are slightly less robust to some white-box attacks. The second is a series of detection mechanisms used to detect and reject an adversarial sample[4, 15]. Unfortunately, Carlini et al. [1] indicate that adversarial examples generated from Carlini and Wagner’s attack are significantly harder to detect than previously appreciated via bypassing ten detection methods. Therefore, defensing against adversarial examples is still a huge challenge.

到目前为止，有两类方法来抵御敌对的例子。直接的方法是通过增强的训练数据(如图2所示)或调整学习策略，使模型本质上更加健壮。对抗性训练[6,27]或防御蒸馏[18,22]属于这一类。值得注意的是，最初的防御蒸馏被Carlini和Wagner的攻击[2]所打破，但可以通过扩展防御蒸馏来抵抗。此外，原有对抗训练的不完善性在集成对抗训练中讨论的其他模型上的转移对抗实例时仍然很脆弱，而使用集成对抗训练的模型对某些白盒攻击的鲁棒性稍差。第二种是一系列检测机制，用于检测和拒收敌对样本[4,15]。不幸的是，Carlini等人[1]指出，从Carlini和Wagner s攻击中生成的对抗例子明显比以前通过10种检测方法更难检测。因此，防范敌对的例子仍然是一个巨大的挑战。

Misclassification of the adversarial examples is mainly due to the intentionally imperceptible perturbations to some pixels of the input images. Thus, we propose an algorithm to eliminate the adversarial perturbation of input data to defense against the adversarial examples. The adversarial perturbation elimination of the samples can be defined as the problem of learning a manifold mapping from adversarial examples to original examples. Generative Adversarial Net (GAN) proposed by Goodfellow et al[5] is able to generate images similar to training set with a random noise. Therefore, we designed an framework utilizing GAN to generate clean examples from adversarial examples. Meanwhile, SRGAN[13], the successful application of GAN on super-resolution issues, provides a valuable experience to the implementation of our algorithm.

对抗性实例的误分类主要是由于输入图像中某些像素点的故意微扰所致。因此，我们提出一种消除输入数据的对抗性扰动的算法来对抗对抗性实例。样本的对抗性扰动消除可以定义为学习对抗性实例到原始实例的流形映射问题。Goodfellow等[5]提出的生成对战网(GAN)能够生成带有随机噪声的与训练集相似的图像。因此，我们设计了一个利用GAN从敌对的例子中生成干净的例子的框架。同时，GAN[13]作为GAN在超分辨率问题上的成功应用，为我们的算法的实现提供了宝贵的经验。

In this paper, an effective framework is implementated to eliminate the aggressivity of adversarial examples before being recognized, as shown in Figure 2. The code and trained models of the framework are available on https://github.com/ shenqixiaojiang/APE-GAN, so we welcome new attacks to break our defense.

在本文中，实现了一个有效的框架来消除敌对例子在被识别前的攻击性，如图2。框架的代码和训练模型可以在https://github.com/ shenqixiaojiang/APE-GAN上找到，所以我们欢迎新的攻击来打破我们的防御。

This paper makes the following contributions:

本文的贡献如下:

A new perspective of defending against adversarial examples is proposed. The idea is to first eliminate the adversarial perturbation using a trained network and then feed the processed example to classification networks.
- 提出了防范对抗性实例的新视角。其思想是首先用一个训练过的网络消除对抗性的扰动，然后将处理过的例子反馈给分类网络。
An effective and reasonable framework based on the above idea is implemented to resist adversarial examples. The experimental results on three benchmark datasets demonstrate the effectiveness.
- 在此基础上实现了一个有效、合理的框架，以避免实例的冲突。在三个基准数据集上的实验结果验证了该方法的有效性。
The proposed framework possesses strong applicability. It can tackle adversarial examples without knowing what target model they are constructed upon.
- 该框架具有很强的适用性。它可以处理敌对的例子，而不知道它们是建立在什么目标模型上。
The training procedure of the APE-GAN needs no knowledge of the architecture and parameters of the target model.
- ape- gan的训练过程不需要了解目标模型的结构和参数。

2相关工作

In this section, methods of generating adversarial examples that are closely related to this work is briefly reviewed. In addition, GAN and its connection to our method will be discussed.

在这一节中，我们将简要回顾与本工作密切相关的产生对抗例子的方法。此外，还将讨论GAN及其与我们方法的连接。

In the remaining of the paper we use the following notation and terminology:

在本文的其余部分，我们使用以下符号和术语:

X - the clean image from the datasets
- X-数据集中的干净图像
$X^{adv}$ -the adversarial image.
- $X^{adv}$ -敌对的图像。
$y_{true}$ -the true class for the input X.
- $y_{true}$ -输入X的真实类。
$y_{fool}$ -the class label different from ytrue for the input X.
- $y_{fool}$ -对于输入X，类标签与ytrue不同。
f- the classifier mapping from input image to a discrete label set.
- f-分类器从输入图像映射到一个离散的标签集。

图2:(a)传统的深度学习框架对干净的图像具有鲁棒性，但对敌对的例子非常脆弱。(b)现有的对抗性训练框架可以增强训练数据，提高目标模型的鲁棒性。(c)我们提出一个名为APE-GAN的对抗性扰动消除框架，在将其输入目标模型之前消除对抗性示例的扰动，以增加鲁棒性。

$J(X,y)$ -the cost function used to train the model given image X and class y.
- $J(X,y)$ -代价函数用于训练给定图像X和类y的模型。
$\epsilon$ -the size of worst-case perturbations, $\epsilon$ is the upper bound of the L∞ norm of the perturbation.
- $\epsilon$ 最坏情况扰动的大小， $\epsilon$ 是扰动的L∞范数的上界。
$Clip_{X,\epsilon}{A}$ -an $\epsilon$ -neighbourhood clipping of A to the range $[X_{i,j}-\epsilon,X_{i,j}+\epsilon]$
- $Clip_{X,\epsilon}{A}$ - $\epsilon$ A到范围的邻域剪裁 $[X_{i,j}-\epsilon,X_{i,j}+\epsilon]$ .
Non-targeted adversarial attack - the goal of it is to slightly modify clean image that it will be classified incorrectly by classifier.
- 非目标对抗性攻击-目标是对清晰的图像进行轻微的修改，使其被分类器错误地分类。
Targeted adversarial attack - the goal of it is to slightly modify source image that it will be classified as specified target class by classifier.
- 目标对抗性攻击-目标是对源图像进行轻微的修改，使其被分类器分类为指定的目标类。

2.1 Methods Generating Adversarial Examples

2.1产生对抗性实例的方法

In this subsection, six approaches we utilized to generate adversarial images are provided with a brief description.

在本节中，我们使用六种方法来产生敌对的图像，并提供一个简短的描述。

2.1.1 L-BFGS Attack

2.1.1 L-BFGS攻击

Minimum distortion generation function D where $f(D(x)) \neq f(x)$ defined as an optimization problem can be solved by a box-constrained L-BFGS to craft adversarial perturbation under L2 distance metric [26].

最小畸变产生函数D，其中 $f(D(x)) \neq f(x)$ 定义为优化问题，可通过盒约束的L-BFGS求解，在L2距离度量[26]下制造对抗扰动。

The optimization problem can be formulized as:

优化问题可以公式化为:

The constant λ > 0 controls the trade-off between the perturbation’s amplitude and its attack power which can be found with line search.

恒定的扰动> 0控制了扰动振幅与攻击能力之间的平衡，通过线性搜索可以得到扰动的攻击能力。

2.1.2 Fast Gradient Sign Method Attack (FGSM)

2.1.2快速梯度符号攻击(FGSM)

Goodfellow et al.[6] proposed this method to generate adversarial images under L∞ distance metric.

Goodfellow等[6]提出了在L∞距离度量下生成敌对图像的方法。

Given the input X the fast gradient sign method generates the adversarial images with :

给定输入X，快速梯度符号法生成对应的图像:

where $\epsilon$ is chosen to controls the perturbation’s amplitude.

其中选择 $\epsilon$ 来控制摄动的振幅。

The Eqn.2 indicates that all pixels of the input X are shifted simultaneously in the direction of the gradient with a single step. This method is simpler and faster than other methods, but lower attack success rate since at the beginning, it was designed to be fast rather than optimal.

Eqn.2表示输入X的所有像素沿梯度方向进行单步移动。该方法比其他方法更简单、更快，但攻击成功率较低，因为从一开始就被设计为快速而不是最优。

2.1.3 Iterative Gradient Sign

2.1.3迭代梯度符号

An straightforward way introduced by Kurakin et al.[11] to extend the FGSM is to apply it several times with a smaller step size α and the intermediate result is clipped by the ε. Formally,

由Kurakin等人[11]介绍的一种扩展FGSM的直接方法是用更小的步长发散应用它几次，中间的结果被直接压缩。在形式上,

2.1.4 DeepFool Attack

2.1.4 DeepFool攻击

DeepFool is a non-targeted attack introduced by MoosaviDezfooli et al [16]. It is one of methods to apply the minimal perturbation for misclassification under the L2 distance metric. The method performs iterative steps on the adversarial direction of the gradient provided by a locally linear approximation of the classifier until the decision hyperplane has crossed. The objective of DeepFool is

DeepFool是由moosavidez愚i等人[16]引入的一种非目标攻击。这是在L2距离度量下应用最小扰动进行误分类的方法之一。该方法在分类器的局部线性逼近所提供的梯度的对抗性方向上进行迭代，直到决策超平面相交。DeepFool的目标是

Although it is different from L-BFGS, the attack can also be seen as a first-order method to craft adversarial perturbation.

虽然它不同于L-BFGS，攻击也可以视为一阶方法来制造敌对摄动。

2.1.5 Jacobian-Based Saliency Map Attack(JSMA)

2.1.5基于jacobian的显著性映射攻击(JSMA)

The targeted attack is also a gradient-based method which uses the gradient to compute a saliency score for each pixel. The saliency score reflects how strongly each pixel can affect the resulting classification. Given the saliency map computed by the model’s Jacobian matrix, the attack greedily modifies the most important pixel at each iteration until the prediction has changed to a target class. This attack seeks to craft adversarial perturbation under L0 distance metric [21].

目标攻击也是一种基于梯度的方法，利用梯度计算每个像素的显著性得分。显著性分数反映了每个像素对分类结果的影响程度。根据模型的雅可比矩阵计算出显著性映射，攻击会在每次迭代时贪婪地修改最重要的像素点，直到预测结果变为目标类为止。这种攻击试图在距离度量[21]下制造对抗性扰动。

2.1.6 Carlini and Wagner Attack(CW)

2.1.6 Carlini 和 Wagner 攻击（CW）

There are three attacks for the L0, L2 and L∞ distance metric proposed by Carlini et al.[2] Here, we just give a brief description of L2 attack. The objective of L2 attack is

由Carlini等人提出的L0、L2和L∞距离度量有三种攻击。在这里，我们只是简单地描述L2攻击。L2攻击的目的是

where the loss function l is defined as

其中损失函数l定义为

The Z is the logits of a given model and κ is used to control the confidence of adversarial examples. As κ increases, the more powerful adversarial examples become. The constant c can be chosen with binary search which is similar to λ in the L-BFGS attack.

Z是一个给定的分对数模型和κ用于控制敌对的例子的信心。随着κ增加,敌对的例子会变得更加强大。常数c可以通过二分查找来选择，类似于L-BFGS攻击中的伪变量。

The Carlini and Wagner’s attack is proved by the authors that it is superior to other published attacks. Then, all the three attacks should be taken into consideration to defense against.

Carlini和Wagner的攻击被作者证明优于其他发表的攻击。然后，防御时要考虑这三种攻击。

In addition, we use CW-L0, CW-L2, CW-L∞ to represent the attack for the L0, L2 and L∞ distance metric respectively in the following experiments.

此外，在接下来的实验中，我们分别使用CW-L0、CW-L2、CW-L∞来表示对L0、L2和L∞距离度量的攻击。

2.2 Generative Adversarial Nets

2.2生成的对抗性网络

Generative Adversarial Net (GAN) is a framework incorporating an adversarial discriminator into the procedure of training generative models. There are two models in the GAN: a generator G that is optimized to estimate the data distribution and a discriminator D that aims to distinguish between samples from the training data and fake samples from G.

生成式对抗网(GAN)是一个将对抗甄别器整合到生成模型训练过程中的框架。在GAN中有两个模型:一个是用于优化估计数据分布的生成器G，另一个是用于区分来自训练数据的样本和来自G的假样本的鉴别器D。

The objective of GAN can be formulized as a minimax value function V (G, D):

GAN的目标可以化为极小值函数V (G, D):

GAN has been known to be unstable to train, often resulting in generators that produce nonsensical outputs since it is difficult to maintain a balance between the G and D.

GAN 在训练中是不稳定的，经常导致发电机产生荒谬的输出，因为很难保持G和D之间的平衡。

The Deep Convolutional Generative Adversarial Nets (DCGAN)[23] is a good implementation of the GAN with convolutional networks that make them stable to train in most settings.

深度卷积生成对抗网 [23]是GAN的一个很好的实现，它带有卷积网络，可以使它们在大多数设置中稳定地训练。

Our model in this paper is implemented based on DCGAN owing to its stability. The details will be discussed in the following section.

由于模型的稳定性，本文的模型是基于DCGAN实现的。细节将在下一节中讨论。

3 Our Approach

3我们的方法

The fundamental idea of defending against adversarial examples is to eliminate or damage of the trivial perturbations of the input before being recognized by the target model.

防御敌对的例子的基本思想是在被目标模型识别之前消除或破坏输入的小扰动。

The infinitesimal difference of adversarial image and clean image can be formulated as:

对象与干净象的极小差分可表示为:

Ideally, the perturbations η can be got rid of from Xadv . That means the distribution of Xadv is highly consistent with X.

理想情况下，可以从Xadv中消除扰动。这意味着Xadv的分布与X高度一致。

The global optimality of GAN is the consistence of the generative distribution of G with samples from the data generating distribution:

GAN的全局最优性为G的生成分布与来自数据生成分布的样本的一致性:

The procedure of converging to a good estimator of pdata coincides with the demand of the elimimation of adversarial perturbations η.

收敛到一个好的pdata估计量的过程与对敌扰动消除的要求是一致的。

Based on the above analysis, a novel framework based GAN to eliminate the adversarial perturbations is proposed. We name this class of architectures defending against adversarial examples based on GAN, adversarial perturbation elimination with GAN(APE-GAN), as shown in Figure 2.

基于以上分析，提出了一种基于提出了消除对抗性摄动的GAN。我们将这类抵御基于GAN的敌对示例的架构命名为使用GAN(APE-GAN)消除敌对干扰，如图2所示。

The APE-GAN network is trained in an adversarial setting. While the generator G is trained to alter the perturbation with tiny changes to the input examples, the discriminator D is optimized to seperate the clean examples and reconstructed examples without adversarial perturbations obtained from G. To achieve this, a task specified fusion loss function is invented to make the adversarial examples highly consistent with original clean image manifold.

APE-GAN 网络是在敌对环境下训练的。在发电机G训练改变微小变化的扰动输入的例子,鉴别器D优化分离干净的例子和重建的例子没有敌对的扰动来自G .为了达到这个目标,一个任务指定融合损失函数是发明使敌对的例子非常符合原来的清廉形象歧管。

3.1 Architecture

3.1体系结构

The ultimate goal of APE-GAN is to train a generating function G that estimates for a given adversarial input image $X^{adv}$ its corresponding $\widehat{X}$ counterpart. To achieve this, a generator network parametrized by $θ_{G}$ is trained. Here $θ_{G}$ denotes the weights and baises of a generate network and is obtained by optimizing an adversarial perturbation elimination specified loss function lape. For training images Xadv k obtained by applying FGSM with corresponding original clean image $X_{k}$ , $k = 1, ..., N$ , we solve:

APE-GAN的终极目标是培养为一个给定的生成函数G估计敌对的输入图像 $X^{adv}$ 对应 $\widehat{X}$ 同行。为了实现这一目标，我们训练了一个由小波变换G参数化的发生器网络。其中，表示生成网络的权值和权重，通过优化一个对抗性摄动消除指定损失函数lape得到。对于使用FGSM得到的训练图像Xadv k与对应的原始干净图像 $X_{k}$ , $k = 1, ..., N$ ，则求解:

A discriminator network DθD along with GθG is defined to solve the adversarial zero sum problem:

定义了一个判别器网络D和G和G，以解决对抗性零和问题:

The general idea behind this formulation is that it allows one to train a generative model G with the goal of deceiting a differentiable discriminator D that is trained to tell apart reconstructed images G(Xadv) from original clean images. Consequently, the generator can be trained to produce reconstructed images that are highly similar to original clean images, and thus D is unable to distinguish them.

这个公式背后的一般想法是，它允许训练一个生成模型G，目的是欺骗一个可区分的鉴别器D，该鉴别器被训练来区分重建图像G(Xadv)和原始的干净图像。因此，经过训练后，可以生成与原始干净图像高度相似的重建图像，使得D无法对其进行区分。

The general architecture of our generator network G is illustrated in Figure 2. Some convolutional layers with stride = 2 are leveraged to get feature maps with lower resolution and followed by some deconvolutional layers with stride = 2 to recover the original resolution.

我们的生成网络G的总体架构如图2所示。利用stride = 2的卷积层得到低分辨率的feature map，然后利用stride = 2的反卷积层恢复原始分辨率。

To discriminate original clean images from reconstructed images, we train a discriminator network. The general architecture is illustrated in Figure 2. The discriminator network is trained to solve the maximization problem in Equation 10. It also contains some convolutional layers with stride = 2 to get some high-level feature maps, two dense layers and a final sigmoid activation function to obtain a probability for sample classification.

为了区分原始的干净图像和重建图像，我们训练一个鉴别器网络。一般的体系结构如图2所示。通过训练鉴别器网络来求解公式10中的最大化问题。它还包含了一些convolutional layer, stride = 2可以得到一些高级的feature map，两个稠密的layer和一个最终的sigmoid activation function，可以得到样本分类的概率。

The specific architectures on MNIST,CIFAR10 and ImageNet are introduced in the experimental setup.

在实验装置中介绍了MNIST、CIFAR10和ImageNet的具体结构。

3.2 Loss Function

3.2损失函数

3.2.1 Discriminator Loss

3.2.1鉴别器的损失

According to Equation 10, the loss function of discriminator , ld is designed easily:

根据公式10，鉴频器的损耗函数，ld易于设计:

3.2.2 Generator Loss

3.2.2生成损失

The definition of our adversarial perturbation elimination specified loss function lape is critical for the performance of our generator network to produce images without adversarial perturbations. We define lape as the weighted sum of several loss functions as:

对抗性摄动消除指定损失函数lape的定义对于生成无对抗性摄动图像的生成网络的性能至关重要。我们将lape定义为几个损失函数的加权和:

which consists of pixel-wise MSE(mean square error) loss and adversarial loss. 它包括像素级均方误差损失和对抗性损失。

Content Loss: Inspired by image super-resolution method[13], the pixel-wise MSE loss is defined as:
- 内容丢失:受图像超分辨率[13]方法启发，像素级MSE丢失定义为:
Adversarial perturbations can be viewed as a special noise constructed delicately. The most widely used loss for image denoising or super-resolution will be able to achieve satisfactory results for adversarial perturbation elimination.
- 对抗性扰动可以看作是一种精心构造的特殊噪声。在图像去噪或超分辨率中应用最广泛的损失在对敌摄动消除时也能取得满意的效果。

表1:由目标模型和的五种方法生成的对抗性例子的错误率(%) 在MNIST, CIFAR10和Imagenet上的APE-GAN。在实验中报告了目标模型对干净图像的错误率。

4 Evaluation

4评价

The L-BFGS, DeepFool, JSMA, FGSM, CW including CW-L0 CW-L2 CW-L∞ attacks introduced in the related-work are resisted by APE-GAN on three standard datasets: MNIST[12], a database of handwritten digits has 70,000 28x28 gray images in 10 classes(digit 0-9), CIFAR10[10], a dataset consists of 60,000 32x32 colour images in 10 classes, and ImageNet[3], a largeimage recognition task with 1000 classes and more than 1,000,000 images provided.

在三个标准数据集:MNIST[12]上，一个手写体数字数据库拥有10个类别(数字)的70,000张28x28灰度图像，该数据库中引入的L-BFGS、DeepFool、JSMA、FGSM、CW攻击包括CW- l0 CW- l2 CW- l∞攻击被APE-GAN抵抗 0-9)， CIFAR10[10]，一个数据集包含60000个32x32彩色图像，10个类，ImageNet[3]，一个有1000个类以上的大型图像识别任务 1000000张图片。

It is noteworthy that the adversarial samples cannot be saved in the form of picture, since discretizing the values from a real-numbered value to one of the 256 points seriously degrades the quality. Then it should be saved and loaded as float32.

值得注意的是，对抗性样本不能以图片的形式保存，因为将数值从实数离散到256个点中的一个会严重降低质量。然后将其保存并加载为float32。

4.1 Experimental Setup

4.1实验装置

4.1.1 Input

4.4.1输入

The input samples of target model can be classified into adversarial input obtained from attack approaches and benign input which is taken into account by the traditional deep learning framework including clean images and clean images added with random noise. Adding random noise to original clean images is the common trick used in data augmentation to improve the robustness, but does not belong to the standard training procedures of target model. Hence, it is not shown in Figure 2.

目标模型的输入样本可以分为攻击方法得到的对抗性输入和传统深度学习框架考虑的良性输入，包括干净图像和加入随机噪声的干净图像。在原始干净图像中加入随机噪声是数据增强中提高鲁棒性的常用方法，但不属于目标模型的标准训练程序。因此，它没有显示在图2中。

Adversarial Input: The FGSM and JSMA attacks have been implemented in the cleverhans v.1 [17] which is a Python library to benchmark machine learning systems’ vulnerability to adversarial examples and the L-BFGS attack and DeepFool attacks have been implemented in the Foolbox[24] which is a Python toolbox to create adversarial examples that fool neural networks. The code of CW attack has been provided by Carlini et al [2]. We experiment with $\epsilon = 0.3$ on MNIST, $\epsilon = 0.1$ on CIFAE10, $\epsilon = 8/255$ on ImageNet for the FGSM attack, κ = 0 on the three datasets for the CW-L2 attack and default parameters of other attacks are utilised.
- 敌对的输入:FGSM和JSMA攻击已经实现cleverhans 1[17]这是一个Python库基准机器学习系统脆弱性敌对的例子和L-BFGS攻击和DeepFool攻击已经实现Foolbox[24]这是一个Python傻瓜神经网络工具箱创建敌对的例子。CW攻击的代码已经由Carlini等[2]提供。我们在MNIST实验与 $\epsilon = 0.3$ , $\epsilon = 0.1$ X = CIFAE10, $\epsilon = 8/255$ 的ImageNet FGSM攻击,κ= 0的三个数据集CW-L2攻击和其他攻击利用的缺省参数。
Benign Input: The full test set of MNIST and CIFAR10 are utilized for the evaluation while results on ImageNet use a random sample of 10,000 RGB inputs from the test set. In addition, Gaussian white noise of mean 0 and variance 0.05 is employed in the following.
- 良性输入:利用MNIST和CIFAR10的完整测试集进行评估，ImageNet上的结果使用来自测试集的10000个RGB输入的随机样本，并采用均值为0、方差为0.05的高斯白噪声。

4.1.2 Target Models

4.1.2目标模型

表2:MNIST、CIFAR10和Imagenet上的目标模型和APE-GAN良性输入的错误率(单位为%)。这里的目标模型分别为model C、DenseNet40、InceptionV3，与FGSM攻击MNIST、CIFAR10和Imagenet的目标模型相同

In order to provide the most accurate and fair comparison, whenever possible, the models provided by the authors or libraries should be used.

为了提供最准确和公平的比较，只要可能，应该使用作者或图书馆提供的模型。

MNIST: We train a convolutional networks (denoted A in the Appendix) for L-BFGS and DeepFool attacks. For CW attack, the model is provided by Carlini (denoted B in the Appendix). For FGSM and JSMA attacks, the model is provided by cleverhans (denoted C in the Appendix). The 0.9%, 0.5% and 0.8% error rates can be achieved by the models A, B and C on clean images respectively, comparable to the state of the art.
- MNIST:我们训练一个卷积网络(在附录中表示为a)用于L-BFGS和DeepFool攻击。对于CW攻击，模型由Carlini提供(在附录中表示B)。对于FGSM和JSMA攻击，该模型由cleverhans提供(附录中用C表示)。模型A、B和C分别可以在干净图像上实现0.9%、0.5%和0.8%的错误率，与目前的技术水平相当。
CIFAR10: ResNet18[7] is trained by us for L-BFGS and DeepFool attack. For CW attack, the model is provided by Carlini (denoted D in the Appendix). For FGSM and JSMA attacks, DenseNet[8] with depth=40 is trained. The 7.1%, 20.2%∗ and 9.9% error rates can be achieved by the models ResNet18, D and DenseNet40 on clean images respectively.
- CIFAR10: ResNet18[7]是我们训练的L-BFGS和DeepFool攻击。对于CW攻击，模型由Carlini提供(在附录中用D表示)。对于FGSM和JSMA攻击，训练深度=40的DenseNet[8]。∗和9.9%的错误率可以通过ResNet18模型实现， D和DenseNet40分别用于清洁图像。
ImageNet: We use ResNet50 one pre-trained networks for L-BFGS and DeepFool attacks. For other three attacks, another pre-trained network InceptionV3 is leveraged [25]. ResNet50 achieves the top1 error rate 24.4% and the top-5 error rate 7.2% while InceptionV3 achieves the top-1 error rate 22.9% and the top-5 error rate 6.1% on clean images.
- ImageNet:我们使用ResNet50一个预先训练的网络进行L-BFGS和DeepFool攻击。对于其他三种攻击，使用了另一个预先训练过的网络InceptionV3。ResNet50的错误率达到前1的24.4%，前5的错误率为7.2% 在干净图像上，InceptionV3实现了前1的错误率22.9%和前5的错误率6.1%。

4.1.3 APE-GAN

Three models are trained with the APE-GAN architecture on MNIST, CIFAR10 and ImageNet(denoted APEGANm, APE-GANc, APE-GANi in the Appendix). The full training set of MNIST and CIFAR10 are utilized for the training of APE-GANm and APE-GANc respectively while a random sample of 50,000 RGB inputs from the training set of ImageNet make a contribution to train the APE-GANi .

在MNIST、CIFAR10和ImageNet上使用APE-GAN架构训练了三个模型(附录中表示APEGANm、APE-GANc、APE-GANi)。利用MNIST和CIFAR10的完整训练集分别训练APE-GANm和APE-GANc，而从ImageNet的训练集中随机抽取50000 RGB输入的样本对训练 APE-GANi 。

MNIST: APE-GANm is trained for 2 epochs on batches of 64 FGSM samples with $\epsilon = 0.3$ , input size = (28,28,1).
- 对64个FGSM样本批次进行2个epoch的训练， $\epsilon = 0.3$ ，输入大小=(28,28,1)。
CIFAR10: APE-GANc is trained for 10 epochs on batches of 64 FGSM samples with $\epsilon = 0.1$ , input size = (32,32,3).
- CIFAR10:在 $\epsilon = 0.1$ ，输入尺寸=(32,32,3)的64个FGSM样本批次上，对ap - ganc进行10个epoch的训练。
ImageNet: APE-GANi is trained for 30 epochs on batches of 16 FGSM samples with $\epsilon = 8 / 255$ , input size = (256,256,3). However, as we all know, the input size of ResNet50 is 224 224 and the InceptionV3 is 299 299. So we use the resize operation to handle this.
- ImageNet: apee - gani在 $\epsilon = 8 / 255$ ，输入大小=(256,256,3)的16个FGSM样本批次上接受30个epoch的训练。但是，我们都知道ResNet50的输入尺寸是224 224,InceptionV3是299 299。我们使用调整大小操作来处理这个。

The straightforward method to train the generator and the discriminator is update both in every batch. However, the discriminator network often learns much faster than the generator network because the generator is more complex than distinguishing between real samples and fake samples. Therefore, generator should be run twice with each iteration to make sure that the loss of discriminator does not go to zero. The learning rate is initialized with 0.0002 and Adam[9] optimizer is used to update parameters and optimize the networks. The weights of the adversarial perturbation elimination specified loss ξ1 and ξ2 used in the Eqn.12 are fixed to 0.7 and 0.3 separately.

直接训练生成器和鉴别器的方法是每批更新。然而，鉴别器网络的学习速度往往比生成器网络快得多，因为生成器比鉴别真样本和假样本要复杂得多。因此，发生器在每次迭代时应运行两次，以确保鉴别器的损耗不会趋近于零。学习率用0.0002初始化，Adam[9]优化器用于更新参数和优化网络。指定的对抗性的扰动消除重量损失ξ1,ξ2用于Eqn.12分别固定在0.7和0.3。

The training procedure of the APE-GAN needs no knowledge of the architecture and parameters of the target model.

APE- gan的训练过程不需要了解目标模型的结构和参数。

4.2 Results

4.2结果

4.2.1 Effectiveness:

4.2.1有效性准备:

Adversarial Input Table 1 indicates that the error rates of adversarial inputs are significantly decreased after its perturbation is eliminated by APE-GAN. Among all the attacks, the CW attack is more offensive than the others, and among the three attacks of CW, the CW-L0 is more offensive. The error rate of FGSM is greater than the L-BFGS which may be caused by different target models. As it is shown in Figure 3, the aggressivity of adversarial examples can be eliminated by APE-GAN even though these is imperceptible differences between (a) and (b). In addition, the adversarial examples generated from FGSM with different X are resisted and the result is shown in Table 3 and Table 4.
- 从表1可以看出，通过APE-GAN消除了对抗性输入的扰动后，对抗性输入的错误率显著降低。在所有的攻击中，CW的攻击比其他的更具有进攻性，而在CW的3次攻击中，CW- 10的攻击更具有进攻性。FGSM的错误率大于L-BFGS，这可能是由不同的目标模型引起的。如图3所示,可以消除敌对的例子的攻击力APE-GAN即使这些是听不清(a)和(b)之间的差异。另外,从FGSM不同X生成的对抗性的例子是抵制和结果表3和表4所示。
Benign Input The error rate of clean images and the clean images added with random Gaussian noise is shown in Table 2. Actual details within the image can be lost with multiple levels of convolutional and down-sampling layers which has a negative effect on the classification. However, Table 2 indicates that there is no marked increase in the error rate of clean images. Meanwhile, APE-GAN has a good performance on resisting the random noise. Figure 4 shows that the perturbation generated from random Gaussian noise is irregular and all in a muddle while the perturbation obtained from the FGSM attack is regular and intentional. However the perturbation, whether regular or irregular, can be eliminated by APE-GAN.
- 干净图像和加入随机高斯噪声的干净图像的错误率如表2所示。多层卷积和下采样层会丢失图像内部的实际细节，对分类产生负面影响。但从表2可以看出，干净图像的错误率没有明显增加。同时，APE-GAN具有良好的抗随机噪声性能。从图4可以看出，随机高斯噪声产生的扰动是不规则的，都是混乱的，而FGSM攻击产生的扰动是规则的，有意图的。然而，不论是规则的扰动还是不规则的扰动，都可以被猿根消除。

In summary, APE-GAN has the capability to provide a good performance to various input, whether adversarial or benign, on three benchmark datasets.

总之，APE-GAN能够在三个基准数据集上为各种输入(无论是敌对的还是良性的)提供良好的性能。

4.2.2 Strong Applicability:

4.2.2强大的适用性:

The experimental setup of target models indicates that there is more than one target model designed in experiments on MNIST, CIFAR10 and ImageNet respectively. Table 1 demonstrates that APE-GAN can tackle adversarial examples for different target models. Actually, it can provide a defense without knowing what model they are constructed upon. Therefore, we can conclude that the APE-GAN possesses strong applicability.

目标模型的实验设置表明，MNIST、CIFAR10和ImageNet实验中设计的目标模型不止一个。表1显示了APE-GAN可以处理不同目标模型的敌对示例。实际上，它可以在不知道它们建立在什么模型上的情况下提供防御。由此可见，猿型gan具有较强的适用性。

5 Discussion and Future Work

5讨论和未来工作

Pre-processing the input to eliminate the adversarial perturbations is another appealing aspect of the framework which makes sure there is no conflict between the framework and other existing defenses. Then APE-GAN can work with other defenses such as adversarial training together. Another method APE-GAN followed by a target model trained using adversarial training is experimented. The results on MNIST and CIFAR10 have been done shown in Table 7, 8, 9 in the Appendix. The adversarial examples leveraged in Table 9 in the Appendix are generated from Iterative Gradient Sign with N = 2. Actually, the FGSM leveraged to craft the adversarial examples of Table 8 is identical to the Iterative Gradient Sign with N = 1. Compared with Table 8, Table 9 indicates that the robustness of target model cannot be significantly improved with adversarial training. However, the combination of APE-GAN and adversarial training makes a notable defence against Iterative Gradient Sign. New combinations of different defenses will be researched in the future work.

对输入进行预处理以消除对抗性干扰是框架的另一个吸引人的方面，它可以确保框架和其他现有防御之间不存在冲突。然后，猿猴gan可以与其他防御措施一起工作，比如对抗训练。另一种方法是用敌手训练的目标模型进行实验。MNIST和CIFAR10的结果见附录中的表7、表8、表9。附录表9中使用的相反示例是由N = 2的迭代梯度符号生成的。实际上，用于制作表8的敌对示例的FGSM与N = 1的迭代梯度符号是相同的。与表8相比，表9表明对抗性训练并不能显著提高目标模型的鲁棒性。然而，猿- gan和对抗性训练的结合对迭代梯度符号有显著的防御作用。在未来的工作中将研究不同防御的新组合。

The core work in this paper is to propose a new perspective of defending against adversarial examples and to first eliminate the adversarial perturbations using a trained network and then feed the processed example to classification networks. The training of this adversarial perturbation elimination network is based on the Generative Adversarial Nets framework. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed approach.

本文的核心工作是提出一种新的防范敌对算例的观点，首先利用训练好的网络消除敌对扰动，然后将处理过的算例反馈给分类网络。对抗性摄动消除网络的训练是基于生成的对抗性网框架。在三个基准数据集上的实验结果证明了该方法的有效性。

References

参考文献

[1] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. arXiv preprint arXiv:1705.07263, 2017. 2

[2] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017. 2, 5, 7, 8

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. FeiFei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 7

[4] Z. Gong, W. Wang, and W.-S. Ku. Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960, 2017. 2

[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. 2

[6] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. 1, 2, 4

[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 8

[8] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 8

[9] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 9

[10] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009. 7

[11] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016. 1, 2, 4

[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 7

[13] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunning- ´ ham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016. 2, 6

[14] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016. 1

[15] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017. 2

[16] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016. 4

[17] N. Papernot, I. Goodfellow, R. Sheatsley, R. Feinman, and P. McDaniel. cleverhans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016. 7

[18] N. Papernot and P. McDaniel. Extending defensive distillation. arXiv preprint arXiv:1705.05264, 2017. 2

[19] N. Papernot, P. McDaniel, and I. Goodfellow. Transferability in machine learning: from phenomena to blackbox attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016. 1

[20] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016. 2

[21] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372– 387. IEEE, 2016. 4

[22] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016. 2

[23] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. 5

[24] J. Rauber, W. Brendel, and M. Bethge. Foolbox v0.8.0: A python toolbox to benchmark the robustness of machine learning models. arXiv preprint, 2017. 7

[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 8

[26] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. 1, 4

[27] F. Tramer, A. Kurakin, N. Papernot, D. Boneh, and P. Mc- ` Daniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017. 2

Previous生成对抗网络 NextDefense- gan:使用生成模型保护分类器免受敌方攻击

Last updated 5 years ago

Was this helpful?

原文连接：

Abstract

摘要

1 Introduction

1介绍

2 Related Work

2相关工作

2.1 Methods Generating Adversarial Examples

2.1产生对抗性实例的方法

2.1.1 L-BFGS Attack

2.1.1 L-BFGS攻击

2.1.2 Fast Gradient Sign Method Attack (FGSM)

2.1.2快速梯度符号攻击(FGSM)

2.1.3 Iterative Gradient Sign

2.1.3迭代梯度符号

2.1.4 DeepFool Attack

2.1.4 DeepFool攻击

2.1.5 Jacobian-Based Saliency Map Attack(JSMA)

2.1.5基于jacobian的显著性映射攻击(JSMA)

2.1.6 Carlini and Wagner Attack(CW)

2.1.6 Carlini 和 Wagner 攻击（CW）

2.2 Generative Adversarial Nets

2.2生成的对抗性网络

3 Our Approach

3我们的方法

3.1 Architecture

3.1体系结构

3.2 Loss Function

3.2损失函数

3.2.1 Discriminator Loss

3.2.1鉴别器的损失

3.2.2 Generator Loss

3.2.2生成损失

4 Evaluation

4评价

4.1 Experimental Setup

4.1实验装置

4.1.1 Input

4.4.1输入

4.1.2 Target Models

4.1.2目标模型

4.1.3 APE-GAN

4.1.3 APE-GAN

4.2 Results

4.2结果

4.2.1 Effectiveness:

4.2.1有效性准备:

4.2.2 Strong Applicability:

4.2.2强大的适用性:

5 Discussion and Future Work

5讨论和未来工作

References

参考文献