学习通用对抗性扰动生成模型

Learning Universal Adversarial Perturbations withGenerative Models

原文链接：

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8424631

GB/T 7714 Hayes J, Danezis G. Learning universal adversarial perturbations with generative models[C]//2018 IEEE Security and Privacy Workshops (SPW). IEEE, 2018: 43-49.

MLA Hayes, Jamie, and George Danezis. "Learning universal adversarial perturbations with generative models." 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 2018.

APA Hayes, J., & Danezis, G. (2018, May). Learning universal adversarial perturbations with generative models. In 2018 IEEE Security and Privacy Workshops (SPW) (pp. 43-49). IEEE.

Abstract

摘要

Neural networks are known to be vulnerable to adversarial examples, inputs that have been intentionally perturbed to remain visually similar to the source input, but cause a misclassification. It was recently shown that given a dataset and classifier, there exists so called universal adversarial perturbations, a single perturbation that causes a misclassification when applied to any input. In this work, we introduce universal adversarial networks, a generative network that is capable of fooling a target classifier when it’s generated output is added to a clean sample from a dataset. We show that this technique improves on known universal adversarial attacks.

众所周知，神经网络容易受到敌对的例子的攻击，这些输入被故意干扰以保持视觉上与源输入相似，但却导致了错误的分类。最近的研究表明，给定一个数据集和分类器，存在所谓的普遍对抗式扰动，当应用于任何输入时，单个扰动会导致错误分类。在这项工作中，我们引入了通用对抗性网络，这是一个生成网络，当目标分类器生成的输出被添加到数据集的干净样本中时，它能够欺骗目标分类器。我们表明，该技术改进了已知的普遍对抗性攻击。

I. I NTRODUCTION

I. 介绍。

Machine Learning models are increasingly relied upon for safety and business critical tasks such as in medicine [23], [29], [39], robotics and automotive [27], [31], [38], security [2], [17], [36] and financial [13], [18], [34] applications. Recent research shows that machine learning models trained on entirely uncorrupted data, are still vulnerable to adversarial examples [7], [12], [24], [25], [33], [35]: samples that have been maliciously altered so as to be misclassified by a target model while appearing unaltered to the human eye.

机器学习模型越来越多地用于安全和商业关键任务，如医学[23]、[29]、[18]0、机器人和汽车[27]、[31]、[38]、安全[18]1、[17]、[36]和金融[13]、[18]、[34]应用。最近的研究表明，在完全未被破坏的数据上训练的机器学习模型，仍然容易受到敌对例子[7]、[12]、[24]、[25]、[33]、[35]的攻击:这些样本已经被恶意改变，以致被目标模型误分类，而在人眼看来却没有被改变。

Most work has focused on generating perturbations that cause a specific input to be misclassified, however, it has been shown that adversarial perturbations generalize across many inputs [7], [33]. Moosavi-Dezfooli et al. [20] showed, in the most extreme case, that given a target model and a dataset, it is possible to construct a single perturbation that when applied to any input, will cause a misclassification with high likelihood. These are referred to as universal adversarial perturbations (UAPs).

大多数的工作集中在产生扰动，导致一个特定的输入被错误分类，然而，它已经表明，对抗性扰动推广到许多输入[7]，[33]。moosavie - dez愚蠢等人[20]表明，在最极端的情况下，给定一个目标模型和一个数据集，有可能构造一个单一的扰动，当应用到任何输入时，将导致高可能性的错误分类。这些被称为普遍对抗性扰动(UAPs)。

In this work, we study the capacity for generative models to learn to craft UAPs on image datasets, we refer to these networks as universal adversarial networks (UANs). This is similar to work by Baluja and Fischer [1], who studied the capacity for models to learn to craft adversarial examples. We show that a UAN is able to sample from noise and generate a perturbation such that when applied to any input from the dataset, it will result in a misclassification in the target model. Furthermore, we show perturbations produced by UANs: improve on state-of-the-art methods for crafting UAPs (Section IV-A), have robust transferable properties (Section IV-D), and reduce the success of recently proposed defenses [19] (Section V).

在这项工作中，我们研究生成模型的能力来学习在图像数据集上制作uap，我们将这些网络称为通用对抗网络(UANs)。这与Baluja和Fischer[1]的工作类似，他们研究了模型学习制作对抗例子的能力。我们证明了一个UAN能够从噪音中采样并产生扰动，当应用到数据集的任何输入时，它将导致目标模型中的错误分类。此外，我们还展示了UANs产生的扰动:改进了最先进的uap制作方法(IV-A部分)，具有强大的可转移特性(IV-D部分)，并降低了最近提出的防御[19]的成功(V部分)。

II. B ACKGROUND

2背景

We define adversarial examples and UAPs along with some terminology and notation. We then introduce the threat model considered, and the datasets we use to evaluate the attack.

我们定义了对抗的例子和uap以及一些术语和符号。然后介绍了考虑的威胁模型，以及用于评估攻击的数据集。

A. Adversarial Examples

A. 敌对的例子

Szegedy et al. [33] casts the construction of adversarial examples as an optimization problem. Given a target model, f, and a source input x, which is classified correctly by f as c, the attacker aims to find a perturbation, δ, such that x + δ is perceptually identical to x but f(x + δ) = c. The attacker tries to minimize the distance between the source image and adversarial image under an appropriate measure. The problem space can be framed to find a specific misclassification in a targeted attack, or any misclassification, referred to as a non-targeted attack.

Szegedy等人[33]将对抗性例子的构造转换为优化问题。给定一个目标模型,f,源输入x,正确分类的f c,攻击者的目标是找到一个扰动,δ,x +δ是感知相同但f (x +δ)= c。攻击者试图最小化源图像之间的距离和敌对的图像在一个适当的措施。可以对问题空间进行构造，以查找有目标攻击中的特定错误分类，或任何错误分类，称为非目标攻击。

In the absence of a distance measure that accurately captures the perceptual differences between a source and adversarial image, the p metric is usually minimized [33]. Related work commonly uses the 2 and ∞ metrics [3], [4], [6], [10], [14], [16], [20], [21], [40]. The 2 metric measures the Euclidean distance between two images, while the ∞ metric measures the largest pixel-wise difference between two images (Chebyshev distance). We follow this practice here and construct attacks optimizing under both metrics.

在缺乏一个距离测量，准确地捕获了一个源和敌对图像之间的感知差异，p度量通常最小化[33]。相关工作通常使用2和指标[3]，[4]，[6]，[10]，[14]，[16]，[20]，[21]，[40]。2度量两幅图像之间的欧氏距离，而度量两幅图像之间的最大像素差(切比雪夫距离)。我们在这里遵循这种实践，并在这两个度量标准下构造优化的攻击。

A UAP is an adversarial perturbation that is independent of the source image. Given a target model, f, and a dataset, X, a UAP is a perturbation, δ, such that ∀x ∈ X, x + δ is a valid input and $Pr(f(x + δ) \neq f(x)) = 1 − τ$ , where $0 < τ << 1$ .

UAP是一个与源图像无关的对抗性扰动。给定一个目标模型,f,和一个数据集,X, UAP是扰动,δ,这样∀X∈X, X +δ是一个有效的输入和公关(f (X +δ)= f (X) = 1−τ,0 <τ< < 1。

B. Threat Model

b .威胁模型

We consider an attacker whose goal is to craft UAPs against a target model, f. The adversarial image constructed by the attacker should be visually indistinguishable to a source image, evaluated through either the $\ell_{2}$ or $\ell_{\infty}$ metric.

假设攻击者的目标是针对目标模型f生成UAPs，攻击者所构造的敌对图像应该与源图像在视觉上没有区别，可以通过 $\ell_{2}$ 或 $\ell_{\infty}$ 度量进行评估。

Our attacks assume white-box access to f, as we backpropagate the error of the target model back to the UAN. In line with related work on UAPs [20], we consider a worst-case scenario with respect to data access, assuming that the attacker has knowledge of, and shares access to, any training data samples. We will not discuss the real-world limitations of that assumption here, but will follow that practice.

我们的攻击假设白盒访问f，因为我们将目标模型的错误反向传播回UAN。根据UAPs[20]的相关工作，我们考虑了关于数据访问的最坏情况，假设攻击者知道并共享对任何训练数据样本的访问。在这里，我们将不讨论这种假设的实际限制，但将遵循这种实践。

图1:攻击概况。一个正态分布的随机样本被送入一个UAN。这输出一个扰动，然后缩放并添加到图像中。然后，新的图像被裁剪并送入目标模型中。重要的是，我们没有对训练集的分布做任何假设——产生的扰动对应用它的图像是不可知的。

C. Datasets

c .数据集

We evaluate attacks using two popular datasets in adversarial examples research, CIFAR-10 [15] and ImageNet [28].

我们使用两个在对抗实例研究中流行的数据集——CIFAR-10[15]和ImageNet[28]来评估攻击。

The CIFAR-10 dataset consists of 60,000, 32×32 RGB images of different objects in ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. This is split into 50,000 training images and 10,000 validation images. Our pre-trained models: VGG-19 [30], ResNet-101 [9], and DenseNet [11], used as the target models, score 91.19%, 93.75%, and 95.00% test accuracy, respectively. State-of-the-art models on CIFAR-10 are approximately 95% accurate.

CIFAR-10数据集包含6万，32,32 RGB图像的不同对象，10类:飞机，汽车，鸟，猫，鹿，狗，青蛙，马，船，卡车。它被分成50,000张训练图像和10,000张验证图像。我们的预训练模型:VGG-19[30]、ResNet-101[9]和DenseNet[11]作为目标模型，测试准确率分别为91.19%、93.75%和95.00%。最先进的模型在CIFAR-10大约95%的准确性。

We use the validation dataset of ImageNet, which consists of 50,000 RGB images, scaled to 224×224. The images contain 1,000 classes. The 50,000 images are split into 40,000 training set images and 10,000 validation set images. We ensure classes are balanced, such that any class contains 40 images in the training set and 10 images in the validation set. Our pre-trained models: VGG-19 [30], ResNet-152 [9], and Inception-V3 [32], used as the target models, score 71.03%, 78.40%, and 77.22% top-1 test accuracy, respectively.

我们使用ImageNet的验证数据集，它包含 5万张RGB图像，缩放到224×224。图像包含 1000类。将50,000张图像分成40,000张训练集图像和10,000张验证集图像。我们保证类是均衡的，每个类在训练集中包含40幅图像，在验证集中包含10幅图像。我们的预训练模型:VGG-19[30]、ResNet-152[9]和incepin - v3[32]作为目标模型，分别获得71.03%、78.40%和77.22%的top-1测试准确率。

III. UNIVERSAL ADVERSARIAL NETWORKS

III. 普遍的对抗性的网络

A. Attack Description

A. 攻击描述

An overview of the attack is given in Figure 1. Let a UAN model be denoted by U, and a target model by f. U takes as input a vector, z, sampled from a normal distribution N (0, 1)100, and outputs a perturbation, δ. This is then scaled by a factor $\omega\in(0,\frac{\epsilon}{\lVert \delta \rVert_{p}}]$ , where $\epsilon$ is the maximum permitted perturbation and p = 2 or ∞. In practice, we start with a small $\omega(e.g. \omega=\frac{\epsilon}{\lVert \delta \rVert_{p}})$ and increment this value whenever the training loss plateaus. The scaled perturbation $δ^{\prime} = ω · δ$ , is added to an image x from a dataset X, to produce an adversarial image. This is then clipped into the target model’s input range before being fed into the target model, f, which outputs a probability vector, $ρ^{1}$ . If $arg ~max_{i} f(x) \neq arg ~max_{i} f(δ^{\prime} + x)$ , a successful adversarial example has been found. Since U(z) is not conditioned on any image in the dataset, U learns how to construct image independent adversarial perturbations, namely universal adversarial perturbations.

图1给出了该攻击的概述。让一个用U表示UAN模型，f的目标模型，U以一个采样自正态分布的向量z作为输入 N(0,1)100，并输出一个扰动，令值。然后把它按比例缩放 $\omega\in(0,\frac{\epsilon}{\lVert \delta \rVert_{p}}]$ ,然而 $\epsilon$ 为最大允许摄动，且p = 2或∞。实际上，我们从一个小的开始 $\omega(e.g. \omega=\frac{\epsilon}{\lVert \delta \rVert_{p}})$ 当训练损失达到顶峰时，增加这个值。缩小的扰动 $δ^{\prime} = ω · δ$ ，添加到来自数据集x的图像x中，产生敌对的形象然后它被剪切到目标模型的输入范围内，然后被输入到目标模型f中，f输出一个概率向量， $ρ^{1}$ 。如果 $arg ~max_{i} f(x) \neq arg ~max_{i} f(δ^{\prime} + x)$ 一个找到了成功的对抗性实例。由于U(z)不以数据集中的任何图像为条件，U学习如何构造图像独立的对敌摄动，即通用对敌摄动。

Given an input x ∈ X, let the class label predicted by f be c0. For non-targeted attacks, any misclassification in the target model suffices, thus, the non-targeted attack aims to maximize the most probable predicted class other than c0. Our non-targeted loss function is adapted from works by Carlini and Wagner [4] and Chen et al. [5], and is given by:

给定输入x∈x，设f预测的类标签为c0。对于非目标攻击，目标模型中的任何错误分类都是足够的，因此非目标攻击的目标是最大可能的预测类别，而不是c0。我们的非靶向损失函数改编自Carlini和Wagner[4]以及Chen等人[5]的作品，给出如下:

The first term in (1), Lf s, is minimized when the adversarial predicted class is not c0. This is adapted from the Carlini and Wagner loss function [4] that introduces a confidence threshold, κ. If we want universal adversarial perturbations that cause misclassifications with high confidence, we stop minimizing only when:

当预测的对敌类不为c0时，(1)中的第一项Lf s最小。这是由Carlini和引入置信阈值的瓦格纳损失函数[4]， κ。如果我们想要引起高度可靠的错误分类的普遍对抗性扰动，只有当:

In specifying a confidence threshold for adversarial examples, (1) becomes:

在为敌对示例指定置信阈值时， (1)就变成:

In all experiments we set κ = 0, and so stop optimizing once an adversarial example is found. To minimize the perturbation applied to an image, Lf s is summed with a distance loss, $L_{dist} = α · \lVert \delta^{\prime} \rVert_{p}$ , where α ∈ R+; this minimizes the norm of the universal adversarial perturbation. The logarithmic term in $L_{fs}$ is necessary since most target models have a skewed probability distribution, with one class prediction dominating all others, thus the logarithmic term reduces the effect of this dominance.

在所有的实验中我们将κ= 0,所以停止优化一旦发现一个敌对的例子。为使对图像的摄动最小， $L_{fs}$ 与距离损失相加， $L_{dist} = α · \lVert \delta^{\prime} \rVert_{p}$ 这使普遍对抗性扰动的规范最小化。 $L_{fs}$ 中的对数项是必要的，因为大多数目标模型的概率分布是偏态的，一类预测占主导地位，因此对数项降低了这种优势地位的影响。

For a targeted attack, we compute a universal adversarial perturbation that transforms any image to a chosen class, c. Under this setting, we optimize using the follow loss function:

对于一个有目标的攻击，我们计算一个通用的对抗性扰动，它将任何图像转换为一个选定的类c。在此设置下，我们使用以下损失函数进行优化:

The full description of the UAN model is given in Table I and hyperparameters used in experiments are given in Table II. We define the relative perturbation, $ζ_{p} = \frac{\lVert \delta^{\prime} \rVert_{p}}{\lVert x \rVert_{p}}$ ; the value of the norm of $\delta^{\prime}$ over the norm of the original image, x. We set $ζ_{p} = 0.04$ in all experiments 2 3. For all experiments in Section IV, we report the error rate of the target model on adversarial images; a perfect attack would achieve an error rate of 1.00, while a perfect classifier achieves an error rate of 0.00.

Table I给出了对UAN模型的完整描述，Table II给出了实验中使用的超参数。我们定义相对扰动， $ζ_{p} = \frac{\lVert \delta^{\prime} \rVert_{p}}{\lVert x \rVert_{p}}$ ; $\delta^{\prime}$ 的模的值大于原始图像的模的值，我们 $ζ_{p} = 0.04$ 在所有实验中设置2 3。对于第四节中的所有实验，我们报告了目标模型对敌对图像的错误率;完美攻击的错误率为1.00，而完美分类器的错误率为0.00。

I V. E VA L UAT I O N

IV. 评价

A. Comparison with previous work

A. 与以往工作比较

We now compare our method for crafting UAPs with two state-of-the-art methods:

现在我们用两种最先进的方法来比较我们制作uap的方法:

Moosavi-Dezfooli et al. [20] constructs a UAP iteratively; at each step an input is combined with the current constructed UAP, if the combination does not fool the target model, a new perturbation with minimal norm is found that does fool the target model. The attack terminates when a threshold error rate is met.
- moosavi - dez愚i等[20]迭代构造UAP;在每一步，输入与当前构造的UAP相结合，如果该组合没有欺骗目标模型，则发现一个新的具有最小范数的扰动确实欺骗了目标模型。当达到阈值错误率时，攻击终止。
Mopuri et al. [22] develop a method for finding a UAP for a target model that is independent of the dataset. They construct a UAP by first starting with random noise and iteratively update it to over-saturate features learned at successive layers in the target model, causing neurons at each layer to output useless information to cause the desired misclassification. They optimize the UAP by adjusting it with respect to the loss term:
- Mopuri等人[22]开发了一种为独立于数据集的目标模型寻找UAP的方法。他们构造一个UAP，首先从随机噪声开始，然后迭代更新它，使目标模型中连续层学习到的特征过于饱和，导致每一层的神经元输出无用的信息，导致所需的错误分类。他们通过根据损失项调整UAP来优化UAP:
- where, $\bar{l}_{i}(\delta)$ is the average of the output at layer i for perturbation δ, and γ is the maximum permitted perturbation.
- 其中， $\bar{l}_{i}(\delta)$ 是扰动对管控的第i层输出的平均值，而准确是允许的最大管控。

Table III compares our UAN method of generating UAPs against the two attacks described above for both CIFAR-10 and ImageNet, in a non-targeted attack setting. We consistently outperform Mopuri et al.’s [22] attack and outperform the Moosavi-Dezfooli et al. [20] attack in ten of the twelve experiments.

表III比较了在非目标攻击设置下，我们生成UAPs的UAN方法与上述针对CIFAR-10和ImageNet的两种攻击的对比。我们的表现一直优于Mopuri等人的[22]攻击，也优于 moosavie - dez愚蠢等人在12个实验中的10个[20]攻击。

B. Transferability

b可转移性

An adversarial image is transferable if it successfully fools a model that was not its original target. Transferability is a yardstick for the robustness of adversarial examples, and is the main property used by Papernot et al. [24], [25] to construct black-box adversarial examples. They construct a white-box attack on a local target model that has been trained to replicate the intended target models decision boundaries, and show that the adversarial examples can successfully transfer to fool the black-box target model.

一个敌对的形象是可以转移的，如果它成功地愚弄了一个不是它最初的目标的模型。可转移性是对抗算例鲁棒性的一个衡量标准，也是Papernot等[24]、[25]构造黑箱对抗算例的主要特性。他们构造了一个对局部目标模型的白盒攻击，该局部目标模型经过训练可以复制预定目标模型的边界决策，并证明了对抗性的例子可以成功地转移到黑盒目标模型上。

To measure the transferability properties of perturbations crafted by a UAN, we create 10,000 adversarial images (constructed via the $\ell_{\infty}$ metric) - one for each image in the CIFAR-10 validation set - and apply them to a target model that was not used to train the UAN. Table IV presents results for transferability of a non-targeted attack on three target modelsVGG-19, ResNet-101, and DenseNet. We find that UAPs crafted using a UAN do transfer to other models. For example, a UAN trained on VGG-19, and evaluated on ResNet-101, the error rate is 61.2%, a drop of just 5.4% from evaluating on the original target model (VGG-19).

为了测量由一个UAN制作的扰动的可转移性，我们创建了10,000张敌对的图像(通过 $\ell_{\infty}$ 度量构建)——CIFAR-10验证集中的每张图像各一张——并将它们应用到一个未用于训练UAN的目标模型上。表IV给出了对三种目标模型(vggg -19、ResNet-101和DenseNet)的非目标攻击的可转移性结果。我们发现使用一个UAN制作的UAPs确实可以转移到其他模型上。例如，一个在vgr -19上接受培训并在ResNet-101上评估的UAN，错误率为61.2%，仅比最初的目标模型(vgr -19)的评估降低了5.4%。

We also measure the capacity for a UAN to learn to fool an ensemble of target models. We trained a UAN against VGG19, ResNet-101, and DenseNet, simultaneously, on CIFAR-10, where the UAN loss function is a linear combination of the losses of each target model. From Table IV, we see that a UAN trained against an ensemble of target models is able to fool at comparable rates to single target models.

我们还测量了一个UAN去欺骗一组目标模型的能力。我们同时在ci远远10上针对VGG19、ResNet-101和DenseNet训练了一个UAN，其中的UAN损失函数是每个目标模型损失的线性组合。由表四可知，a 针对一组目标模型训练的UAN能够以与单一目标模型相当的比率愚弄。

C. GeneralizabilityC. Generalizability

c .普遍性

Moosavi-Dezfooli et al. [20] have shown that UAPs are not unique; there exists many candidates that perform equally well against a target model. If a UAN is truly modeling the distribution of UAPs the output should not be unique. In Figure 3, we measure the MSE (mean square error) and SSIM (structural similarity index) [37] of U(z1), U(z2) for $z1, z2 ← N (0, 1)^{100}, z1 \neq z2$ , at successive training steps, for the ImageNet dataset. Since we expect a high degree of structure in a UAP, SSIM is measured in addition to MSE, as it has been argued that MSE does not map well to a human’s perception of image structure [26], [37].

Moosavi-Dezfooli等人的[20]表明UAPs不是唯一的;有许多候选人在目标模型中表现得同样好。如果一个UAN真的对UAPs的分布进行建模，那么输出就不应该是唯一的。在图3中，我们测量了ImageNet数据集在连续训练步骤中，U(z1)， U(z2)的MSE(均方误差)和SSIM(结构相似度指数)[37]， $z1, z2 ← N (0, 1)^{100}, z1 \neq z2$ 。由于我们预期UAP中存在高度的结构，因此除了MSE之外还测量了SSIM，因为有人认为MSE不能很好地映射到人类对图像结构[26]，[37]的感知。

Fig. 2: CIFAR-10 2 targeted attack. Each figure shows the error rate as the size of the adversarial perturbation is increased. This can be interpreted as the success rate of fooling the target model into classifying any image in CIFAR-10 as the chosen class.

图2:CIFAR-10 2定点攻击。每个图都显示了敌对扰动的大小增加时的错误率。这可以解释为欺骗目标模型将任何CIFAR-10中的图像分类为所选类的成功率。

TABLE IV: Error rates for non-targeted CIFAR-10 attack, under the ∞ metric. UAPs are constructed using row models and tested against pre-trained column models.

表四:在∞度量下，非目标CIFAR-10攻击的错误率。UAPs使用行模型构建，并根据预先训练好的列模型进行测试。

At the beginning of training, there is litle structural similarity between U(z1) and U(z2). Throughout training the SSIM score never increases beyond 0.8, while the MSE continually increases. While the structural similary of UAPs learned by a UAN is high, it does learn to generalize to multiple UAPs that are unique from one another. Similar effects, albeit scaled down due to the smaller image size, were found for the CIFAR-10 dataset.

在训练开始时，U(z1)和U(z2)在结构上没有什么相似之处。在整个培训过程中，SSIM的分数从未超过0.8，而MSE却在持续增加。虽然一个UAN学到的uap的结构相似度很高，但它确实学会了将其推广到多个彼此独特的uap。在CIFAR-10数据集中也发现了类似的效果，尽管由于图像尺寸较小而缩小了。

Does a UAN that learns to generalize to multiple UAPs do so to the detriment of attack accuracy? We verify this is not the case by training a UAN on a fixed noise vector and comparing to a UAN trained with non-fixed noise vectors. We found similar error rates for the two settings (see Table V); there is no loss in accuracy by extending a UAN to output multiple adversarial perturbations.

一个学会了推广到多个uap的UAN这样做会损害攻击的准确性吗?我们通过在固定噪声向量上训练一个UAN，并与在非固定噪声向量上训练的UAN进行比较，来验证这不是事实。我们发现这两种设置的错误率相似(见表V);通过扩展一个UAN来输出多个对抗性扰动，在准确性上没有损失。

D. Targeted Attacks

d .针对性的攻击

表V:对CIFAR-10的∞攻击的错误率。我们比较了固定噪声向量训练的UAN和固定噪声向量训练的UAN UAN对非固定噪声向量进行训练。

We follow the same experimental set-up as in Section IV-A, however now the attacker chooses a class, c, they would like the target model to classify an adversarial example as, and success is calculated as the probability that an adversarial example is classified as c. Figure 2 shows, for each class in CIFAR-10, the error rate of the target model as we allow larger perturbations. For nearly every class, attacks on ResNet-101 are most successful, while attacks on VGG-19 are least successful. This is in agreement with our findings in a non-targeted attack setting (cf. Table III). Despite VGG-19 being the most difficult target model to attack, it is the most well calibrated; the error rate on the training set is nearly identical to the error rate on the validation set for all classes, while there are small deviations between these two scores for ResNet-101 and DenseNet.

iv节中我们遵循相同的试验装置,然而现在攻击者选择一个类,c,他们希望目标模型分类一个敌对的例子,和成功的概率计算是一个敌对的例子是图2所示,分为c。在CIFAR-10每个类,目标模型的错误率,我们允许更大的扰动。对于几乎所有的职业，对ResNet-101的攻击最成功，而对VGG-19的攻击最不成功。这与我们在非目标攻击设置中的发现相一致(参见表3)。尽管VGG-19是最难以攻击的目标模型，但它是最精确校准的;对于所有类，训练集的错误率与验证集的错误率几乎相同，而ResNet-101和DenseNet的这两个得分之间有很小的偏差。

By looking only at results on VGG-19, one may infer that the choice of target class heavily influences the error rate (e.g. crafting UAP’s for the dog and ship classes is more difficult than others). However, this is not replicated with ResNet-101 or DenseNet. We do not observe any dependencies between attack success and the target class; the attack success at different perturbation rates is similar for all classes. Figure 4 shows this attack applied to a DenseNet target model for the CIFAR-10 dataset for all source/target class pairs. Nearly all attacks are indistinguishable from the source image.

通过只查看VGG-19上的结果，可以推断出目标类的选择严重影响了错误率(例如，为dog和ship类制作UAP比其他类更难)。但是，ResNet-101或DenseNet并没有复制它。我们没有观察到攻击成功与目标类之间的任何依赖关系;在不同的扰动率下，所有职业的攻击成功率都是相似的。图4显示了这种攻击应用于所有源/目标类对的CIFAR-10数据集的DenseNet目标模型。几乎所有的攻击都与源图像无法区分。

Interestingly, all targeted attacks follow a sigmoidal curve shape. Empirically, we found that for all three target models, there existed images that were weakly classified correctly (there was almost no difference between the largest probability score and probability score at the target class) and strongly classified correctly (there was three to four orders of magnitude difference between the probability score at the largest class and the probability score at the target class). At the beginning of training, the UAN discovers a perturbation that causes misclassifications when applied to the weakly classified images, but takes longer to find adversarial perturbations for the majority of images, resulting in a long tail at the beginning of training. With a similar effect taking place at the end of training to find adversarial perturbations for strongly classified images.

有趣的是，所有目标攻击都遵循s形曲线形状。经验,我们发现所有三个目标模型,存在图像弱分类正确的(几乎没有区别的最大概率评分和概率得分在目标类)和强烈的正确分类(有3到4个数量级差异得分的概率最大的类和目标类的概率评分)。在训练的开始，当应用到弱分类的图像时，UAN发现了一个导致错误分类的扰动，但是对于大多数图像需要更长的时间来发现敌对的扰动，从而导致在训练的开始出现一个长尾。在训练结束时对强分类图像的敌对干扰也有类似的效果。

E. Importance of training set size

E.训练集大小的重要性

So far, we have assumed the attacker shares full access to any images that were used to train the target model. However in practice, this may not be the case - an attacker may only have access to the type or a subsample of the training data. We therefore evaluate our non-targeted ∞ attack under stronger assumptions of attacker access to training data.

到目前为止，我们假设攻击者共享对用于训练目标模型的任何图像的完全访问权。但是在实践中，情况可能并非如此——攻击者可能只能访问训练数据的类型或子样本。因此，我们在攻击者能够访问训练数据的较强假设下评估我们的非目标∞攻击。

Figure 6 shows the error rate caused by a UAN trained on subsets of the CIFAR-10 training set. As expected, training on more data samples improves the success of the attack; perturbations from a UAN trained on only 50 images (5 from each class) fools 17.1% of validation set images in ResNet101. The attack is successful when applied to nearly a fifth of images while only learning from 0.1% of the training set. The attack succeeds in 80.2% of cases when trained on 20% of the training set - in other words, there is virtually no difference in test accuracy when training on between 80-100% of the training set.

图6显示了在CIFAR-10训练集的子集上训练的一个UAN造成的误码率。正如预期的那样，在更多的数据样本上训练可以提高攻击的成功率;在ResNet101中，来自一个只训练了50张图像(每个类5张)的UAN的干扰愚弄了17.1%的验证集图像。攻击成功时应用于近五分之一的图片只有学习训练集的0.1%。此次袭击成功80.2%的病例在训练训练集的20%——换句话说,几乎没有不同的测试精度,当训练训练集的80 - 100%之间。

We find no significant difference in error rates between a UAN that has been trained on many data samples and few data samples. The amount of data samples provided to the UAN does not significantly impact its ability to learn to craft adversarial perturbations, all that must be known is the structure of the dataset on which the target model was trained. We note that this is in agreement with Papernot et al.’s [25] findings on the number of source images required to launch attacks on black-box models.

我们发现a和b之间的错误率没有显著差异在大量数据样本和少量数据样本上训练过的UAN。提供给的数据样本的数量 UAN并不会显著影响它学习制造对抗性扰动的能力，所有必须知道的是目标模型所训练的数据集的结构。我们注意到，这与Papernot等人关于发起对黑盒模型攻击所需源图像数量的[25]研究结果一致。

In addition to measuring attacker success for different training set sizes, we experimented with different batch sizes, ranging from 16 to 128, for the CIFAR-10 dataset. However, we did not observe any significant deviations in the error rate.

除了测量针对不同训练集大小的攻击成功度，我们还对CIFAR-10数据集进行了不同批处理大小的实验，批处理大小从16到128不等。但是，我们没有观察到错误率的任何显著偏差。

V. ATTACKING ADVERSARIAL TRAINING

V.攻击对抗性训练

7:在CIFAR10上对vga -19目标模型进行无目标∞攻击和对抗性训练的猫捉老鼠游戏。上绿点为对敌图像经过对敌训练后的目标模型精度，下红叉为攻击后对敌图像的目标模型精度。虚线表示目标模型在源图像上的精度。

Adversarial training [7], [16] modifies the training of a model in order to make it more robust to adversarial examples. During training, the loss function L(θ, x, y) is replaced by $α · L(θ, x, y) + (1 − α) · L(θ, x + δ^{\prime} , y)$ . By augmenting the original data to include adversarial counterparts, the model learns to classify adversarial examples correctly. Non-generative attacks have shown to be successful against adversarially trained models, however, recent work [19] suggested that this may not be the case for UAPs. In [19], adversarial training is successfully applied to a CIFAR-10 classifier, effectively eliminating the adversarial effect of UAPs.

对抗性训练[7]，[16]修改了模型的训练以使它对对抗性例子更健壮。在训练过程中，将损失函数L(cents, x, y)替换为 $α · L(θ, x, y) + (1 − α) · L(θ, x + δ^{\prime} , y)$ 。通过对原始数据进行扩充，包括对敌的对敌，模型学会正确地对对敌的例子进行分类。非生成攻击已经证明是成功的对抗对抗训练模型，然而，最近的工作[19]表明，这可能不是UAPs的情况。在[19]中，对一个CIFAR-10分类器成功地进行了对抗性训练，有效地消除了UAPs的对抗性效果。

In our work, we verified that this is case; adversarial training eliminates UAP success. However, we find that adversarially trained models are still vulnerable to a UAN trained against the defended model.

在我们的工作中，我们证实了这一点;对抗性训练会消除UAP的成功。然而，我们发现反向训练的模型仍然容易受到针对防御模型训练的UAN的攻击。

Similarly to Hamm [8], we play a cat-and-mouse game where (1) a UAN is trained against a target model, and (2) the target model is retrained with adversial examples crafted from (1) (denoted ADV TM). This generates a sequence: UAN1 → ADV TM1 → UAN2 → ADV TM2 → UAN3 → .... We let this game play out for many rounds, and claim that if adversarial training is a defense against UAPs, over many rounds the classification error on adversarial examples should tend to zero.

与Hamm[8]类似，我们玩了一个猫捉老鼠的游戏，其中(1)针对目标模型训练一个UAN，(2)使用由(1)(表示ADV TM)制作的反面例子对目标模型进行再训练。这将生成一个序列: $UAN1\to ADV~ TM1\to UAN2 \to~ADV ~TM2\to UAN3…$ 我们让这个游戏进行了很多回合，并声称如果对抗性训练是对抗UAPs的防御，那么经过很多回合对抗性例子的分类错误应该趋于零。

Figure 7 shows such a cat-and-mouse game over 20 rounds of (1) and 20 rounds of (2). An adversarially trained target model is able to classify nearly all adversarial examples correctly, at any given round. However, attacks against adversarially retrained models are only somewhat mitigated; there is a 25% reduction is attack success between the first and final round. After this, the cycle reaches an equilibrium, with no improvement in successive attacks or defended models. We note, however, that the experimental set-up in [19] is slightly different to ours. They perform adversarial training with a strong adversary that generates data-specific perturbations and found that this makes the model robust against universal perturbations.

图7展示了这样一个超过20轮(1)和20轮(2)的猫捉老鼠游戏。一个经过对抗性训练的目标模型能够在任何一轮中正确地分类几乎所有的对抗性例子。然而，对反向再训练模型的攻击只是稍微减轻;在第一轮和最后一轮之间有25%的攻击成功率降低。在此之后，循环达到一个平衡，连续攻击或防御模型都没有改善。然而，我们注意到[19]中的实验设置与我们的略有不同。他们与一个强大的对手进行对抗性训练，从而产生数据特定的扰动，并发现这使得模型对普遍的扰动具有鲁棒性。

VI. CONCLUSION

VI. 结论

We presented a first-of-its-kind universal adversarial example attack that uses machine learning at the heart of its construction. We comprehensively evaluated the attack under many different settings, showing that it produces quality adversarial examples capable of fooling a target model in both targeted and non-targeted attacks. The attack transfers to many different target models, and improves on other state-of-the-art universal adversarial perturbation construction methods.

我们提出了一种史无前例的通用对抗性示例攻击，它使用机器学习作为其构造的核心。我们综合评估了许多不同设置下的攻击，显示它产生了高质量的对抗性例子，能够在有目标和非目标攻击中欺骗目标模型。攻击转移到许多不同的目标模型，并改进了其他先进的通用对敌摄动构造方法。

VII.Acknowledgements

VII.致谢

Jamie Hayes is funded by a Google PhD Fellowship in Machine Learning.

杰米·海斯是由谷歌博士奖学金资助的机器学习。

REFERENCES

参考文献

[1] S. Baluja and I. Fischer. Learning to attack: Adversarial transformation networks. In Proceedings of AAAI-2018, 2018.

[2] A. L. Buczak and E. Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2):1153–1176, 2016.

[3] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. arXiv preprint arXiv:1705.07263, 2017.

[4] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.

[5] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models. ArXiv e-prints, Aug. 2017.

[6] A. Demontis, P. Russu, B. Biggio, G. Fumera, and F. Roli. On Security and Sparsity of Linear Classifiers for Adversarial Settings. ArXiv e-prints, Aug. 2017.

[7] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[8] J. Hamm. Machine vs Machine: Defending Classifiers Against Learningbased Adversarial Attacks. ArXiv e-prints, Nov. 2017.

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[10] W. He, J. Wei, X. Chen, N. Carlini, and D. Song. Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong. ArXiv e-prints, June 2017.

[11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[12] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.

[13] S.-J. Kim and S. Boyd. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM Journal on Optimization, 19(3):1344–1367, 2008.

[14] J. Kos, I. Fischer, and D. Song. Adversarial examples for generative models. ArXiv e-prints, Feb. 2017.

[15] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.

[16] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[17] T. D. Lane. Machine learning techniques for the computer security domain of anomaly detection. 2000.

[18] W.-Y. Lin, Y.-H. Hu, and C.-F. Tsai. Machine learning in financial crisis prediction: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):421–436, 2012.

[19] J. H. Metzen. Universality, robustness, and detectability of adversarial perturbations under adversarial training, 2018.

[20] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. arXiv preprint arXiv:1610.08401, 2016.

[21] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

[22] K. R. Mopuri, U. Garg, and R. V. Babu. Fast feature fool: A data independent approach to universal adversarial perturbations. arXiv preprint arXiv:1707.05572, 2017.

[23] Z. Obermeyer and E. J. Emanuel. Predicting the futurebig data, machine learning, and clinical medicine. The New England journal of medicine, 375(13):1216, 2016.

[24] N. Papernot, P. McDaniel, and I. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.

[25] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016.

[26] T. N. Pappas and R. J. Safranek. Perceptual criteria for image quality evaluation.

[27] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. Computer Vision–ECCV 2006, pages 430–443, 2006.

[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[29] M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G. S. Pinkus, et al. Diffuse large b-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature medicine, 8(1):68–74, 2002.

[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[31] S. Sivaraman and M. M. Trivedi. Active learning for on-road vehicle detection: A comparative study. Machine vision and applications, pages 1–13, 2014.

[32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[33] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[34] T. B. Trafalis and H. Ince. Support vector machine for regression and applications to financial forecasting. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, volume 6, pages 348–353. IEEE, 2000.

[35] F. Tramer, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. ` The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.

[36] F. Tramer, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing ` machine learning models via prediction apis. In USENIX Security Symposium, pages 601–618, 2016.

[37] Z. Wang and A. C. Bovik. Mean squared error: Love it or leave it? a new look at signal fidelity measures. IEEE signal processing magazine, 26(1):98–117, 2009.

[38] X. Wen, L. Shao, Y. Xue, and W. Fang. A rapid learning algorithm for vehicle classification. Information Sciences, 295:395–406, 2015.

[39] Q.-H. Ye, L.-X. Qin, M. Forgues, P. He, J. W. Kim, A. C. Peng, R. Simon, Y. Li, A. I. Robles, Y. Chen, et al. Predicting hepatitis b virus-positive metastatic hepatocellular carcinomas using gene expression profiling and supervised machine learning. Nature medicine, 9(4):416, 2003.

[40] F. Zhang, P. P. Chan, B. Biggio, D. S. Yeung, and F. Roli. Adversarial feature selection against evasion attacks. IEEE transactions on cybernetics, 46(3):766–777, 2016.

PreviousDefense- gan:使用生成模型保护分类器免受敌方攻击 Next抵御普遍的对抗性干扰

Last updated 4 years ago

Was this helpful?