ADVERSARIAL EXAMPLES IN THE PHYSICAL WORLD
https://arxiv.org/pdf/1607.02533.pdf
GB/T 7714 Kurakin A, Goodfellow I, Bengio S. Adversarial examples in the physical world[J]. arXiv preprint arXiv:1607.02533, 2016.
MLA Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. "Adversarial examples in the physical world." arXiv preprint arXiv:1607.02533 (2016).
APA Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. Up to now, all previous work has assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those which are using signals from cameras and other sensors as input. This paper shows that even in such physical world scenarios, machine learning systems are vulnerable to adversarial examples. We demonstrate this by feeding adversarial images obtained from a cell-phone camera to an ImageNet Inception classifier and measuring the classification accuracy of the system. We find that a large fraction of adversarial examples are classified incorrectly even when perceived through the camera.
Figure 1: Demonstration of a black box attack (in which the attack is constructed without access to the model) on a phone app for image classification using physical adversarial examples. We took a clean image from the dataset (a) and used it to generate adversarial images with various sizes of adversarial perturbation ε. Then we printed clean and adversarial images and used the TensorFlow Camera Demo app to classify them. A clean image (b) is recognized correctly as a “washer” when perceived through the camera, while adversarial images (c) and (d) are misclassified. See video of full demo at https://youtu.be/zQ_uMenoBCk.
However, machine learning models are often vulnerable to adversarial manipulation of their input intended to cause incorrect classification (Dalvi et al., 2004). In particular, neural networks and many other categories of machine learning models are highly vulnerable to attacks based on small modifications of the input to the model at test time (Biggio et al., 2013; Szegedy et al., 2014; Goodfellow et al., 2014; Papernot et al., 2016b).
The problem can be summarized as follows. Let’s say there is a machine learning system M and an input sample C which we call a clean example. Let’s assume that sample C is correctly classified by the machine learning system, i.e. M(C) = y_true. It’s possible to construct an adversarial example A which is perceptually indistinguishable from C but is classified incorrectly, i.e. M(A) ≠ y_true. These adversarial examples are misclassified far more often than examples that have been perturbed by noise, even if the magnitude of the noise is much larger than the magnitude of the adversarial perturbation (Szegedy et al., 2014).
Adversarial examples pose potential security threats for practical machine learning applications. In particular, Szegedy et al. (2014) showed that an adversarial example that was designed to be misclassified by a model M1 is often also misclassified by a model M2. This adversarial example transferability property means that it is possible to generate adversarial examples and perform a misclassification attack on a machine learning system without access to the underlying model. Papernot et al. (2016a) and Papernot et al. (2016b) demonstrated such attacks in realistic scenarios.
However all prior work on adversarial examples for neural networks made use of a threat model in which the attacker can supply input directly to the machine learning model. Prior to this work, it was not known whether adversarial examples would remain misclassified if the examples were constructed in the physical world and observed through a camera.
Such a threat model can describe some scenarios in which attacks can take place entirely within a computer, such as evading spam filters or malware detectors (Biggio et al., 2013; Nelson et al.). However, many practical machine learning systems operate in the physical world. Possible examples include but are not limited to: robots perceiving the world through cameras and other sensors, video surveillance systems, and mobile applications for image or sound classification. In such scenarios the adversary cannot rely on the ability to make fine-grained per-pixel modifications of the input data. The following question thus arises: is it still possible to craft adversarial examples and perform adversarial attacks on machine learning systems which are operating in the physical world and perceiving data through various sensors, rather than a digital representation?
Some prior work has addressed the problem of physical attacks against machine learning systems, but not in the context of fooling neural networks by making very small perturbations of the input. For example, Carlini et al. (2016) demonstrate an attack that can create audio inputs that mobile phones recognize as containing intelligible voice commands, but that humans hear as an unintelligible voice. Face recognition systems based on photos are vulnerable to replay attacks, in which a previously captured image of an authorized user’s face is presented to the camera instead of an actual face (Smith et al., 2015). Adversarial examples could in principle be applied in either of these physical domains. An adversarial example for the voice command domain would consist of a recording that seems to be innocuous to a human observer (such as a song) but contains voice commands recognized by a machine learning algorithm. An adversarial example for the face recognition domain might consist of very subtle markings applied to a person’s face, so that a human observer would recognize their identity correctly, but a machine learning system would recognize them as being a different person. The most similar work to this paper is Sharif et al. (2016), which appeared publicly after our work but had been submitted to a conference earlier. Sharif et al. (2016) also print images of adversarial examples on paper and demonstrated that the printed images fool image recognition systems when photographed. The main differences between their work and ours are that: (1) we use a cheap closed-form attack for most of our experiments, while Sharif et al. (2016) use a more expensive attack based on an optimization algorithm, (2) we make no particular effort to modify our adversarial examples to improve their chances of surviving the printing and photography process. We simply make the scientific observation that very many adversarial examples do survive this process without any intervention. Sharif et al. (2016) introduce extra features to make their attacks work as best as possible for practical attacks against face recognition systems. (3) Sharif et al. (2016) are restricted in the number of pixels they can modify (only those on the glasses frames) but can modify those pixels by a large amount; we are restricted in the amount we can modify a pixel but are free to modify all of them.
To investigate the extent to which adversarial examples survive in the physical world, we conducted an experiment with a pre-trained ImageNet Inception classifier (Szegedy et al., 2015). We generated adversarial examples for this model, then we fed these examples to the classifier through a cellphone camera and measured the classification accuracy. This scenario is a simple physical world system which perceives data through a camera and then runs image classification. We found that a large fraction of adversarial examples generated for the original model remain misclassified even when perceived through a camera.
Surprisingly, our attack methodology required no modification to account for the presence of the camera—the simplest possible attack of using adversarial examples crafted for the Inception model resulted in adversarial examples that successfully transferred to the union of the camera and Inception. Our results thus provide a lower bound on the attack success rate that could be achieved with more specialized attacks that explicitly model the camera while crafting the adversarial example.
One limitation of our results is that we have assumed a threat model under which the attacker has full knowledge of the model architecture and parameter values. This is primarily so that we can use a single Inception v3 model in all experiments, without having to devise and train a different high-performing model. The adversarial example transfer property implies that our results could be extended trivially to the scenario where the attacker does not have access to the model description (Szegedy et al., 2014; Goodfellow et al., 2014; Papernot et al., 2016b). While we haven’t run detailed experiments to study the transferability of physical adversarial examples, we were able to build a simple phone application to demonstrate a potential adversarial black box attack in the physical world; see Figure 1.
To better understand how the non-trivial image transformations caused by the camera affect adversarial example transferability, we conducted a series of additional experiments where we studied how adversarial examples transfer across several specific kinds of synthetic image transformations.
The rest of the paper is structured as follows: In Section 2, we review different methods which we used to generate adversarial examples. This is followed in Section 3 by details about our “physical world” experimental set-up and results. Finally, Section 4 describes our experiments with various artificial image transformations (like changing brightness, contrast, etc...) and how they affect adversarial examples.
This section describes the different methods we used to generate adversarial examples in our experiments. It is important to note that none of the described methods guarantees that a generated image will be misclassified. Nevertheless, we call all of the generated images “adversarial images”.
In the remainder of the paper we use the following notation:
• X – an image, which is typically a 3-D tensor (width × height × depth). In this paper, we assume that the values of the pixels are integer numbers in the range [0, 255].
• y_true – the true class for the image X.
• J(X, y) – the cross-entropy cost function of the neural network, given image X and class y. We intentionally omit the network weights (and other parameters) θ from the cost function because we assume they are fixed (to the values resulting from training the machine learning model) in the context of this paper. For neural networks with a softmax output layer, the cross-entropy cost function applied to integer class labels equals the negative log-probability of the true class given the image: J(X, y) = − log p(y|X); this relationship will be used below.
• Clip_{X,ε}{X′} – a function which performs per-pixel clipping of the image X′, so the result will be in an L∞ ε-neighbourhood of the source image X (a NumPy sketch of this clipping follows the notation list). The exact clipping equation is as follows:
Clip_{X,\epsilon}\{X'\}(x,y,z)=\min\bigl\{255,\ X(x,y,z)+\epsilon,\ \max\{0,\ X(x,y,z)-\epsilon,\ X'(x,y,z)\}\bigr\}
where X(x, y, z) is the value of channel z of the image X at coordinates (x, y).
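To make the notation concrete, here is a minimal NumPy sketch of the two quantities above. It is illustrative only: the function names (`clip_eps`, `cross_entropy`) and the use of raw logits are our own assumptions, not part of the paper.

```python
import numpy as np

def clip_eps(x_prime, x, eps):
    """Per-pixel clipping: keep x_prime inside the L-infinity eps-neighbourhood
    of the source image x and inside the valid pixel range [0, 255]."""
    return np.clip(x_prime, np.maximum(0.0, x - eps), np.minimum(255.0, x + eps))

def cross_entropy(logits, y_true):
    """J(X, y): negative log-probability of the true class for a network
    with a softmax output layer, computed from a 1-D vector of logits."""
    z = logits - logits.max()                # subtract max for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax
    return -log_probs[y_true]                # J(X, y) = -log p(y | X)
```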
One of the simplest methods to generate adversarial images, described in (Goodfellow et al., 2014), is motivated by linearizing the cost function and solving for the perturbation that maximizes the cost subject to an L∞ constraint. This may be accomplished in closed form, for the cost of one call to back-propagation:
X^{adv}=X+\epsilon\,\mathrm{sign}\bigl(\nabla_{X}J(X,y_{true})\bigr)\tag{1}
where ε is a hyper-parameter to be chosen.
In this paper we refer to this method as “fast” because it does not require an iterative procedure to compute adversarial examples, and thus is much faster than other considered methods.
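A sketch of the fast method under stated assumptions: `grad_fn` is a hypothetical oracle returning ∇_X J(X, y), which in practice would come from one back-propagation pass through the trained network.

```python
import numpy as np

def fast_method(x, y_true, grad_fn, eps):
    """One-step 'fast' attack of Eq. (1): X_adv = X + eps * sign(grad_X J(X, y_true)).

    x       : clean image as a float array with values in [0, 255]
    y_true  : integer label of the clean image
    grad_fn : hypothetical callable returning dJ(X, y)/dX (e.g. via autodiff)
    eps     : perturbation size, e.g. 16
    """
    x_adv = x + eps * np.sign(grad_fn(x, y_true))
    return np.clip(x_adv, 0.0, 255.0)  # keep pixel values in the valid range
```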
We introduce a straightforward way to extend the “fast” method: we apply it multiple times with a small step size, and clip pixel values of intermediate results after each step to ensure that they are in an ε-neighbourhood of the original image:
X_{0}^{adv}=X,\quad X_{N+1}^{adv}=Clip_{X,\epsilon}\bigl\{X_{N}^{adv}+\alpha\,\mathrm{sign}\bigl(\nabla_{X}J(X_{N}^{adv},y_{true})\bigr)\bigr\}\tag{2}
In our experiments we used α = 1, i.e. we changed the value of each pixel by only 1 on each step. We selected the number of iterations to be min(ε + 4, 1.25ε). This number of iterations was chosen heuristically; it is sufficient for the adversarial example to reach the edge of the ε max-norm ball but restricted enough to keep the computational cost of the experiments manageable.
Below we refer to this method as the “basic iterative” method.
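A sketch of the basic iterative method, reusing the same hypothetical gradient oracle as above; the iteration count follows the min(ε + 4, 1.25ε) heuristic described in the text.

```python
import numpy as np

def basic_iterative_method(x, y_true, grad_fn, eps, alpha=1.0):
    """Iterative attack of Eq. (2): repeated small signed-gradient steps,
    projected back into the eps-neighbourhood of the clean image x."""
    n_iter = int(min(eps + 4, 1.25 * eps))    # heuristic from the paper
    x_adv = x.astype(np.float64).copy()
    for _ in range(n_iter):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, y_true))
        # Clip_{X,eps}: per-pixel clip into [x - eps, x + eps] and [0, 255]
        x_adv = np.clip(x_adv, np.maximum(0.0, x - eps), np.minimum(255.0, x + eps))
    return x_adv
```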
Both methods we have described so far simply try to increase the cost of the correct class, without specifying which of the incorrect classes the model should select. Such methods are sufficient for application to datasets such as MNIST and CIFAR-10, where the number of classes is small and all classes are highly distinct from each other. On ImageNet, with a much larger number of classes and varying degrees of significance in the differences between classes, these methods can result in uninteresting misclassifications, such as mistaking one breed of sled dog for another breed of sled dog. In order to create more interesting mistakes, we introduce the iterative least-likely class method. This iterative method tries to make an adversarial image which will be classified as a specific desired target class. For the desired class we chose the least-likely class according to the prediction of the trained network on image X:
y_{LL}=\arg\min_{y}\bigl\{p(y|X)\bigr\}.\tag{3}
For a well-trained classifier, the least-likely class is usually highly dissimilar from the true class, so this attack method results in more interesting mistakes, such as mistaking a dog for an airplane.
To make an adversarial image which is classified as y_LL, we maximize log p(y_LL|X) by making iterative steps in the direction of sign(∇_X log p(y_LL|X)). For neural networks with cross-entropy loss this last expression equals sign(−∇_X J(X, y_LL)). Thus we have the following procedure:
X_{0}^{adv}=X,\quad X_{N+1}^{adv}=Clip_{X,\epsilon}\bigl\{X_{N}^{adv}-\alpha\,\mathrm{sign}\bigl(\nabla_{X}J(X_{N}^{adv},y_{LL})\bigr)\bigr\}\tag{4}
For this iterative procedure we used the same α and the same number of iterations as for the basic iterative method.
Below we refer to this method as the “least likely class” method, or “l.l. class” for short.
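A sketch of the iterative least-likely class method; `probs_fn` (softmax probabilities) and `grad_fn` (cost gradient for a given label) are hypothetical stand-ins for the trained classifier.

```python
import numpy as np

def least_likely_class_method(x, probs_fn, grad_fn, eps, alpha=1.0):
    """Iterative least-likely class attack of Eqs. (3)-(4): pick the class the
    network finds least probable for x, then descend the cost toward it."""
    y_ll = int(np.argmin(probs_fn(x)))        # Eq. (3): least-likely class
    n_iter = int(min(eps + 4, 1.25 * eps))
    x_adv = x.astype(np.float64).copy()
    for _ in range(n_iter):
        # Eq. (4): minus sign, because we *decrease* the cost J(X, y_ll)
        x_adv = x_adv - alpha * np.sign(grad_fn(x_adv, y_ll))
        x_adv = np.clip(x_adv, np.maximum(0.0, x - eps), np.minimum(255.0, x + eps))
    return x_adv
```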
Figure 2: Top-1 and top-5 accuracy of Inception v3 under attack by different adversarial methods and different ε, compared to “clean images” (unmodified images from the dataset). The accuracy was computed on all 50,000 validation images from the ImageNet dataset. In these experiments ε varies from 2 to 128.
As mentioned above, it is not guaranteed that an adversarial image will actually be misclassified— sometimes the attacker wins, and sometimes the machine learning model wins. We did an experimental comparison of adversarial methods to understand the actual classification accuracy on the generated images as well as the types of perturbations exploited by each of the methods.
The experiments were performed on all 50,000 validation samples from the ImageNet dataset (Russakovsky et al., 2014) using a pre-trained Inception v3 classifier (Szegedy et al., 2015). For each validation image, we generated adversarial examples using the different methods and different values of ε. For each pair of method and ε, we computed the classification accuracy on all 50,000 images. We also computed the accuracy on all clean images, which we used as a baseline.
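The evaluation itself is only bookkeeping; a possible top-k accuracy helper (the name is our own) might look like the sketch below, applied once to the clean logits and once to the logits of each (method, ε) batch of adversarial images.

```python
import numpy as np

def top_k_accuracy(logits, labels, k=1):
    """Fraction of images whose true label is among the k highest-scoring classes.
    logits: array of shape (n_images, n_classes); labels: shape (n_images,)."""
    top_k = np.argsort(logits, axis=1)[:, -k:]   # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())
```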
Top-1 and top-5 classification accuracy on clean and adversarial images for the various adversarial methods is summarized in Figure 2. Examples of generated adversarial images can be found in the Appendix, in Figures 5 and 4.
As shown in Figure 2, the fast method decreases top-1 accuracy by a factor of two and top-5 accuracy by about 40% even with the smallest values of ε. As we increase ε, accuracy on adversarial images generated by the fast method stays at approximately the same level until ε = 32 and then slowly decreases to almost 0 as ε grows to 128. This could be explained by the fact that the fast method adds ε-scaled noise to each image, so higher values of ε essentially destroy the content of the image and make it unrecognisable even by humans, see Figure 5.
On the other hand, iterative methods exploit much finer perturbations which do not destroy the image even at higher ε, and at the same time confuse the classifier at a higher rate. The basic iterative method is able to produce better adversarial images when ε < 48; however, as we increase ε further it is unable to improve. The “least likely class” method destroys the correct classification of most images even when ε is relatively small.
We limit all further experiments to ε ≤ 16 because such perturbations are perceived only as a small noise (if perceived at all), and adversarial methods are able to produce a significant number of misclassified examples in this ε-neighbourhood of clean images.
To study the influence of arbitrary transformations on adversarial images we introduce the notion of destruction rate. It can be described as the fraction of adversarial images which are no longer misclassified after the transformations. The formal definition is the following:
d=\frac{\sum_{k=1}^{n}C(X^{k},y_{true}^{k})\,\overline{C}(X_{adv}^{k},y_{true}^{k})\,C(T(X_{adv}^{k}),y_{true}^{k})}{\sum_{k=1}^{n}C(X^{k},y_{true}^{k})\,\overline{C}(X_{adv}^{k},y_{true}^{k})}\tag{5}
where n is the number of images used to compute the destruction rate, X^k is an image from the dataset, y^k_true is the true class of this image, and X^k_adv is the corresponding adversarial image. The function T(·) is an arbitrary image transformation; in this article, we study a variety of transformations, including printing the image and taking a photo of the result. The function C(X, y) is an indicator function which returns whether the image was classified correctly:
C(X,y)=\begin{cases}1, & \text{if image } X \text{ is classified as } y;\\ 0, & \text{otherwise.}\end{cases}\tag{6}
We denote the binary negation of this indicator value as C̄(X, y), which is computed as C̄(X, y) = 1 − C(X, y).
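Given per-image correctness indicators, the destruction rate of Eq. (5) reduces to a few boolean operations. The sketch below assumes the three indicator arrays have been computed elsewhere; the function name is our own.

```python
import numpy as np

def destruction_rate(clean_correct, adv_correct, adv_transformed_correct):
    """Destruction rate d of Eq. (5). Each argument is a boolean array over the
    n images: C(X^k, y^k), C(X^k_adv, y^k) and C(T(X^k_adv), y^k) respectively."""
    clean_correct = np.asarray(clean_correct, dtype=bool)
    adv_correct = np.asarray(adv_correct, dtype=bool)
    adv_transformed_correct = np.asarray(adv_transformed_correct, dtype=bool)
    # numerator: clean classified correctly, adversarial image fooled the model,
    # and the transformation restored the correct classification
    num = (clean_correct & ~adv_correct & adv_transformed_correct).sum()
    # denominator: clean classified correctly and adversarial image fooled the model
    den = (clean_correct & ~adv_correct).sum()
    return num / den if den > 0 else 0.0
```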
Figure 3: Experimental setup: (a) generated printout which contains pairs of clean and adversarial images, as well as QR codes to help automatic cropping; (b) photo of the printout made by a cellphone camera; (c) automatically cropped image from the photo.
To explore the possibility of physical adversarial examples we ran a series of experiments with photos of adversarial examples. We printed clean and adversarial images, took photos of the printed pages, and cropped the printed images from the photos of the full page. We can think of this as a black box transformation that we refer to as “photo transformation”.
We computed the accuracy on clean and adversarial images before and after the photo transformation as well as the destruction rate of adversarial images subjected to photo transformation.
The experimental procedure was as follows:
1. Print the image, see Figure 3a. In order to reduce the amount of manual work, we printed multiple pairs of clean and adversarial examples on each sheet of paper. Also, QR codes were put into the corners of the printout to facilitate automatic cropping.
All generated pictures of printouts (Figure 3a) were saved in lossless PNG format.
Batches of PNG printouts were converted to a multi-page PDF file using the convert tool from the ImageMagick suite with the default settings: convert *.png output.pdf
Generated PDF files were printed using a Ricoh MP C5503 office printer. Each page of the PDF file was automatically scaled to fit the entire sheet of paper using the default printer scaling. The printer resolution was set to 600 dpi.
2. Take a photo of the printed image using a cell phone camera (Nexus 5x), see Figure 3b.
3. Automatically crop and warp the validation examples from the photo, so that they become squares of the same size as the source images, see Figure 3c (an OpenCV sketch of this step follows the procedure):
Detect values and locations of four QR codes in the corners of the photo. The QR codes encode which batch of validation examples is shown on the photo. If detection of any of the corners failed, the entire photo was discarded and images from the photo were not used to calculate accuracy. We observed that no more than 10% of all images were discarded in any experiment and typically the number of discarded images was about 3% to 6%.
Warp the photo using a perspective transform to move the QR codes to pre-defined coordinates.
After the image is warped, each example has known coordinates and can easily be cropped from the image.
4. Run classification on the transformed and source images. Compute the accuracy and destruction rate of adversarial images.
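A rough OpenCV sketch of step 3 under several assumptions of our own: the warped page size, the target QR-code coordinates, and the crop boxes are placeholders, and matching each detected code to its corner by its decoded value is omitted for brevity.

```python
import cv2
import numpy as np

def crop_examples(photo, qr_targets, crop_boxes, page_size=(2000, 2000)):
    """Find the four corner QR codes, warp the photo so the codes land on the
    pre-defined coordinates qr_targets (4x2, float32), then cut out each example
    by its known (x, y, size) box. Returns None if detection fails, in which case
    the whole photo is discarded as in the paper."""
    detector = cv2.QRCodeDetector()
    ok, decoded, points, _ = detector.detectAndDecodeMulti(photo)
    if not ok or points is None or len(points) < 4:
        return None                                     # corner detection failed
    # centre of each detected code; a real pipeline would order them by `decoded`
    centres = points.reshape(-1, 4, 2).mean(axis=1)[:4].astype(np.float32)
    M = cv2.getPerspectiveTransform(centres, np.asarray(qr_targets, np.float32))
    warped = cv2.warpPerspective(photo, M, page_size)
    return [warped[y:y + s, x:x + s] for (x, y, s) in crop_boxes]
```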
This procedure involves manually taking photos of the printed pages, without careful control of lighting, camera angle, distance to the page, etc. This is intentional; it introduces nuisance variability that has the potential to destroy adversarial perturbations that depend on subtle, fine co-adaptation of exact pixel values. That being said, we did not intentionally seek out extreme camera angles or lighting conditions. All photos were taken in normal indoor lighting with the camera pointed approximately straight at the page.
For each combination of adversarial example generation method and ε we conducted two sets of experiments:
Average case. To measure the average case performance, we randomly selected 102 images to use in one experiment with a given ε and adversarial method. This experiment estimates how often an adversary would succeed on randomly chosen photos: the world chooses an image randomly, and the adversary attempts to cause it to be misclassified.
Prefiltered case. To study a more aggressive attack, we performed experiments in which the images are prefiltered. Specifically, we selected 102 images such that all clean images are classified correctly, and all adversarial images (before photo transformation) are classified incorrectly (both top-1 and top-5). In addition we used a confidence threshold for the top prediction: p(y_predicted|X) ≥ 0.8, where y_predicted is the class predicted by the network for image X. This experiment measures how often an adversary would succeed when the adversary can choose the original image to attack. Under our threat model, the adversary has access to the model parameters and architecture, so the attacker can always run inference to determine whether an attack will succeed in the absence of photo transformation. The attacker might expect to do best by choosing to make attacks that succeed in this initial condition. The victim then takes a new photo of the physical object that the attacker chooses to display, and the photo transformation can either preserve the attack or destroy it.
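A sketch of the prefiltering step, assuming logits for the clean and (pre-transformation) adversarial versions of each candidate image have already been computed; the helper name and the application of the confidence threshold to the clean prediction are our reading of the text.

```python
import numpy as np

def prefilter_indices(clean_logits, adv_logits, labels, conf_threshold=0.8):
    """Select images whose clean version is classified correctly with confidence
    >= conf_threshold and whose adversarial version misses the true label even
    in the top 5 (and therefore also in the top 1)."""
    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    probs = softmax(clean_logits)
    clean_ok = (probs.argmax(axis=1) == labels) & (probs.max(axis=1) >= conf_threshold)

    adv_top5 = np.argsort(adv_logits, axis=1)[:, -5:]
    adv_fooled = ~(adv_top5 == labels[:, None]).any(axis=1)
    return np.nonzero(clean_ok & adv_fooled)[0]
```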
Results of the photo transformation experiment are summarized in Tables 1, 2 and 3.
We found that “fast” adversarial images are more robust to photo transformation compared to iterative methods. This could be explained by the fact that iterative methods exploit more subtle kind of perturbations, and these subtle perturbations are more likely to be destroyed by photo transformation.
One unexpected result is that in some cases the adversarial destruction rate in the “prefiltered case” was higher compared to the “average case”. In the case of the iterative methods, even the total success rate was lower for prefiltered images than for randomly selected images. This suggests that, to obtain very high confidence, iterative methods often make subtle co-adaptations that are not able to survive photo transformation.
Overall, the results show that some fraction of adversarial examples stays misclassified even after a non-trivial transformation: the photo transformation. This demonstrates the possibility of physical adversarial examples. For example, an adversary using the fast method with ε = 16 could expect that about 2/3 of the images would be top-1 misclassified and about 1/3 of the images would be top-5 misclassified. Thus by generating enough adversarial images, the adversary could expect to cause far more misclassifications than would occur on natural inputs.
The experiments described above study physical adversarial examples under the assumption that the adversary has full access to the model (i.e. the adversary knows the architecture, model weights, etc.). However, the black box scenario, in which the attacker does not have access to the model, is a more realistic model of many security threats. Because adversarial examples often transfer from one model to another, they may be used for black box attacks (Szegedy et al., 2014; Papernot et al., 2016a). As our own black box attack, we demonstrated that our physical adversarial examples fool a different model than the one that was used to construct them. Specifically, we showed that they fool the open source TensorFlow Camera Demo, an app for mobile phones which performs image classification on-device. We showed several printed clean and adversarial images to this app and observed the classification change from the true label to an incorrect label. A video of the demo is available at https://youtu.be/zQ_uMenoBCk. We also demonstrated this effect live at GeekPwn 2016.
The transformations applied to images by the process of printing them, photographing them, and cropping them could be considered as some combination of much simpler image transformations. Thus to better understand what is going on we conducted a series of experiments to measure the adversarial destruction rate on artificial image transformations. We explored the following set of transformations: change of contrast and brightness, Gaussian blur, Gaussian noise, and JPEG encoding.
For this set of experiments we used a subset of 1,000 images randomly selected from the validation set. This subset of 1,000 images was selected once, so all experiments in this section used the same subset of images. We performed experiments for multiple pairs of adversarial method and transformation. For each given pair of transformation and adversarial method, we computed adversarial examples, applied the transformation to the adversarial examples, and then computed the destruction rate according to Equation (5).
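A sketch of the synthetic transformations using Pillow and NumPy; the parameter names and levels are illustrative choices, not the exact settings used in the paper.

```python
import io
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def apply_transform(img_array, kind, level):
    """Apply one synthetic transformation to a uint8 RGB image array."""
    img = Image.fromarray(img_array)
    if kind == "brightness":                    # level is a multiplier, e.g. 0.8 or 1.2
        out = ImageEnhance.Brightness(img).enhance(level)
    elif kind == "contrast":
        out = ImageEnhance.Contrast(img).enhance(level)
    elif kind == "blur":                        # level is the Gaussian radius in pixels
        out = img.filter(ImageFilter.GaussianBlur(radius=level))
    elif kind == "noise":                       # level is the Gaussian noise std. dev.
        noisy = img_array.astype(np.float32) + np.random.normal(0.0, level, img_array.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)
    elif kind == "jpeg":                        # level is the JPEG quality, e.g. 75
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=int(level))
        out = Image.open(io.BytesIO(buf.getvalue()))
    else:
        raise ValueError(f"unknown transformation: {kind}")
    return np.array(out)
```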
Detailed results for the various transformations and adversarial methods with ε = 16 can be found in Figure 6 in the Appendix. The following general observations can be drawn from these experiments:
Adversarial examples generated by the fast method are the most robust to transformations, and adversarial examples generated by the iterative least-likely class method are the least robust. This coincides with our results on photo transformation.
The top-5 destruction rate is typically higher than top-1 destruction rate. This can be explained by the fact that in order to “destroy” top-5 adversarial examples, a transformation has to push the correct class labels into one of the top-5 predictions. However in order to destroy top-1 adversarial examples we have to push the correct label to be top-1 prediction, which is a strictly stronger requirement.
Changing brightness and contrast does not affect adversarial examples much. The destruction rate on fast and basic iterative adversarial examples is less than 5%, and for the iterative least-likely class method it is less than 20%.
Blur, noise and JPEG encoding have a higher destruction rate than changes of brightness and contrast. In particular, the destruction rate for iterative methods could reach 80% − 90%. However none of these transformations destroy 100% of adversarial examples, which coincides with the “photo transformation” experiment.
In this paper we explored the possibility of creating adversarial examples for machine learning systems which operate in the physical world. We used images taken from a cell-phone camera as an input to an Inception v3 image classification neural network. We showed that in such a set-up, a significant fraction of adversarial images crafted using the original network are misclassified even when fed to the classifier through the camera. This finding demonstrates the possibility of adversarial examples for machine learning systems in the physical world. In future work, we expect that it will be possible to demonstrate attacks using other kinds of physical objects besides images printed on paper, attacks against different kinds of machine learning systems, such as sophisticated reinforcement learning agents, attacks performed without access to the model’s parameters and architecture (presumably using the transfer property), and physical attacks that achieve a higher success rate by explicitly modeling the physical transformation during the adversarial example construction process. We also hope that future work will develop effective methods for defending against such attacks.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR’2015, arXiv:1409.0473, 2015.
Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer, 2013.
Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou. Hidden voice commands. In 25th USENIX Security Symposium (USENIX Security 16), Austin, TX, August 2016. USENIX Association. URL https://www.usenix.org/conference/usenixsecurity16/ technical-sessions/presentation/carlini.
Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, Deepak Verma, et al. Adversarial classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 99–108. ACM, 2004.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014. URL http://arxiv.org/abs/1412.6572.
Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS’2012). 2012.
Blaine Nelson, Marco Barreno, Fuching Jack Chi, Anthony D Joseph, Benjamin IP Rubinstein, Udam Saini, Charles A Sutton, J Doug Tygar, and Kai Xia. Exploiting machine learning to subvert your spam filter.
N. Papernot, P. McDaniel, and I. Goodfellow. Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples. ArXiv e-prints, May 2016b. URL http://arxiv.org/abs/1605.07277.
Nicolas Papernot, Patrick Drew McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. CoRR, abs/1602.02697, 2016a. URL http://arxiv.org/abs/1602.02697.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575, 2014.
Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 23rd ACM SIGSAC Conference on Computer and Communications Security, October 2016. To appear.
Daniel F Smith, Arnold Wiliem, and Brian C Lovell. Face recognition on consumer devices: Reflections on replay attacks. IEEE Transactions on Information Forensics and Security, 10(4): 736–745, 2015.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ICLR, abs/1312.6199, 2014. URL http://arxiv.org/abs/1312.6199.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567.
The Appendix contains the following figures:
Figure 4 with examples of adversarial images produced by different adversarial methods.
Figure 5 with examples of adversarial images for various values of ε.
Figure 6 with plots of adversarial destruction rates for various image transformations.
Figure 4: Comparison of different adversarial methods with ε = 32. Perturbations generated by iterative methods are finer compared to the fast method. Also, iterative methods do not always select a point on the border of the ε-neighbourhood as an adversarial image.
Figure 5: Comparison of images resulting from an adversarial perturbation produced by the “fast” method with different sizes of perturbation ε. The top image is a “washer” while the bottom one is a “hamster”. In both cases the clean images are classified correctly and the adversarial images are misclassified for all considered ε.
Figure 6: Comparison of adversarial destruction rates for various adversarial methods and types of transformations. All experiments were done with ε = 16.