
The Limitations of Deep Learning in Adversarial Settings

Original link:

https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7467366

GB/T 7714 Papernot N, McDaniel P, Jha S, et al. The limitations of deep learning in adversarial settings[C]//2016 IEEE European symposium on security and privacy (EuroS&P). IEEE, 2016: 372-387.

MLA Papernot, Nicolas, et al. "The limitations of deep learning in adversarial settings." 2016 IEEE European symposium on security and privacy (EuroS&P). IEEE, 2016.

APA Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016, March). The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P) (pp. 372-387). IEEE.

Abstract


Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification.


1. Introduction


Large neural networks, recast as deep neural networks (DNNs) in the mid 2000s, altered the machine learning landscape by outperforming other approaches in many tasks. This was made possible by advances reducing the computational complexity of training [19]. For instance, Deep Learning (DL) can now take advantage of large datasets to achieve accuracy rates higher than previous classification techniques. In short, DL is transforming computational processing of data in many domains such as vision [24], [36], speech recognition [14], [32], language processing [12], financial fraud detection [23], and malware detection [13].


This increasing use of deep learning is creating incentives for adversaries to manipulate deep neural networks so as to force misclassification of inputs. For instance, applications of deep learning use image classifiers to distinguish inappropriate from appropriate content, and text and image classifiers to differentiate between SPAM and non-SPAM email. An adversary able to craft misclassified inputs would profit from evading detection–indeed such attacks occur today on non-DL classification systems [5], [6], [21]. In the physical domain, consider a driverless car that uses deep learning to identify traffic signs [11]. If slightly altering “STOP” signs causes DNNs to misclassify them, the car will not stop, thus subverting the car’s safety.


An adversarial sample is an input crafted to cause learning algorithms to misclassify. Note that adversarial samples are created at test time, after the DNN has been trained by the defender, and do not require any alteration of the training process. Figure 1 shows examples of adversarial samples taken from our validation experiments. It shows how an image originally showing a digit can be altered to force a DNN to classify it as another digit. Adversarial samples are created from benign samples by adding distortions exploiting the imperfect generalization learned by DNNs from finite training sets [3], and the underlying linearity of most components used to build DNNs [17]. Previous work explored DNN properties that could be used to craft adversarial samples [17], [30], [35]. Simply put, these techniques exploit gradients computed by training algorithms: instead of using these gradients to update DNN parameters as would normally be done, gradients are used to update the original input itself, which is subsequently misclassified by DNNs.


In this paper, we describe a new class of algorithms for adversarial sample creation against any feedforward (acyclic) DNN [31] and formalize the threat model space of deep learning with respect to the integrity of output classification. Unlike previous approaches mentioned above, we compute a direct mapping from the input to the output to achieve an explicit adversarial goal. Furthermore, our approach only alters a (frequently small) fraction of input features leading to reduced perturbation of the source inputs. It also enables adversaries to apply heuristic searches to find perturbations leading to targeted misclassifications (perturbing inputs to result in a specific output classification).


More formally, a DNN models a multidimensional function F : X → Y where X is a (raw) feature vector and Y is an output vector. We construct an adversarial sample X∗ from a benign sample X by adding a perturbation vector δX solving the following optimization problem:

\arg\min_{\delta X} \lVert \delta X \rVert \ \text{ s.t. } \ F(X + \delta X) = Y^{*}


where X∗ = X + δX is the adversarial sample, Y∗ is the desired adversarial output, and \lVert \cdot \rVert a norm appropriate to compare the DNN inputs. Solving this problem is non-trivial, as properties of DNNs make it non-linear and non-convex [25]. Thus, we craft adversarial samples by constructing a mapping from input perturbations to DNN output variations. Note that all research mentioned above took the opposite approach: they used output variations to find corresponding input perturbations. Our understanding of how changes made to inputs affect a deep neural network’s output stems from the forward derivative: a matrix we introduce and define as the Jacobian of the function learned by the DNN. The forward derivative is used to construct adversarial saliency maps indicating input features to include in perturbation δX in order to produce adversarial samples inducing the desired output from the DNN.


Approaches based on the forward derivative are much more powerful than gradient descent techniques used in prior systems. They are applicable to both supervised and unsupervised architectures and allow adversaries to generate information for broad families of adversarial samples. Indeed, adversarial saliency maps are versatile tools based on the forward derivative and designed with adversarial goals in mind, giving greater control to adversaries with respect to the choice of perturbations. In our work, we consider the following questions to formalize the security of deep learning: (1) “What is the minimal knowledge required to perform attacks against deep neural networks?”, (2) “How can vulnerable or resistant samples be identified?”, and (3) “How are adversarial samples perceived by humans?”


The adversarial sample generation algorithms are validated using the widely studied LeNet architecture (a pioneering DNN used for hand-written digit recognition [26]) and MNIST dataset [27]. We show that any input sample from any source class can be perturbed to be misclassified as any target class given by the adversary with 97.10% success while perturbing on average 4.02% of the input features per sample. The computational costs of the sample generation are modest; samples were each generated in less than a second in our setup. Lastly, we study the impact of our algorithmic parameters on distortion and human perception of samples. This paper makes the following contributions:


  • We formalize the space of adversaries against classifier DNNs with respect to adversarial goals and capabilities. Here, we provide a better understanding of how attacker capabilities constrain attack strategies and goals.


  • We introduce a new class of algorithms for crafting adversarial samples solely by using knowledge of the DNN architecture. These algorithms (1) exploit forward derivatives that inform the learned behavior of DNNs, and (2) build adversarial saliency maps enabling efficient explorations of the adversarial-samples space.


  • We validate the algorithms using a widely used computer vision DNN. We define and measure sample distortion and source-to-target hardness, and explore defenses against adversarial samples. We conclude by studying human perception of distorted samples.


2. Threat Model Taxonomy in Deep Learning


Classical threat models enumerate the goals and capabilities of adversaries in a target domain [22]. This section taxonomizes threat models in deep learning systems and positions several previous works with respect to the strength of the modeled adversary. We begin by providing an overview of DNNs highlighting their inputs, outputs, and function. We then consider the taxonomy presented in Figure 2.


2.1. About Deep Neural Networks


Deep neural networks are large neural networks organized into layers of neurons, corresponding to successive representations of the input data. A neuron is an individual computing unit transmitting to other neurons the result of the application of its activation function on its input. Neurons are connected by links with different weights and biases characterizing the strength between neuron pairs. Weights and biases can be viewed as deep neural network parameters used for information storage. We define a deep neural network architecture to include knowledge of the neural network topology, neuron activation functions, as well as weight and bias values. Weights and biases are determined during training by finding values that minimize a cost function c evaluated over the training dataset T. Deep Neural Network training is traditionally done by gradient descent using techniques derived from backpropagation [31].


Deep learning can be partitioned in two categories, depending on whether DNNs are trained in a supervised or unsupervised manner [29]. Supervised training leads to models that map unseen samples to a predefined set of outputs using a function inferred from labeled training data. On the contrary, unsupervised training learns representations of unlabeled training data, and resulting DNNs can be used to generate new samples, or to automate feature engineering by acting as a pre-processing layer for larger DNNs. We restrict ourselves to the problem of learning multi-class classifiers in supervised settings. These DNNs are given an input X and output a class probability vector Y. Note that our work remains valid for unsupervised DNNs; we leave a detailed study of this issue for future work.


Figure 3 illustrates an example shallow feedforward neural network. The network has two input neurons x1 and x2, a hidden layer with two neurons h1 and h2, and a single output neuron o. In other words, it is a simple multi-layer perceptron. Both input neurons x1 and x2 take real values in [0, 1] and correspond to the network input: a feature vector X = (x1, x2) ∈ [0, 1]^2. Hidden layer neurons each use the logistic sigmoid function φ : x → \frac{1}{1+e^{−x}} as their activation function. This function is frequently used in neural networks because it is continuous (and differentiable), demonstrates linear-like behavior around 0, and saturates as the input goes to ±∞. Neurons in the hidden layers apply the sigmoid to the weighted input layer: for instance, neuron h1 computes h1(X) = φ(z_{h1}(X)) with z_{h1}(X) = w11 x1 + w12 x2 + b1, where w11 and w12 are weights and b1 a bias. Similarly, the output neuron applies the sigmoid function to the weighted output of the hidden layer, where z_o(X) = w31 h1(X) + w32 h2(X) + b3. Weight and bias values are determined during training. Thus, the overall behavior of the network learned during training can be modeled as a function F : X → φ(z_o(X)).

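To make the mapping concrete, here is a minimal numpy sketch of this forward pass. The weight and bias values below are placeholders chosen for illustration, not parameters obtained by training.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid activation: phi(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def toy_network(X, W1, b1, W2, b3):
    """Forward pass of the shallow network of Figure 3.

    X  : input feature vector (x1, x2) in [0, 1]^2
    W1 : 2x2 hidden-layer weights (rows [w11, w12] and [w21, w22])
    b1 : hidden-layer biases (b1, b2)
    W2 : output weights (w31, w32)
    b3 : output bias
    """
    H = sigmoid(W1 @ X + b1)        # hidden activations h1(X), h2(X)
    return sigmoid(W2 @ H + b3)     # F(X) = phi(z_o(X))

# Placeholder parameters (illustrative only).
W1 = np.array([[1.0, 1.0], [0.0, 0.0]])
b1 = np.zeros(2)
W2 = np.array([30.0, -45.0])
b3 = 0.0
print(toy_network(np.array([1.0, 1.0]), W1, b1, W2, b3))  # close to 1
print(toy_network(np.array([0.0, 1.0]), W1, b1, W2, b3))  # below 0.5
```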

2.2. Adversarial Goals


Threats are defined with a specific function to be protected/defended. In the case of deep learning systems, the integrity of the classification is of paramount importance. Specifically, an adversary of a deep learning system seeks to provide an input X∗ that results in an incorrect output classification. The nature of the incorrectness represents the adversarial goal, as identified in the X-axis of Figure 2. Consider four goals that impact classifier output integrity:


  1. Confidence reduction - reduce the output confidence classification (thereby introducing class ambiguity)


  2. Misclassification - alter the output classification to any class different from the original class


  3. Targeted misclassification - produce inputs that force output classification into a specific target class. Continuing the example illustrated in Figure 1, the adversary would create a set of speckles classified as a digit.


  4. Source/target misclassification - force the output classification of a specific input to be a specific target class. Continuing the example from Figure 1, adversaries take an existing image of a digit and add a small set of speckles to classify the resulting image as another digit.


The scientific community recently started exploring adversarial deep learning. Previous work on other machine learning techniques is referenced later in Section 7.


Szegedy et al. introduced a system that generates adversarial samples by perturbing inputs in a way that creates source/target misclassifications [17], [35]. The perturbations made by their work, which focused on a computer vision application, are not distinguishable by humans – for example, small but carefully-crafted perturbations to an image of a vehicle resulted in the DNN classifying it as an ostrich. The authors named this modified input an adversarial image, which can be generalized as part of a broader definition of adversarial samples. When producing adversarial samples, the adversary’s goal is to generate inputs that are correctly classified (or not distinguishable) by humans or other classifiers, but are misclassified by the targeted DNN.


Another example is due to Nguyen et al., who presented a method for producing images that are unrecognizable to humans, but are nonetheless labeled as recognizable objects by DNNs [30]. For instance, they demonstrated how a DNN will classify a noise-filled image constructed using their technique as a television with high confidence. They named the images produced by this method fooling images. Here, a fooling image is one that does not have a source class but is crafted solely to perform a targeted misclassification attack.


2.3. Adversarial Capabilities


Adversaries are defined by the information and capabilities at their disposal. The following (and the Y-axis of Figure 2) describes a range of adversaries loosely organized by decreasing adversarial strength (and increasing attack difficulty). Note that we only consider attacks conducted at test time: any tampering of the training procedure is outside the scope of this paper and left as future work.


Training data and network architecture - This adversary has perfect knowledge of the DNN used for classification. The attacker has access to the training data T, functions and algorithms used for network training, and is able to extract knowledge about the DNN’s architecture F. This includes the number and type of layers, the activation functions of neurons, as well as weight and bias. It also knows which algorithm was used for network training, including the associated loss function c. This is the strongest adversary that can analyze the training data and simulate the DNN in toto.


Network architecture - This adversary has knowledge of the network architecture F and its parameter values. For instance, this corresponds to an adversary who can collect information about both (1) the layers and activation functions used to design the neural network, and (2) the weights and biases resulting from the training phase. This gives the adversary enough information to simulate the network. Our algorithms assume this threat model, and show a new class of algorithms that generate adversarial samples for supervised and unsupervised feedforward DNNs.


Training data - This adversary is able to collect a surrogate dataset, sampled from the same distribution as the original dataset used to train the DNN. However, the attacker is not aware of the architecture used to design the neural network. Thus, typical attacks conducted in this model would likely include training commonly deployed deep learning architectures using the surrogate dataset to approximate the model learned by the legitimate classifier.


Oracle - This adversary has the ability to use the neural network (or a proxy of it) as an “oracle”. Here the adversary can obtain output classifications from supplied inputs (much like a chosen-plaintext attack in cryptography). This enables differential attacks, where the adversary can observe the relationship between changes in inputs and outputs (continuing with the analogy, such as used in differential cryptanalysis) to adaptively craft adversarial samples. This adversary can be further parameterized by the number of absolute or rate-limited input/output trials they may perform.


Samples - This adversary has the ability to collect pairs of input and output related to the neural network classifier. However, it cannot modify these inputs to observe the difference in the output. To continue the cryptanalysis analogy, this threat model would correspond to a known plaintext attack. These pairs are labeled output data, and intuition states that they would most likely only be useful when available in very large quantities.


3. Approach


In this section, we present a general algorithm for modifying samples so that a DNN yields any desired adversarial output. We later validate this algorithm by having a classifier misclassify samples from a source class into a chosen target class. This algorithm captures adversaries crafting samples in the setting corresponding to the upper right-hand corner of Figure 2. We show that knowledge of the architecture and weight parameters is sufficient to derive adversarial samples against acyclic feedforward DNNs. This requires evaluating the DNN’s forward derivative in order to construct an adversarial saliency map that identifies the set of input features relevant to the adversary’s goal. Perturbing the features identified in this way quickly leads to the desired adversarial output, for instance, misclassification. Although we describe our approach with supervised DNNs used as classifiers, it also applies to unsupervised architectures.


3.1. Studying a Simple Neural Network


Recall the simple architecture introduced previously in section 2 and illustrated in Figure 3. Its low dimensionality allows us to better understand the underlying concepts behind our algorithms. We indeed show how small input perturbations found using the forward derivative can induce large variations of the neural network output. Assuming that the biases b1, b2, and b3 are null, we train this toy network to learn the Boolean AND function: the desired output is F(X) = x1 ∧ x2 with X = (x1, x2). Note that non-integer inputs are rounded to the closest integer, thus we have for instance 0.7 ∧ 0.3 = 0 or 0.8 ∧ 0.6 = 1. Using backpropagation on a set of 1,000 samples, corresponding to each case of the function (1∧1=1, 1∧0=0, 0∧1=0, and 0∧0=0), we train for 100 epochs using a learning rate η = 0.0663. The overall function learned by the neural network is plotted in Figure 4 for input values X ∈ [0, 1]^2. The horizontal axes represent the two input dimensions x1 and x2 while the vertical axis represents the network output F(X) corresponding to X = (x1, x2).

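The following numpy sketch reproduces this training setup under stated assumptions: the 1,000 samples are taken to be the four Boolean corners repeated, the cost is the squared error, and plain stochastic gradient descent is used. None of these details are specified in the text, so they are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1,000 training samples covering the four AND cases (the sampling is an assumption).
corners = np.array([[1., 1.], [1., 0.], [0., 1.], [0., 0.]])
X = np.tile(corners, (250, 1))
Y = X[:, 0] * X[:, 1]                       # AND labels: 1, 0, 0, 0

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 2))     # hidden-layer weights w11..w22
W2 = rng.normal(scale=0.5, size=2)          # output weights w31, w32
# Biases b1, b2, b3 are kept at zero, as assumed in the text.

eta = 0.0663                                # learning rate from the text
for epoch in range(100):                    # 100 epochs of plain SGD
    for x, y in zip(X, Y):
        h = sigmoid(W1 @ x)                 # hidden activations h1, h2
        o = sigmoid(W2 @ h)                 # network output F(x)
        delta_o = (o - y) * o * (1 - o)     # backprop for a squared-error cost
        delta_h = delta_o * W2 * h * (1 - h)
        W2 -= eta * delta_o * h
        W1 -= eta * np.outer(delta_h, x)

for x in corners:
    # Outputs should approximate AND(x1, x2) after rounding; how closely
    # depends on the random initialisation.
    print(x, float(sigmoid(W2 @ sigmoid(W1 @ x))))
```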

We are now going to demonstrate how to craft adversarial samples on this neural network. The adversary considers a legitimate sample X, classified as F(X) = Y by the network, and wants to craft an adversarial sample X∗ very similar to X, but misclassified as F(X∗) = Y∗ ≠ Y. Recall that we formalized this problem as:

\arg\min_{\delta X} \lVert \delta X \rVert \ \text{ s.t. } \ F(X + \delta X) = Y^{*}


where X∗ = X + δX is the adversarial sample, Y∗ is the desired adversarial output, and \lVert \cdot \rVert is a norm appropriate to compare points in the input domain. Informally, the adversary is searching for small perturbations of the input that will induce a modification of the output into Y∗. Finding these perturbations can be done using optimization techniques or even brute force. However such solutions are hard to implement for deep neural networks because of non-convexity and non-linearity [25]. Instead, we propose a systematic approach stemming from the forward derivative.


We define the forward derivative as the Jacobian matrix of the function F learned by the neural network during training. For this example, the output of F is one-dimensional, so the matrix reduces to a vector:

\nabla F(X) = J_{F}(X) = \left[ \frac{\partial F(X)}{\partial x_{1}}, \frac{\partial F(X)}{\partial x_{2}} \right]


Both components of this vector are computable using the adversary’s knowledge, and later we show how to compute this term efficiently. The forward derivative for our example network is illustrated in Figure 5, which plots the gradient for the second component \frac{∂F(X)}{∂x_{2}} on the vertical axis against x1 and x2 on the horizontal axes. We omit the plot for \frac{∂F(X)}{∂x_{1}} because F is approximately symmetric on its two inputs, making the first component redundant for our purposes. This plot makes it easy to visualize the divide between the network’s two possible outputs in terms of values assigned to the input feature x2: 0 to the left of the spike, and 1 to its right. Notice that this aligns with Figure 4, and gives us the information needed to achieve our goal: find input perturbations that drive the output closer to a desired value.


Consulting Figure 5 alongside our example network, we can confirm this intuition by looking at a few sample points. Consider X = (1, 0.37) and X∗ = (1, 0.43), which are both located near the spike in Figure 5. Although they only differ by a small amount (δx2 = 0.05), they cause a significant change in the network’s output, as F(X)=0.11 and F(X∗)=0.95. Recalling that we round the inputs and outputs of this network so that it agrees with the Boolean AND function, we see that X* is an adversarial sample: after rounding, X∗ = (1, 0) and F(X∗)=1. Just as importantly, the forward derivative tells us which input regions are unlikely to yield adversarial samples, and are thus more immune to adversarial manipulations. Notice in Figure 5 that when either input is close to 0, the forward derivative is small. This aligns with our intuition that it will be more difficult to find adversarial samples close to (1, 0) than to (1, 0.4). This tells the adversary to focus on features corresponding to larger forward derivative values in a given input when constructing a sample, making its search more efficient and ultimately leading to smaller overall distortions.

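Continuing the sketch, the forward derivative of the toy network can be written in closed form with the chain rule. The parameters below are hand-picked stand-ins that implement AND after rounding, not the values actually learned in the paper, so the printed numbers will differ from those quoted in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    h = sigmoid(W1 @ x)              # hidden layer (biases assumed zero)
    return sigmoid(W2 @ h), h

def forward_derivative(x, W1, W2):
    """Forward derivative of the 2-2-1 network: returns (dF/dx1, dF/dx2)."""
    o, h = forward(x, W1, W2)
    dh_dx = (h * (1 - h))[:, None] * W1      # 2x2 matrix of dh_k/dx_i
    return o * (1 - o) * (W2 @ dh_dx)        # chain rule through the output

# Hypothetical stand-in parameters (zero biases, AND-like behaviour after rounding).
W1 = np.array([[1.0, 1.0], [0.0, 0.0]])
W2 = np.array([30.0, -45.0])

for x in (np.array([1.0, 0.37]), np.array([1.0, 0.43])):
    F, _ = forward(x, W1, W2)
    print(x, float(F), forward_derivative(x, W1, W2))
```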

The takeaways of this example are thereby: (1) small input variations can lead to extreme neural network output variations, (2) not all regions from the input domain are conducive to find adversarial samples, and (3) the forward derivative reduces the adversarial-sample search space.


3.2. Generalizing to Deep Neural Networks


We now generalize this approach to any feedforward DNN, using assumptions and adversary model identical to Section 3.1. The only assumptions made on the architecture are that its neurons form an acyclic DNN, and each uses a differentiable activation function. Note that this last assumption is not limiting because differentiability is required for training. In Figure 6, we give an example of a feedforward deep neural network architecture and define some notations used throughout the remainder of the paper. Most importantly, the N-dimensional function F learned by the DNN during training assigns an output Y = F(X) when given an M-dimensional input X. We denote by n the number of hidden layers. Layers are indexed by k ∈ 0..n+1 such that k = 0 is the index of the input layer, k ∈ 1..n corresponds to hidden layers, and k = n+1 indexes the output layer.


Algorithm 1 shows our process for constructing adversarial samples. As input, the algorithm takes a benign sample X, a target adversarial output Y∗, an acyclic feedforward DNN F, a maximum distortion parameter Υ, and a feature variation parameter θ. It returns new adversarial sample X∗ such that F(X∗) = Y∗, and proceeds in three basic steps: (1) compute the forward derivative JF(X∗), (2) construct a saliency map S based on the forward derivative, and (3) modify an input feature imax by θ. This process is repeated until the network outputs Y∗ or the maximum distortion Υ is reached. We now successively detail each step.

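A Python sketch of this outer loop is given below. It follows the three steps just described; the helper names, signatures, and the rule for removing saturated features from the search space are illustrative rather than taken verbatim from Algorithm 1.

```python
import numpy as np

def craft_adversarial(X, target, predict, jacobian, saliency_map,
                      theta, max_distortion):
    """Sketch of the crafting loop: perturb salient features until the
    DNN outputs the target class or the distortion budget is exhausted.

    predict        : callable returning the DNN's class probabilities F(X)
    jacobian       : callable returning the forward derivative J_F(X), shape (M, N)
    saliency_map   : callable (J, target, search_domain) -> feature index to modify
    theta          : variation applied to each selected feature
    max_distortion : maximum fraction of the M features that may be changed
    """
    X_adv = X.copy()
    search_domain = set(range(X_adv.size))              # features still eligible
    max_iter = int(max_distortion * X_adv.size)
    for _ in range(max_iter):
        if np.argmax(predict(X_adv)) == target or not search_domain:
            break
        J = jacobian(X_adv)                              # step 1: forward derivative
        i = saliency_map(J, target, search_domain)       # step 2: saliency map
        X_adv[i] = np.clip(X_adv[i] + theta, 0.0, 1.0)   # step 3: perturb feature i
        if X_adv[i] in (0.0, 1.0):                       # saturated features leave the
            search_domain.discard(i)                     # search space (illustrative rule)
    return X_adv
```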

3.2.1. Forward Derivative of a Deep Neural Network.


The first step is to compute the forward derivative for the given sample X. As introduced previously, this is given by:

J_{F}(X) = \frac{\partial F(X)}{\partial X} = \left[ \frac{\partial F_{j}(X)}{\partial x_{i}} \right]_{i \in 1..M,\, j \in 1..N} \quad (3)


This is essentially the Jacobian of the function corresponding to what the neural network learned during training. The forward derivative computes gradients that are similar to those computed for backpropagation, but with two important distinctions: we take the derivative of the network directly, rather than of its cost function, and we differentiate with respect to the input features rather than the network parameters. As a consequence, instead of propagating gradients backwards, we choose in our approach to propagate them forward, as this allows us to find input components that lead to significant changes in network outputs.


Our goal is to express J_F(X) in terms of X and constant values only. To simplify our expressions, we now consider one element (i, j) ∈ [1..M] × [1..N] of the M × N forward derivative matrix defined in Equation 3: that is, the derivative of one output neuron F_j according to one input dimension x_i. Of course our results are true for any matrix element. We start at the first hidden layer of the DNN. We can differentiate the output of this first hidden layer in terms of the input components. We then recursively differentiate each hidden layer k ∈ 2..n in terms of the previous one:

\frac{\partial H_{k}(X)}{\partial x_{i}} = \left[ \frac{\partial f_{k,p}\left(W_{k,p} \cdot H_{k-1} + b_{k,p}\right)}{\partial x_{i}} \right]_{p}


where H_k is the output vector of hidden layer k and f_{k,j} is the activation function of neuron j in layer k. Each neuron p on a hidden or output layer indexed k ∈ 1..n+1 is connected to the previous layer k − 1 using weights defined in vector W_{k,p}. By defining the weight matrix accordingly, we can define fully or sparsely connected interlayers, thus modeling a variety of architectures. Similarly, we write b_{k,p} the bias for neuron p of layer k. By applying the chain rule, we can write a series of formulae for k ∈ 2..n:

\frac{\partial f_{k,p}\left(W_{k,p} \cdot H_{k-1} + b_{k,p}\right)}{\partial x_{i}} = f_{k,p}'\left(W_{k,p} \cdot H_{k-1} + b_{k,p}\right) \times \left(W_{k,p} \cdot \frac{\partial H_{k-1}}{\partial x_{i}}\right)


We are thus able to express \frac{\partial H_{n}}{\partial x_{i}}. We know that output neuron j computes the following expression:

F_{j}(X) = f_{n+1,j}\left(W_{n+1,j} \cdot H_{n} + b_{n+1,j}\right)


Thus, we apply the chain rule again to obtain J_F[i, j](X):

J_{F}[i,j](X) = \frac{\partial F_{j}(X)}{\partial x_{i}} = f_{n+1,j}'\left(W_{n+1,j} \cdot H_{n} + b_{n+1,j}\right) \times \left(W_{n+1,j} \cdot \frac{\partial H_{n}}{\partial x_{i}}\right) \quad (6)


Figure 7: Saliency map of a 784-dimensional input to the LeNet architecture (cf. the validation section). The 784 input dimensions are arranged to correspond to the 28x28 pixel alignment of the image. Large absolute values indicate features that, when perturbed, have a significant impact on the output.

In this formula, according to our threat model, all terms are known but one: \frac{\partial H_{n}}{\partial x_{i}}. This is precisely the term we computed recursively. By plugging these results for successive layers back in Equation 6, we get an expression for component (i, j) of the DNN’s forward derivative. Hence, the forward derivative J_F(X) of a network F can be computed for any input X by successively differentiating layers from the input layer to the output layer. We later discuss in our methodology evaluation the computability of J_F(X) for state-of-the-art DNN architectures. Notably, the forward derivative can be computed using symbolic differentiation.

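As a concrete illustration of this layer-by-layer propagation, the following numpy sketch accumulates the derivative of each layer's output with respect to the input, from the input layer forward, for a small fully connected network. The sigmoid activation and the layer sizes are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_derivative(X, weights, biases):
    """Forward derivative J_F(X) of a fully connected feedforward network.

    weights[k-1], biases[k-1] hold W_k and b_k for layers k = 1..n+1; every
    layer is assumed to use the sigmoid activation (an illustrative choice).
    Returns an M x N matrix whose entry (i, j) is dF_j(X) / dx_i.
    """
    H = X
    dH_dX = np.eye(X.size)                 # dH_0/dX is the identity (H_0 = X)
    for W, b in zip(weights, biases):
        Z = W @ H + b                      # pre-activations of layer k
        H = sigmoid(Z)                     # H_k
        # Chain rule: dH_k/dX = diag(sigmoid'(Z)) . W . dH_{k-1}/dX
        dH_dX = (H * (1 - H))[:, None] * W @ dH_dX
    return dH_dX.T                         # entry (i, j) = dF_j/dx_i

# Tiny random network: 3 inputs -> 4 hidden -> 2 outputs (sizes are illustrative).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=2)]
X = rng.uniform(size=3)
print(forward_derivative(X, weights, biases))   # 3x2 forward derivative
```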

3.2.2. Adversarial Saliency Maps.


We extend saliency maps previously introduced as visualization tools [33] to construct adversarial saliency maps. These maps indicate which input features an adversary should perturb in order to effect the desired changes in network output most efficiently. They are thus versatile tools that allow adversaries to generate broad classes of adversarial samples.


Adversarial saliency maps are defined to suit problem-specific adversarial goals. For instance, we later study a network used as a classifier: its output is a probability vector across classes, where the final predicted class value corresponds to the component with the highest probability:

label(X) = \arg\max_{j} F_{j}(X) \quad (7)


In our case, the saliency map is therefore based on the forward derivative, as this gives the adversary the information needed to cause the neural network to misclassify a given sample. More precisely, the adversary wants to misclassify a sample X such that it is assigned a target class t ≠ label(X). To do so, the probability F_t(X) of target class t assigned by F must be increased while the probabilities F_j(X) of all other classes j ≠ t decrease, until t = \arg\max_{j} F_{j}(X). The adversary can accomplish this by increasing input features using the following saliency map S(X, t):

S(X, t)[i] = \begin{cases} 0 & \text{if } \frac{\partial F_{t}(X)}{\partial X_{i}} < 0 \text{ or } \sum_{j \neq t} \frac{\partial F_{j}(X)}{\partial X_{i}} > 0 \\ \frac{\partial F_{t}(X)}{\partial X_{i}} \left| \sum_{j \neq t} \frac{\partial F_{j}(X)}{\partial X_{i}} \right| & \text{otherwise} \end{cases} \quad (8)


where i is an input feature, and J_{ij}(X) denotes J_F[i, j] = \frac{\partial F_{j}(X)}{\partial X_{i}}. The condition specified on the first line rejects input components with a negative target derivative or an overall positive derivative on other classes. Indeed, J_{it}(X) should be positive in order for F_t(X) to increase when feature X_i increases. Similarly, \sum_{j \neq t} J_{ij}(X) needs to be negative to decrease or stay constant when feature X_i is increased. The product on the second line allows us to consider all other forward derivative components together in such a way that we can easily compare S(X, t)[i] for all input features. In summary, high values of S(X, t)[i] correspond to input features that will either increase the target class, or decrease other classes significantly, or both. By increasing these input features, the adversary eventually misclassifies the sample into the target class. A saliency map example is shown in Figure 7.

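A small numpy sketch of the map in Equation 8 follows; the Jacobian used in the example is random and purely illustrative.

```python
import numpy as np

def saliency_map_increase(J, t):
    """Adversarial saliency map of Equation 8 for features to increase.

    J : forward derivative, shape (M, N), with J[i, j] = dF_j(X)/dX_i
    t : index of the adversarial target class
    Returns S(X, t) as a vector of length M.
    """
    target = J[:, t]                      # dF_t/dX_i for every feature i
    others = J.sum(axis=1) - target       # sum over j != t of dF_j/dX_i
    S = target * np.abs(others)
    # First line of Equation 8: reject features with a negative target
    # derivative or a positive summed derivative over the other classes.
    S[(target < 0) | (others > 0)] = 0.0
    return S

# Example: most salient feature of a random (and purely illustrative) Jacobian.
rng = np.random.default_rng(0)
J = rng.normal(size=(784, 10))
print(int(np.argmax(saliency_map_increase(J, t=7))))
```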

It is possible to define other adversarial saliency maps using the forward derivative, and the quality of the map can have a large impact on the amount of distortion that Algorithm 1 introduces; we will study this in more detail later. Before moving on, we introduce an additional map that acts as a counterpart to the one given in Equation 8 by finding features that the adversary should decrease to achieve misclassification. The only difference lies in the constraints placed on the forward derivative values and the location of the absolute value in the second line:

S(X, t)[i] = \begin{cases} 0 & \text{if } \frac{\partial F_{t}(X)}{\partial X_{i}} > 0 \text{ or } \sum_{j \neq t} \frac{\partial F_{j}(X)}{\partial X_{i}} < 0 \\ \left| \frac{\partial F_{t}(X)}{\partial X_{i}} \right| \sum_{j \neq t} \frac{\partial F_{j}(X)}{\partial X_{i}} & \text{otherwise} \end{cases} \quad (9)


3.2.3. Modifying samples.


Once an input feature has been identified by an adversarial saliency map, it needs to be perturbed to realize the adversary’s goal. This is the last step in each iteration of Algorithm 1, and the amount by which the selected feature is perturbed (θ in Algorithm 1) is also problem-specific. We discuss in Section 4 how this parameter should be set in an application to computer vision. Lastly, the maximum number of iterations, which is equivalent to the maximum distortion allowed in a sample, is specified by parameter Υ. It limits the number of features changed to craft an adversarial sample and can take any positive integer value smaller than the number of features. Finding the right value for Υ requires considering the impact of distortion on humans’ perception of adversarial samples – too much distortion or specific distortion patterns might cause adversarial samples to be easily identified by humans.


4. Application of the Approach


We formally described a class of algorithms for crafting adversarial samples misclassified by DNNs using three tools: the forward derivative, adversarial saliency maps, and the crafting algorithm. We now apply these tools to a DNN used for a computer vision classification task: handwritten digit recognition. We show that our algorithms successfully craft adversarial samples from any source class to any given target class, which for this application means that any digit can be perturbed so that it is misclassified as any other digit.


We investigate a DNN based on the well-studied LeNet architecture, which has proven to be an excellent classifier for handwritten digits [26]. Recent architectures like AlexNet [24] or GoogLeNet [34] heavily rely on convolutional layers introduced in the LeNet architecture, thus making LeNet a relevant DNN to validate our approach. We have no reason to believe that our method will not perform well on larger architectures. The network input is black and white images (28x28 pixels) of handwritten digits, which are flattened as vectors of 784 features, where each feature corresponds to a pixel intensity taking normalized values between 0 and 1. This input is processed by a succession of a convolutional layer (20 then 50 kernels of 5x5 pixels) and a pooling layer (2x2 filters) repeated twice, a fully connected hidden layer (500 neurons), and a softmax output layer (10 neurons). The output is a 10 class probability vector, where each class corresponds to a digit from 0 to 9, as shown in Figure 8. The deep neural network then labels the input image with the class assigned the maximum probability, as shown in Equation 7. We train our network using the MNIST training dataset of 60,000 samples [27].

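For concreteness, one possible way to instantiate this architecture in Keras is sketched below. The layer sizes follow the text, while the choice of framework, activation functions, and training configuration are assumptions not stated here.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # 28x28 grayscale digit images
    layers.Conv2D(20, (5, 5), activation="relu"),  # first convolutional layer (20 kernels)
    layers.MaxPooling2D((2, 2)),                   # 2x2 pooling
    layers.Conv2D(50, (5, 5), activation="relu"),  # second convolutional layer (50 kernels)
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(500, activation="relu"),          # fully connected hidden layer
    layers.Dense(10, activation="softmax"),        # 10-class probability vector
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```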

We attempt to determine whether, using the framework introduced in previous sections, we can effectively craft adversarial samples misclassified by the DNN. For instance, if we have an image X of a handwritten digit 0 classified by the network as label(X)=0 and the adversary wishes to craft an adversarial sample X∗ based on this image classified as label(X∗)=7, the source class is 0 and the target class is 7. Ideally, the crafting process must find the smallest perturbation δX required to construct the adversarial sample X∗ = X+δX. A perturbation is a set of pixel intensities – or input feature variations – that are added to X in order to craft X∗. Note that perturbations introduced to craft adversarial samples must remain indistinguishable to humans.


4.1. Crafting algorithm


Algorithm 2 shows the crafting algorithm used in our experiments, which we implemented in Python (see Appendix A for more information regarding the implementation). It is based on Algorithm 1, but several details have been changed to accommodate our digit recognition problem. Given a network F, Algorithm 2 iteratively modifies a sample X by perturbing two input features (i.e., pixel intensities) p1 and p2 selected by saliency_map. The saliency map is constructed and updated between each iteration of the algorithm using the DNN’s forward derivative JF(X∗). The algorithm halts when one of the following conditions is met: (1) the adversarial sample is classified by the DNN with the target class t, (2) the maximum number of iterations max_iter has been reached, or (3) the feature search domain Γ is empty.

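The following Python sketch mirrors this loop for the digit-recognition setting: pairs of pixels are perturbed until the target class is reached, the distortion budget is spent, or the search domain is empty. Helper names and signatures are illustrative, not the exact implementation referenced in Appendix A.

```python
import numpy as np

def craft_digit(X, target, predict, jacobian, saliency_map, theta=1.0, gamma=0.145):
    """Sketch of the crafting loop for MNIST digits (names are illustrative).

    X            : flattened 784-pixel image with intensities in [0, 1]
    predict      : callable returning the 10-class output vector for an image
    jacobian     : callable returning the 784 x 10 forward derivative
    saliency_map : callable (J, target, search_domain) -> pair of pixel indices
    theta        : variation applied to each selected pixel (+1 here)
    gamma        : maximum distortion as a fraction of the 784 pixels
    """
    X_adv = X.copy()
    search_domain = set(np.flatnonzero(X_adv < 1.0))   # pixels that can still grow
    max_iter = int(np.floor(784 * gamma / 2))          # two pixels per iteration
    for _ in range(max_iter):
        if np.argmax(predict(X_adv)) == target or not search_domain:
            break
        p1, p2 = saliency_map(jacobian(X_adv), target, search_domain)
        for p in (p1, p2):
            X_adv[p] = np.clip(X_adv[p] + theta, 0.0, 1.0)
            if X_adv[p] == 1.0:                        # saturated pixels leave
                search_domain.discard(p)               # the search space
    return X_adv
```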

The crafting algorithm is fine-tuned by three parameters:


  • Maximum distortion Υ: this defines when the algorithm should stop modifying the sample in order to reach the adversarial target class. The maximum distortion, expressed as a percentage, corresponds to the maximum number of pixels to be modified when crafting the adversarial sample. Assuming two additional pixels are modified per iteration, the maximum number of iterations max_iter is as follows:

    max\_iter = \left\lfloor \frac{784 \cdot \Upsilon}{2 \cdot 100} \right\rfloor

    where 784 = 28 × 28 is the dimension of a sample (a short worked computation follows this list).

  • Saliency map: subroutine saliency_map generates a map defining which input features will be modified at each iteration. Policies used to generate saliency maps vary with the nature of the data handled by the considered DNN, as well as the adversarial goals. We provide a subroutine example later in Algorithm 3.


  • Feature variation per iteration θ: once input features have been selected using the saliency map, they must be modified. The variation θ introduced to these features is another parameter that the adversary must set, in accordance with the saliency maps she uses.

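As a quick check of the maximum-distortion formula above (assuming Υ is given as a percentage and two pixels change per iteration):

```python
from math import floor

def max_iterations(upsilon_percent, n_features=784, pixels_per_iter=2):
    # Number of crafting iterations allowed under a maximum distortion of
    # upsilon_percent percent of the n_features input features.
    return int(floor(n_features * upsilon_percent / (100 * pixels_per_iter)))

print(max_iterations(14.5))   # 56 iterations for the 14.5% cap used in Section 5
```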

The problem of finding good values for these parameters is a goal of our current evaluation, and is discussed later in Section 5. For now, note that human perception is a limiting factor as it limits the acceptable maximum distortion and feature variation introduced. We now show the application of our framework with two different adversarial strategies.


4.2. Crafting by increasing pixel intensities


The first strategy to craft adversarial samples is based on increasing the intensity of some pixels. To achieve this purpose, we consider 10 samples of handwritten digits from the MNIST test set, one from each digit class 0 to 9. We use this small subset of samples to illustrate our techniques. We scale up the evaluation to the entire dataset in Section 5. Our goal is to report whether we can reach any adversarial target class for a given source class. For instance, if we are given a handwritten 0, we increase some of the pixel intensities to produce 9 adversarial samples respectively classified in each of the classes 1 to 9. All pixel intensities changed are increased by θ = +1. We discuss this choice in section 5. We allow for an unlimited maximum distortion Υ = ∞. We simply measure for each of the 90 source-target class pairs whether an adversarial sample can be produced or not.


The adversarial saliency map used in the crafting algorithm to select pixel pairs that can be increased is an application of the map introduced in the general case of classification in Equation 8. The map aims to find pairs of pixels (p1, p2) using the following heuristic:


where t is the index of the target class, the left operand of the multiplication operation is constrained to be positive, and the right operand of the multiplication operation is constrained to be negative. This heuristic, introduced in the previous section of this manuscript, searches for pairs of pixels increasing the target class output while reducing the summed output of all other classes when simultaneously increased. The pseudocode of the corresponding subroutine saliency_map is given in Algorithm 3.

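A sketch of such a saliency_map subroutine over pixel pairs is given below; it follows the conditions described in the text (positive summed target derivative, negative summed derivative over the other classes), with the variable names and the exhaustive pair search being illustrative.

```python
import itertools
import numpy as np

def saliency_map_pairs(J, t, search_domain):
    """Sketch of a pairwise saliency_map subroutine in the spirit of Algorithm 3.

    J : forward derivative, shape (784, 10), with J[i, j] = dF_j(X)/dX_i
    t : target class index
    search_domain : iterable of pixel indices that may still be modified
    Returns the pair (p1, p2) maximising the pairwise heuristic, or None.
    """
    target = J[:, t]
    others = J.sum(axis=1) - target
    best, best_pair = -np.inf, None
    for p1, p2 in itertools.combinations(sorted(search_domain), 2):
        alpha = target[p1] + target[p2]      # summed effect on the target class
        beta = others[p1] + others[p2]       # summed effect on all other classes
        if alpha > 0 and beta < 0 and -alpha * beta > best:
            best, best_pair = -alpha * beta, (p1, p2)
    return best_pair

# Example on a random Jacobian restricted to a few candidate pixels.
rng = np.random.default_rng(0)
J = rng.normal(size=(784, 10))
print(saliency_map_pairs(J, t=3, search_domain=range(50)))
```

The exhaustive pair search is quadratic in the number of candidate pixels, which is why considering larger groups of features quickly becomes expensive, as noted below.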

The saliency map considers pairs of pixels and not individual pixels because selecting pixels one at a time is too strict, and very few pixels would meet the heuristic search criteria described in Equation 8. Searching for pairs of pixels is more likely to match the condition: one pixel can compensate a minor flaw of the other pixel. Let’s consider an example: p1 has a target derivative of 5 but a sum of other class derivatives equal to 0.1, while p2 has a target derivative equal to −0.5 and a sum of other class derivatives equal to −6. Individually, these pixels do not match the saliency map’s criteria stated in Equation 8, but combined, the pair does match the saliency criteria defined in Equation 10. One would also envision considering larger groups of input features to define saliency maps. However, this comes at increased computational costs as more combinations need to be considered when the group size is increased.


In our algorithm implementation, we compute the DNN forward derivative using the last hidden layer instead of the output probability layer. This is justified by the extreme variations introduced by the logistic regression computed between these two layers to ensure probabilities sum up to 1, leading to extreme derivative values. This reduces the quality of information on how the neurons are activated by different inputs and causes the forward derivative to lose accuracy when generating saliency maps. Better results are achieved when working with the last hidden layer, also made up of 10 neurons, each corresponding to one digit class 0 to 9. This justifies enforcing constraints on the forward derivative. Indeed, as the output layer used for computing the forward derivative does not sum up to 1, increasing F_t(X) does not imply that \sum_{j \neq t} F_{j}(X) will decrease, and vice-versa.


The algorithm is able to craft successful adversarial samples for all 90 source-target class pairs. Figure 1 shows the 90 adversarial samples obtained as well as the 10 original samples used to craft them. The original samples are found on the diagonal. A sample on row i and column j, when i ≠ j, is a sample crafted from an image originally classified as source class i to be misclassified as target class j.


To verify the validity of our algorithms, and of our adversarial saliency maps, we run a simple experiment. We run the crafting algorithm on an empty input (all pixel intensities initially set to 0) and craft one adversarial sample for each class from 0 to 9. The different samples shown in Figure 9 demonstrate how adversarial saliency maps are able to identify input features relevant to classification in a class.


4.3. Crafting by decreasing pixel intensities


Instead of increasing pixel intensities to achieve the adversarial targets, the second adversarial strategy decreases pixel intensities by θ = −1. The implementation is identical with the exception of the adversarial saliency maps. The formula is the same as previously written in Equation 10 but the constraints are different: the left operand of the multiplication operation is now constrained to be negative, and the right operand to be positive. This heuristic, also introduced in Section 3, searches for pairs of pixels producing an increase in the target class output while reducing the sum of the output of all other classes when simultaneously decreased.


The algorithm is once again able to craft successful adversarial samples for all source-target class pairs. Figure 10 shows the 90 adversarial samples obtained as well as the 10 original samples used to craft them. One observation made is that the distortion introduced by reducing pixel intensities seems harder to detect by the human eye. We address the human perception aspect with a study later in Section 5.


5. Evaluation


We now use our experimental setup to answer the following questions: (1) “Can we exploit any sample?”, (2) “How can we identify samples more vulnerable than others?” and (3) “How do humans perceive adversarial samples compared to DNNs?”. Our primary result is that adversarial samples can be crafted reliably for our validation problem with a 97.10% success rate by modifying samples on average by 4.02%. We define a hardness measure to identify sample classes easier to exploit than others. This measure is necessary for designing robust defenses. We also found that humans cannot perceive the perturbation introduced to craft adversarial samples misclassified by the DNN: they still correctly classify adversarial samples crafted with a distortion smaller than 14.29%.


5.1. Crafting large amounts of adversarial samples


Now that we previously showed the feasibility of crafting adversarial samples for all source-target class pairs, we seek to measure whether the crafting algorithm can successfully handle large quantities of distinct samples of hand-written digits. That is, we now design a set of experiments to evaluate whether or not all legitimate samples in the MNIST dataset can be exploited by an adversary to produce adversarial samples. We run our crafting algorithm on three sets of 10,000 samples each extracted from one of the three MNIST training, validation, and test subsets. For each of these samples, we craft 9 adversarial samples, each of them classified in one of the 9 target classes distinct from the original legitimate class. Thus, we generate 90,000 samples for each set, leading to a total of 270,000 adversarial samples. We set the maximum distortion to Υ = 14.5% and pixel intensities are increased by θ = +1. The maximum distortion was fixed after studying the effect of increasing it on the success rate τ. We found that 97.1% of the adversarial samples could be crafted with a distortion of less than 14.5% and observed that the success rate did not increase significantly for larger maximum distortions. Parameter θ was set to +1 after observing that decreasing it or giving it negative values increased the number of features modified, whereas we were interested in reducing the number of features altered during crafting. One will also notice that because features are normalized between 0 and 1, if we introduce a variation of θ = +1, we always set pixels to their maximum value 1. This justifies why in Algorithm 2, we remove modified pixels from the search space at the end of each iteration. The impact on performance is beneficial, as we reduce the size of the feature search space at each iteration. In other words, our algorithm performs a best-first heuristic search without backtracking.

既然我们之前已经展示了为所有源-目标类对制作对抗样本的可行性,现在我们想衡量制作算法能否成功处理大量不同的手写数字样本。也就是说,我们设计一组实验来评估攻击者是否能够利用 MNIST 数据集中的所有合法样本来生成对抗样本。我们在三组各 10,000 个样本上运行制作算法,这三组样本分别取自 MNIST 的训练、验证和测试子集。对于每个样本,我们制作 9 个对抗样本,分别被分类为不同于原始合法类的 9 个目标类之一。因此,每组生成 90,000 个样本,总计 270,000 个对抗样本。我们将最大失真设置为 Υ = 14.5%,并以 θ = +1 增加像素强度。最大失真是在研究了增大它对成功率 τ 的影响之后确定的:我们发现 97.1% 的对抗样本可以在失真小于 14.5% 的情况下制作出来,并且观察到更大的最大失真并不会显著提高成功率。参数 θ 被设置为 +1,是因为我们观察到减小它或取负值会增加被修改的特征数量,而我们希望减少制作过程中被修改的特征数量。还应注意,由于特征被归一化到 0 和 1 之间,如果引入 θ = +1 的变化,我们总是把像素设置为其最大值 1。这就解释了为什么在算法 2 中,我们在每次迭代结束时把已修改的像素从搜索空间中移除。这样做对性能有利,因为我们在每次迭代时都缩小了特征搜索空间。换句话说,我们的算法执行的是一种无回溯的最佳优先启发式搜索。
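The following sketch illustrates the search strategy described above: saturated features are removed from the search space, so the search proceeds best-first without backtracking. It is a simplified single-feature variant with hypothetical `saliency_fn` and `predict_fn` helpers, not the paper's exact Algorithm 2 (which scores pixel pairs):

```python
import numpy as np

def craft_sample(x, target, saliency_fn, predict_fn,
                 theta=1.0, max_distortion=0.145):
    """Simplified single-feature sketch of the search strategy described
    above. `saliency_fn(x, target)` returns per-feature saliency scores and
    `predict_fn(x)` returns the DNN's predicted class; both are assumptions."""
    x_adv = x.astype(float).copy()
    search_space = set(range(x_adv.size))          # indices of unmodified features
    max_features = int(max_distortion * x_adv.size)
    modified = 0

    while predict_fn(x_adv) != target and modified < max_features and search_space:
        scores = saliency_fn(x_adv, target)        # one score per input feature
        i = max(search_space, key=lambda j: scores[j])
        x_adv.flat[i] = min(1.0, x_adv.flat[i] + theta)   # theta = +1 saturates to 1
        search_space.discard(i)                    # best-first, no backtracking
        modified += 1

    return x_adv, modified / x_adv.size            # adversarial sample, distortion
```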

We measure the success rate τ and distortion of adversarial samples on the three sets of 10,000 samples. The success rate τ is defined as the percentage of adversarial samples that were successfully classified by the DNN as the adversarial target class. The distortion is defined to be the percentage of pixels modified in the legitimate sample to obtain the adversarial sample. In other words, it is the percentage of input features modified in order to obtain adversarial samples. We compute two average distortion values: one taking into account all samples and a second one, denoted by ε, only taking into account successful samples. Figure 11 presents the results for the three sets from which the original samples were extracted. Results are consistent across all sets. On average, the success rate is τ = 97.10%, the average distortion of all adversarial samples is 4.44%, and the average distortion of successful adversarial samples is ε = 4.02%. This means that on average 32 out of 784 pixels are modified to craft a successful adversarial sample. The first distortion is higher because it includes unsuccessful samples, for which the crafting algorithm used the maximum distortion Υ, but was unable to induce a misclassification.

我们在三组各 10,000 个样本上测量对抗样本的成功率 τ 和失真。成功率 τ 定义为被 DNN 成功分类为对抗目标类的对抗样本所占的百分比。失真定义为为了得到对抗样本而在合法样本中修改的像素所占的百分比,换句话说,就是为获得对抗样本而修改的输入特征的百分比。我们计算两个平均失真值:一个考虑所有样本,另一个记为 ε,只考虑成功的样本。图 11 给出了三组原始样本的结果,所有集合的结果是一致的。平均而言,成功率为 τ = 97.10%,全部对抗样本的平均失真为 4.44%,成功对抗样本的平均失真为 ε = 4.02%。这意味着平均需要修改 784 个像素中的 32 个才能制作出一个成功的对抗样本。第一个失真值更高,是因为它包含了不成功的样本:对于这些样本,制作算法用尽了最大失真 Υ,却仍未能导致误分类。
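For concreteness, the success rate and the two distortion averages could be computed along these lines, assuming a hypothetical per-sample bookkeeping format:

```python
import numpy as np

def summarize_crafting(results, num_features=784):
    """Sketch of the reported metrics. `results` is an assumed bookkeeping
    format: one (success: bool, num_modified_features: int) pair per crafted
    adversarial sample."""
    success = np.array([s for s, _ in results], dtype=bool)
    distortion = np.array([m for _, m in results], dtype=float) / num_features

    tau = success.mean()                    # success rate over all crafted samples
    avg_all = distortion.mean()             # includes failed samples (capped at Upsilon)
    epsilon = distortion[success].mean()    # successful samples only
    return tau, avg_all, epsilon
```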

We also studied the crafting of 9,000 adversarial samples using the decreasing saliency map. We found that the success rate τ = 64.7% was lower and the average distortion ε = 3.62% was slightly lower. Again, decreasing pixel intensities is less successful at producing the desired adversarial behavior than increasing pixel intensities. Intuitively, this can be understood because removing pixels reduces the information entropy in an already sparse image, thus making it harder for DNNs to extract the information required to classify the sample. Greater absolute values of intensity variations are more confidently misclassified by the DNN.

我们还研究了使用递减显著性映射制作 9,000 个对抗样本的情况。我们发现成功率 τ = 64.7% 更低,平均失真 ε = 3.62% 略低。可见,降低像素强度在产生所需对抗行为方面不如增加像素强度成功。直观上这可以理解:去除像素会降低本已稀疏的图像中的信息熵,使 DNN 更难提取对样本分类所需的信息。强度变化的绝对值越大,DNN 对误分类的置信度越高。

5.2. Hardness and defense mechanisms

5.2.硬度和防御机制

Looking at the previous experiment, about 2.9% of the 270,000 adversarial samples were not successfully crafted. This suggests that some samples are harder to exploit than others. Furthermore, the distortion figures reported are averaged over all adversarial samples produced, but not all samples require the same distortion to be misclassified. Thus, we now study the hardness of different samples in order to quantify these phenomena. Our aim is to identify which source-target class pairs are easiest to exploit, as well as similarities between distinct source-target class pairs. A class pair is a pair of a source class s and a target class t. This hardness metric allows us to lay the groundwork for defense mechanisms.

回顾之前的实验,270,000 个对抗样本中约有 2.9% 没有制作成功。这表明有些样本比其他样本更难利用。此外,报告的失真数值是对所有对抗样本求平均得到的,但并非所有样本都需要相同的失真才能被误分类。因此,我们现在研究不同样本的硬度,以量化这些现象。我们的目标是确定哪些源-目标类对最容易被利用,以及不同源-目标类对之间的相似性。一个类对是由源类 s 和目标类 t 组成的一对。这一硬度度量为防御机制奠定了基础。

5.2.1. Class pair study. From this experiment, we obtain a deeper understanding of the crafting success rate and average distortion for different source-target class pairs. We use the 90,000 adversarial samples crafted in the previous experiments from the 10,000 samples of the MNIST test set.

5.2.1. 类对研究。通过这个实验,我们对不同源-目标类对的制作成功率和平均失真有了更深入的理解。我们使用先前实验中从 MNIST 测试集的 10,000 个样本制作的 90,000 个对抗样本。

We break down the success rate τ reported in Figure 11 by source-target class pairs. This allows us to know, for a given source class, how many samples of that class were successfully misclassified in each of the target classes. In Figure 12, we draw the success rate matrix indicating which pairs are most successful. Darker shades correspond to higher success rates. Rows correspond to success rates per source class while columns correspond to success rates per target class. If one reads the matrix row-wise, it can be perceived that classes 0, 2, and 8 are hard to start with, while classes 1, 7, and 9 are easy to start with. Similarly, reading column-wise, one can observe that classes 1 and 7 are hard to make, while classes 0, 8, and 9 are easy to make.

我们将图 11 中报告的成功率 τ 按源-目标类对进行分解。这使我们能够知道,对于给定的源类,该类有多少样本被成功地误分类到每个目标类中。在图 12 中,我们绘制了成功率矩阵,指出哪些类对最容易成功;颜色越深,成功率越高。行对应每个源类的成功率,列对应每个目标类的成功率。按行读取矩阵可以看出,以类 0、2 和 8 作为源类较难,而以类 1、7 和 9 作为源类较容易;类似地,按列读取可以发现,类 1 和 7 较难作为目标类,而类 0、8 和 9 较容易作为目标类。
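A minimal sketch of how such a per-class-pair success-rate matrix could be assembled, assuming a hypothetical record format of (source, target, success) tuples:

```python
import numpy as np

def success_rate_matrix(records, num_classes=10):
    """Sketch of the per-class-pair breakdown: rows are source classes,
    columns are target classes. `records` is an assumed list of
    (source_class, target_class, success) tuples, one per crafted sample."""
    counts = np.zeros((num_classes, num_classes))
    successes = np.zeros((num_classes, num_classes))
    for s, t, ok in records:
        counts[s, t] += 1
        successes[s, t] += bool(ok)
    # Per-pair success rate; NaN where a pair was never attempted.
    return np.divide(successes, counts,
                     out=np.full_like(successes, np.nan), where=counts > 0)
```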

In Figure 13, we report the average distortion ε of successful samples by source-target class pair, thus identifying class pairs requiring the most distortion to successfully craft adversarial samples. As expected, classes requiring lower distortions correspond to classes with higher success rates in Figure 12. For instance, the column corresponding to class 1 contains the highest distortions, and it was the column with the least success rates in Figure 12. Indeed, the higher the average distortion of a class pair is, the more likely samples in that class pair are to reach the maximum distortion, and thus produce unsuccessful adversarial samples.

在图 13 中,我们按源-目标类对报告成功样本的平均失真 ε,从而识别出成功制作对抗样本所需失真最大的类对。正如预期的那样,所需失真较低的类对应于图 12 中成功率较高的类。例如,对应类 1 的列包含最高的失真,而它也是图 12 中成功率最低的列。的确,一个类对的平均失真越高,该类对中的样本就越有可能达到最大失真,从而产生不成功的对抗样本。

To better understand why some class pairs were harder to exploit, we tracked the evolution of class probabilities during the crafting process. We observed that the distortion required to leave the source class was higher for class pairs with high distortions, whereas the distortion required to reach the target class, once the source class had been left, remained similar. This correlates with the fact that some source classes are more confidently classified by the DNN than others.

为了更好地理解为什么某些类对更难利用,我们跟踪了制作过程中类概率的演变。我们观察到,对于失真较高的类对,离开源类所需的失真更高;而一旦离开源类,到达目标类所需的失真则大致相同。这与以下事实相关:DNN 对某些源类的分类比对其他源类更有信心。

5.2.2. Hardness measure. Results indicating that some source-target class pairs are not as easy to exploit as others lead us to question the existence of a measure quantifying the distance between two classes. This is relevant to a defender seeking to identify which classes of a DNN are most vulnerable to adversaries. We name this measure the hardness of a target class relative to a given source class. It normalizes the average distortion of a class pair (s, t) relative to its success rate:

5.2.2. 硬度度量。某些源-目标类对不如其他类对容易利用的结果,促使我们思考是否存在一种量化两个类之间距离的度量。这对于希望识别 DNN 中哪些类最容易受到攻击的防御者很有意义。我们将这一度量称为目标类相对于给定源类的硬度,它将类对 (s, t) 的平均失真相对于其成功率进行归一化:
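A reconstruction of this definition, consistent with the description above and with the quantities named in the next paragraph (the equation itself is not reproduced in this copy, so the notation may differ slightly from the original paper):

```latex
H(s, t) = \int_{\tau} \varepsilon(s, t, \tau) \, d\tau
```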

where ε(s, t, τ) is the average distortion of a set of samples for the corresponding success rate τ. In practice, these two quantities are computed over a finite number of samples by fixing a set of K maximum distortion parameter values Υ_k in the crafting algorithm, where k ∈ 1..K. The set of maximum distortions gives a series of pairs (ε_k, τ_k) for k ∈ 1..K. Thus, the practical formula used to compute the hardness of a source-destination class pair can be derived from the trapezoidal rule:

其中 ε(s, t, τ) 是一组样本在对应成功率 τ 下的平均失真。在实践中,这两个量是在有限数量的样本上计算的:在制作算法中固定一组 K 个最大失真参数值 Υ_k,其中 k ∈ 1..K。这组最大失真给出了一系列数对 (ε_k, τ_k),k ∈ 1..K。因此,计算源-目标类对硬度的实用公式可以由梯形法则导出:
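Correspondingly, a hedged reconstruction of the trapezoidal-rule approximation over the series of pairs (ε_k, τ_k):

```latex
H(s, t) \approx \sum_{k=1}^{K-1} \frac{\varepsilon_{k+1} + \varepsilon_k}{2} \, \big(\tau_{k+1} - \tau_k\big)
```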

We computed the hardness values for all classes using a set of K = 9 maximum distortion values Υ ∈ {0.3, 1.3, 2.6, 5.1, 7.7, 10.2, 12.8, 25.5, 38.3}% in the algorithm. Average distortions ε and success rates τ are averaged over 9,000 adversarial samples for each maximum distortion value Υ. Figure 14 shows the hardness values H(s, t) for all pairs (s, t) ∈ {0..9}². Note that the matrix has a shape similar to the average distortion matrix plotted in Figure 13. However, the hardness measure is more accurate because it is plotted using a series of maximum distortions.

我们在算法中使用一组 K = 9 个最大失真值 Υ ∈ {0.3, 1.3, 2.6, 5.1, 7.7, 10.2, 12.8, 25.5, 38.3}% 来计算所有类的硬度值。对于每个最大失真值 Υ,平均失真 ε 和成功率 τ 是在 9,000 个对抗样本上平均得到的。图 14 显示了所有类对 (s, t) ∈ {0..9}² 的硬度值 H(s, t)。注意,该矩阵的形状与图 13 中绘制的平均失真矩阵相似;然而,硬度度量更为准确,因为它是基于一系列最大失真绘制的。
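As a small illustration of how such a hardness value could be computed in practice, the sketch below applies the trapezoidal rule to a hypothetical series of (τ_k, ε_k) pairs; the numbers are made up for demonstration and are not the paper's measurements:

```python
import numpy as np

# Hypothetical (tau_k, epsilon_k) series for one class pair, obtained by sweeping
# the maximum-distortion parameter; these values are illustrative, not measured.
tau = np.array([0.10, 0.35, 0.60, 0.80, 0.90, 0.95, 0.97, 0.99, 1.00])
eps = np.array([0.003, 0.010, 0.018, 0.028, 0.035, 0.040, 0.044, 0.060, 0.080])

# Trapezoidal-rule hardness of this (source, target) class pair.
hardness = np.trapz(eps, tau)
print(f"H(s, t) = {hardness:.4f}")
```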

5.2.3. Adversarial distance. The measure just introduced lays the groundwork for finding defenses against adversarial samples. Indeed, if the hardness measure were predictive instead of being computed after adversarial crafting, the defender could identify vulnerable inputs. Furthermore, a predictive measure applicable to a single sample would allow a defender to evaluate the vulnerability of specific samples as well as class pairs. We investigated several complex estimators, including convolutional transformations of the forward derivative or Hessian matrices. However, we found that simply using a formula derived from the intuition behind adversarial saliency maps gave good accuracy for predicting the hardness of samples in our experimental setup.

5.2.3. 对抗距离。上面引入的度量为寻找针对对抗样本的防御奠定了基础。事实上,如果硬度度量是预测性的,而不是在对抗样本制作完成之后才计算出来,防御者就可以识别出脆弱的输入。此外,一个适用于单个样本的预测性度量将使防御者既能评估特定样本的脆弱性,也能评估类对的脆弱性。我们研究了若干较复杂的估计量,包括前向导数或 Hessian 矩阵的卷积变换。然而我们发现,在我们的实验设置中,简单地使用由对抗显著性映射背后的直觉导出的公式,就能较好地预测样本的硬度。

We name this predictive measure the adversarial distance of sample X to class t and write it A(X, t). Simply put, it estimates the distance between a sample X and a target class t. We define the distance as:

我们将这一预测性度量命名为样本 X 到类 t 的对抗距离,记为 A(X, t)。简单地说,它估计样本 X 与目标类 t 之间的距离。我们将该距离定义为:
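A reconstruction of the adversarial distance consistent with the description in the next paragraph, where M denotes the number of input features and S(X, t) the adversarial saliency map computed at the first crafting iteration (the equation is not reproduced in this copy, so the notation may differ from the original):

```latex
A(X, t) = 1 - \frac{1}{M} \sum_{i} \mathbf{1}_{S(X, t)[i] > 0}
```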

where 1_E is the indicator function for event E (i.e., it is 1 if E is true). In a nutshell, A(X, t) is the normalized number of non-zero elements in the adversarial saliency map of X computed during the first crafting iteration in Algorithm 2. The closer the adversarial distance is to 1, the harder it is likely to be to misclassify sample X into target class t. Figure 15 confirms that this formula is empirically well-founded. It illustrates the value of the adversarial distance averaged over source-destination class pairs, making it easy to compare the average value with the hardness matrix computed previously after crafting samples. To compute it, we altered Equation 13 to sum over pairs of features, reflecting the observations made during our validation process.

其中 1_E 是事件 E 的指示函数(即当 E 为真时取值为 1)。简而言之,A(X, t) 是在算法 2 的第一次制作迭代中计算出的 X 的对抗显著性映射中非零元素的归一化数量。对抗距离越接近 1,样本 X 就越难被误分类到目标类 t。图 15 证实了该公式在经验上是有根据的:它展示了在源-目标类对上平均的对抗距离值,便于将该平均值与之前制作样本后计算出的硬度矩阵进行比较。为了计算它,我们将方程 13 修改为对特征对求和,以反映我们在验证过程中所做的观察。

This notion of distance between classes intuitively defines a metric for the robustness of a DNN F against adversarial perturbations. We suggest the following definition:

类间距离的概念直观地定义了DNN F对敌对扰动的鲁棒性度量。我们建议以下定义:
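A plausible reading of this definition, written here only as a hedged sketch (the original equation is not reproduced in this copy, and the exact operands may differ):

```latex
R(F) = \min_{X,\; t \neq \mathrm{label}(X)} A(X, t)
```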

where the set of samples X considered is sufficiently large to represent the input domain of the network. A good approximation of robustness can be computed with the training dataset. Note that the min operator used here can be replaced by other relevant operators, like the statistical expectation. The study of various operators is left as future work.

其中所考虑的样本集合 X 要足够大,以代表网络的输入域。利用训练数据集就可以计算出鲁棒性的一个良好近似。注意,这里使用的 min 算子可以替换为其他合适的算子,例如统计期望。对各种算子的研究留作未来工作。

5.3. Human perception of adversarial samples

5.3. 人类对敌对样本的感知

Recall that adversarial samples must not only be misclassified as the target class by DNNs, but also visually appear (be classified) as the source class by humans. To evaluate this property, we ran an experiment using 349 human participants on the Mechanical Turk online service. We presented 3 original or adversarially altered samples from the MNIST dataset to human participants. To paraphrase, participants were asked for each sample: (a) ‘is this sample a numeric digit?’, and (b) ‘if yes to (a), what digit is it?’. These two questions were designed to determine how distortion and intensity rates affected human perception of the samples.

回想一下,对抗样本不仅要被 DNN 误分类为目标类,还必须在人眼中看起来(被分类)为源类。为了评估这一特性,我们在 Mechanical Turk 在线服务上对 349 名人类参与者进行了实验。我们向参与者展示来自 MNIST 数据集的 3 个原始样本或经过对抗性修改的样本。概括地说,对每个样本,参与者被问到:(a)“这个样本是一个数字吗?”;(b)“如果 (a) 为是,它是哪个数字?”。设计这两个问题是为了确定失真率和强度变化如何影响人类对样本的感知。

The first experiment was designed to identify a baseline perception rate for the input data. The 74 participants were presented 3 of 222 unaltered samples randomly picked from the original MNIST data set. Respondents identified 97.4% as digits and classified correctly 95.3% of the samples.

第一个实验旨在确定输入数据的基线感知率。74 名参与者每人看到从原始 MNIST 数据集中随机抽取的 222 个未修改样本中的 3 个。被调查者将 97.4% 的样本识别为数字,并正确分类了 95.3% 的样本。

As shown in Figure 16, a second experiment attempted to evaluate how distortion (ε) impacts human perception. Here, 184 participants were presented with a total of 1,707 samples with varying levels of distortion (and features altered with an intensity increase θ = +1). The experiments showed that below a threshold distortion (ε = 14.29%), participants were able to identify samples as digits (95%) and correctly classify them (90%), only slightly less accurately than the unaltered samples. The classification rate dropped dramatically (71%) at distortion rates above the threshold.

如图 16 所示,第二个实验试图评估失真 (ε) 如何影响人类的感知。这里,184 名参与者总共看到了 1,707 个具有不同失真水平的样本(被修改的特征以 θ = +1 的强度增加)。实验表明,在失真低于阈值 (ε = 14.29%) 时,参与者能够将样本识别为数字 (95%) 并正确分类 (90%),准确率仅比未修改样本略低。当失真超过该阈值时,分类率急剧下降 (71%)。

A final set of experiments evaluated the impact of intensity variations (θ) on perception, as shown in Figure 17. The 203 participants were accurate at identifying 5,355 samples as digits (96%) and classifying them correctly (95%). At higher absolute intensities (θ = −1 and θ = +1), specific digit classification decreased slightly (90.5% and 90%), but identification as digits was largely unchanged.

最后一组实验评估了强度变化 (θ) 对感知的影响,如图 17 所示。203 名参与者总共对 5,355 个样本进行了判断,准确地将其识别为数字 (96%) 并正确分类 (95%)。在较高的绝对强度下 (θ = −1 和 θ = +1),具体数字的分类率略有下降(90.5% 和 90%),但识别为数字的比例基本不变。

While preliminary, these experiments confirm that the overwhelming majority of generated samples retain human recognizability. Note that because we can generate samples below the distortion threshold for almost all of the input data (ε ≤ 14.29% for roughly 97% of the MNIST data), we can produce adversarial samples that humans will not detect, thus meeting our adversarial goal. Furthermore, limiting intensity variations provides even better results: at −0.7 ≤ θ ≤ +0.7, humans classified the sample data at essentially the same rates as the original sample data.

虽然只是初步实验,但这些实验证实了绝大多数生成的样本仍能被人类正确辨认。注意,由于我们几乎可以为所有输入数据生成失真低于阈值的样本(MNIST 数据中约 97% 的样本满足 ε ≤ 14.29%),我们可以生成人类察觉不到的对抗样本,从而达成我们的对抗目标。此外,限制强度变化能得到更好的结果:在 −0.7 ≤ θ ≤ +0.7 时,人类对样本数据的分类率与对原始样本数据的分类率基本相同。

6. Discussion

6. 讨论

We introduced a new class of algorithms that systematically craft adversarial samples so as to cause a DNN to misclassify the sample, assuming that the adversary possesses knowledge of the DNN architecture. Although we focused our work on DL techniques used in the context of classification and trained with supervised methods, our approach is also applicable to unsupervised architectures. Instead of achieving a given target class, the adversary achieves a target output Y∗. Because the output space is more complex, it might be harder or impossible to match Y∗. In that case, Equation 1 would need to be relaxed with an acceptable distance between the network output F(X∗) and the adversarial target Y∗. Thus, the only remaining assumption made in this paper is that DNNs are feedforward. In other words, we did not consider recurrent neural networks, as the forward derivative must be adapted to accommodate such networks with cycles.

我们引入了一类新算法,在假设攻击者掌握 DNN 架构知识的前提下,系统地制作对抗样本,使 DNN 对样本进行误分类。虽然我们的工作聚焦于分类场景下、以监督方式训练的深度学习技术,但我们的方法同样适用于无监督架构。此时攻击者要实现的不是某个给定的目标类,而是一个目标输出 Y∗。由于输出空间更为复杂,精确匹配 Y∗ 可能更难甚至不可能。在这种情况下,需要放宽方程 1,允许网络输出 F(X∗) 与对抗目标 Y∗ 之间存在一个可接受的距离。因此,本文剩下的唯一假设是 DNN 为前馈网络;换句话说,我们没有考虑递归神经网络,因为前向导数必须经过调整才能适用于这类带有循环的网络。

One of our key results is reducing the distortion—the number of features altered—to craft adversarial samples, compared to previous work. We believe this makes adversarial crafting much easier for input domains like malware executables, which are not as easy to perturb as images [10], [15]. This distortion reduction comes with a performance cost. Indeed, more elaborate but accurate saliency map formulae are more expensive to compute for the attacker. We would like to emphasize that our method’s high success rate can be further improved by adversaries only interested in crafting a limited number of samples. Indeed, to lower the distortion of one particular sample, an adversary can use adversarial saliency maps to fine-tune the perturbation introduced. On the other hand, if an adversary wants to craft large amounts of adversarial samples, performance is important. In our evaluation, we balanced these factors to craft adversarial samples against the DNN in less than a second. As far as our algorithm implementation was concerned, the most computationally expensive steps were the matrix manipulations required to construct adversarial saliency maps from the forward derivative matrix. The complexity is dependent on the number of input features. These matrix operations can be made more efficient, notably by making better use of GPU-accelerated computations.

与之前的工作相比,我们的关键成果之一是降低了制作对抗样本所需的失真,即被修改的特征数量。我们认为,这使得在恶意软件可执行文件等不像图像那样容易扰动的输入域 [10]、[15] 中进行对抗样本制作变得容易得多。失真的降低是有性能代价的:更精细、更准确的显著性映射公式对攻击者来说计算开销更大。需要强调的是,只想制作少量样本的攻击者还可以进一步提高我们方法的成功率:为了降低某个特定样本的失真,攻击者可以利用对抗显著性映射来微调引入的扰动。另一方面,如果攻击者想制作大量对抗样本,性能就很重要。在我们的评估中,我们对这些因素进行了权衡,可以在不到一秒的时间内针对该 DNN 制作出对抗样本。就我们的算法实现而言,计算开销最大的步骤是从前向导数矩阵构造对抗显著性映射所需的矩阵运算,其复杂度取决于输入特征的数量。这些矩阵运算还可以变得更高效,特别是通过更好地利用 GPU 加速计算。
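As an illustration of the kind of matrix manipulation involved, the sketch below builds a pairwise saliency score matrix from a class-by-feature Jacobian using broadcasting; it follows the sign constraints for an intensity increase but is only a sketch, and the paper's exact implementation may differ in details:

```python
import numpy as np

def pairwise_saliency(jacobian, target):
    """Sketch of the pairwise saliency-map construction as one vectorized
    matrix manipulation. `jacobian` has shape (num_classes, num_features)
    and holds the forward derivative dF_j/dX_i at the current sample."""
    alpha = jacobian[target]                    # dF_t/dX_i
    beta = jacobian.sum(axis=0) - alpha         # sum over j != t of dF_j/dX_i

    # Pairwise sums via broadcasting: entry (p, q) aggregates pixels p and q.
    alpha_pq = alpha[:, None] + alpha[None, :]
    beta_pq = beta[:, None] + beta[None, :]

    valid = (alpha_pq > 0) & (beta_pq < 0)      # constraints for an intensity increase
    scores = np.where(valid, -alpha_pq * beta_pq, 0.0)
    np.fill_diagonal(scores, 0.0)               # a pixel cannot pair with itself
    return scores
```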

Our efforts so far represent a first but meaningful step towards mitigating adversarial samples: the hardness and adversarial distance metrics lay the groundwork for defense mechanisms. Although designing such defenses is outside the scope of this paper, we outline two approaches: (1) adversarial sample detection and (2) DNN robustness improvements.

到目前为止,我们的工作朝着缓解对抗样本问题迈出了初步但有意义的一步:硬度和对抗距离这两个度量为防御机制奠定了基础。虽然设计此类防御超出了本文的范围,但我们概述了两种方法:(1)对抗样本检测;(2)提升 DNN 的鲁棒性。

Developing techniques for adversarial sample detection is a reactive solution. During our experimental process, we noticed that adversarial samples can for instance be detected by evaluating the regularity of samples. More specifically, in our application example, the sum of the squared difference between each pair of neighboring pixels is always higher for adversarial samples than for benign samples. However, there is no a priori reason to assume that this technique will reliably detect adversarial samples in different settings, so extending this approach is one avenue for future work. Another approach was proposed in [18], but it is unsuccessful as by stacking the denoising auto-encoder used for detection with the original DNN, the adversary can again produce adversarial samples.

开发对抗样本检测技术是一种被动式的解决方案。在实验过程中,我们注意到对抗样本可以通过评估样本的规整性来检测:具体到我们的应用示例中,对抗样本里每对相邻像素之差的平方和总是高于良性样本。然而,没有先验理由认为这种技术能在不同设置下可靠地检测对抗样本,因此扩展这一方法是未来工作的一个方向。[18] 中提出了另一种方法,但它并不成功:攻击者只需把用于检测的去噪自编码器与原始 DNN 叠加起来,就能再次生成对抗样本。
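A minimal sketch of the regularity measure described above, assuming 2-D images with pixel values in [0, 1]:

```python
import numpy as np

def neighbor_smoothness(image):
    """Regularity measure sketched above: the sum of squared differences between
    each pair of horizontally and vertically adjacent pixels. `image` is assumed
    to be a 2-D array (e.g. a 28x28 MNIST digit with values in [0, 1])."""
    dx = np.diff(image, axis=1)   # horizontal neighbor differences
    dy = np.diff(image, axis=0)   # vertical neighbor differences
    return float((dx ** 2).sum() + (dy ** 2).sum())

# A possible (hypothetical) detection rule: flag inputs whose score exceeds a
# threshold calibrated on benign training data.
```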

The second class of solutions seeks to improve training to increase the robustness of DNNs. Interestingly, the problem of adversarial samples is closely linked to training. Work on generative adversarial networks showed that a two player game between two DNNs can lead to the generation of new samples from a training set [16]. Furthermore, adding adversarial samples to the training set can act like a regularizer [17]. We also observed in our experiments that training with adversarial samples makes crafting additional adversarial samples harder. Indeed, by adding 18,000 adversarial samples to the original MNIST training dataset, we trained a new instance of our DNN. We then crafted a set of 9,000 adversarial samples with this newly trained network. Preliminary analysis of these samples crafted showed that the success rate was reduced by 7.2% while the average distortion increased by 37.5%, suggesting that training with adversarial samples makes DNNs more robust.

第二类解决方案旨在改进训练以提高 DNN 的鲁棒性。有趣的是,对抗样本问题与训练密切相关。关于生成对抗网络的研究表明,两个 DNN 之间的双人博弈可以从训练集中生成新的样本 [16]。此外,向训练集中加入对抗样本可以起到正则化器的作用 [17]。我们在实验中也观察到,使用对抗样本进行训练会使制作新的对抗样本变得更加困难。具体地,通过向原始 MNIST 训练数据集添加 18,000 个对抗样本,我们训练了 DNN 的一个新实例,然后用这个新训练的网络制作了 9,000 个对抗样本。对这些样本的初步分析表明,成功率降低了 7.2%,而平均失真增加了 37.5%,这表明使用对抗样本进行训练可以使 DNN 更加鲁棒。
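A sketch of the retraining experiment described above, with hypothetical `model_fn`, `train_fn`, and `craft_fn` helpers standing in for the actual model-building, training, and crafting code; keeping the legitimate source labels for the added samples is an assumption:

```python
import numpy as np

def adversarial_retraining(model_fn, train_fn, craft_fn, X_train, y_train, n_adv=18000):
    """Sketch of adversarial training: augment the training set with crafted
    samples and retrain a fresh DNN instance. The three helper callables are
    assumptions, not part of any real API."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X_train), size=n_adv, replace=False)

    X_adv, y_adv = [], []
    for i in idx:
        t = int((y_train[i] + rng.integers(1, 10)) % 10)  # any class != source class
        X_adv.append(craft_fn(X_train[i], target=t))
        y_adv.append(y_train[i])      # keep the legitimate source label (assumption)

    X_aug = np.concatenate([X_train, np.stack(X_adv)])
    y_aug = np.concatenate([y_train, np.array(y_adv)])
    return train_fn(model_fn(), X_aug, y_aug)     # retrain a new DNN instance
```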

7. Related Work

7. 相关工作

The security of machine learning [1] is an active research topic within the security and machine learning communities. A broad taxonomy of attacks and required adversarial capabilities are discussed in [21] and [2], along with considerations for building defense mechanisms. Biggio et al. studied classifiers in adversarial settings and outlined a framework for securing them [7]. However, their work does not consider DNNs but rather other techniques used for binary classification, like logistic regression or Support Vector Machines. Generally speaking, attacks against machine learning can be separated into two categories, depending on whether they are executed during training [8], [9] or at test time [5].

机器学习的安全性 [1] 是安全和机器学习社区中一个活跃的研究课题。[21] 和 [2] 中讨论了攻击的广泛分类和所需的对抗能力,以及构建防御机制的注意事项。Biggio 等人研究了对抗环境下的分类器,并提出了一个保障其安全的框架 [7]。然而,他们的工作没有考虑 DNN,而是考虑了逻辑回归或支持向量机等用于二分类的其他技术。一般来说,针对机器学习的攻击可以分为两类,取决于它们是在训练阶段 [8]、[9] 执行,还是在测试阶段 [5] 执行。

Prior work on adversarial sample crafting against DNNs developed a simple technique corresponding to the Architecture and Training Tools threat model, based on gradients used for DNN training [17], [30], [35]. This approach creates adversarial samples by defining an optimization problem based on the DNN’s cost function. In other words, instead of computing gradients to update DNN weights, one computes gradients to update the input, which is then misclassified as the target class by a DNN. The alternative approach proposed in this paper is to identify input regions that are most relevant to its classification by a DNN. This is accomplished by computing the saliency map of a given input, as described by Simonyan et al. in the case of DNNs handling images [33]. We extended this concept to create adversarial saliency maps highlighting input regions that need to be perturbed to accomplish the adversarial goal.

先前针对 DNN 的对抗样本制作工作提出了一种对应于“架构与训练工具”威胁模型的简单技术,它基于 DNN 训练中使用的梯度 [17]、[30]、[35]。这种方法通过定义一个基于 DNN 代价函数的优化问题来生成对抗样本:换句话说,不是计算梯度来更新 DNN 的权重,而是计算梯度来更新输入,使其被 DNN 误分类为目标类。本文提出的替代方法是识别对 DNN 分类结果最重要的输入区域,这是通过计算给定输入的显著性映射来实现的,正如 Simonyan 等人针对处理图像的 DNN 所描述的那样 [33]。我们扩展了这一概念,构造了对抗显著性映射,用以突出为实现对抗目标而需要扰动的输入区域。

Previous work by Yosinski et al. investigated how features are transferable between DNNs [37], while Szegedy et al. showed that adversarial samples can indeed be misclassified across models [35]. They report that once an adversarial sample is generated for a given neural network architecture, it is also likely to be misclassified in neural networks designed differently, which explains why the attack is successful. However, the effectiveness of this kind of attack depends on (1) the quality and size of the surrogate dataset collected by the adversary, and (2) the adequateness of the adversarial network used to craft adversarial samples.

Yosinski 等人先前的工作研究了特征在 DNN 之间的可迁移性 [37],而 Szegedy 等人表明对抗样本确实可以跨模型被误分类 [35]。他们报告称,一旦针对某个给定的神经网络架构生成了对抗样本,该样本也很可能被设计不同的神经网络误分类,这解释了这种攻击为何能成功。然而,这类攻击的有效性取决于:(1)攻击者收集的替代数据集的质量和规模;(2)用于制作对抗样本的对抗网络是否足够合适。

8. Conclusions

8.结论

Broadly speaking, this paper has explored adversarial behavior in deep learning systems. In addition to exploring the goals and capabilities of DNN adversaries, we introduced a new class of algorithms to craft adversarial samples based on computing forward derivatives. This technique allows an adversary with knowledge of the DNN architecture to construct adversarial saliency maps identifying features of the input that most significantly impact DNN outputs. These algorithms can reliably produce samples correctly classified by human subjects but misclassified in specific targets by a DNN with a 97% adversarial success rate while only modifying on average 4.02% of the input features per sample.

广义上讲,本文探索了深度学习系统中的对抗行为。除了探讨 DNN 攻击者的目标和能力之外,我们还引入了一类基于前向导数计算来制作对抗样本的新算法。该技术使掌握 DNN 架构的攻击者能够构造对抗显著性映射,识别对 DNN 输出影响最大的输入特征。这些算法能够可靠地生成被人类正确分类、却被 DNN 误分类为特定目标类的样本,对抗成功率达 97%,而每个样本平均只需修改 4.02% 的输入特征。

Solutions to defend DNNs against adversaries can be divided into two classes: detecting adversarial samples and improving the training phase. The detection of adversarial samples remains an open problem. Interestingly, the universal approximation theorem formulated by Hornik et al. states that a single hidden layer is sufficient to represent a function to an arbitrary degree of accuracy [20]. Thus, one can conceive that improving training is key to resisting adversarial samples.

针对DNNs的敌手防御方案可分为两类:检测敌手样本和改进训练阶段。对抗性样本的检测仍然是一个有待解决的问题。有趣的是,由Hornik等人提出的通用逼近定理表明,一个隐含层足以任意准确地表示一个函数[20]。因此,我们可以认为改进训练是抵御对抗样本的关键。

In future work, we plan to address the limitations of DNNs trained in an unsupervised manner, as well as recurrent neural networks (as opposed to the feedforward networks considered throughout this paper). Also, as most models of our taxonomy have yet to be researched, this leaves room for further investigation of DL in various adversarial settings.

在未来的工作中,我们计划解决以无监督方式训练的DNN以及递归神经网络(与本文所考虑的前馈网络相反)的局限性。此外,由于我们分类法的大多数模型还有待研究,这就为在各种敌对设置中进一步研究DL留下了空间。

Acknowledgment

致谢

The authors would like to warmly thank Dr. Damien Octeau and Aline Papernot for insightful discussions about this work. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

作者要热忱感谢 Dr. Damien Octeau 和 Aline Papernot 对这项工作的深入讨论。该研究由陆军研究实验室(Army Research Laboratory)资助,在编号 W911NF-13-2-0045(ARL Cyber Security CRA)的合作协议下完成。本文件中包含的观点和结论仅代表作者本人,不应被解释为代表陆军研究实验室或美国政府的官方政策(无论明示或暗示)。尽管本文带有版权标记,美国政府仍被授权为政府目的复制和分发重印本。

References

参考文献

[1] M. Barreno, B. Nelson, A. D. Joseph, and J. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.

[2] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, computer and communications security, pages 16–25. ACM, 2006.

[3] Y. Bengio. Learning deep architectures for AI. Foundations and trends in Machine Learning, 2(1):1–127, 2009.

[4] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4, page 3. Austin, TX, 2010.

[5] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.

[6] B. Biggio, G. Fumera, and F. Roli. Pattern recognition systems under attack: Design issues and research challenges. International Journal of Pattern Recognition and Artificial Intelligence, 28(07):1460002, 2014.

[7] B. Biggio, G. Fumera, and F. Roli. Security evaluation of pattern classifiers under attack. IEEE Transactions on Knowledge and Data Engineering, 26(4):984–996, 2014.

[8] B. Biggio, B. Nelson, and P. Laskov. Support vector machines under adversarial label noise. In ACML, pages 97–112, 2011.

[9] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[10] B. Biggio, K. Rieck, D. Ariu, C. Wressnegger, I. Corona, G. Giacinto, and F. Roli. Poisoning behavioral malware clustering. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, pages 27–36. ACM, 2014.

[11] D. Cireşan, U. Meier, J. Masci, et al. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.

[12] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.

[13] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu. Large-scale malware classification using random projections and neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3422–3426. IEEE, 2013.

[14] G. E. Dahl, D. Yu, et al. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.

[15] P. Fogla and W. Lee. Evading network anomaly detection systems: formal reasoning and practical techniques. In Proceedings of the 13th ACM conference on Computer and communications security, pages 59–68. ACM, 2006.

[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[17] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations. Computational and Biological Learning Society, 2015.

[18] S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations. Computational and Biological Learning Society, 2015.

[19] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[20] K. Hornik, M. Stinchcombe, et al. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.

[21] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar. Adversarial machine learning. In Proceedings of the 4th ACM workshop on security and artificial intelligence, pages 43–58. ACM, 2011.

[22] C. Kaufman, R. Perlman, and M. Speciner. Network security: private communication in a public world. Prentice Hall Press, 2002.

[23] E. Knorr. How paypal beats the bad guys with machine learning. http://www.infoworld.com/article/2907877/machine-learning/howpaypal-reduces-fraud-with-machine-learning.html, 2015.

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[25] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10:1–40, 2009.

[26] Y. LeCun, L. Bottou, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[27] Y. LeCun and C. Cortes. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 1998.

[28] LISA lab. http://deeplearning.net/tutorial/lenet.html, 2010.

[29] K. P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.

[30] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Computer Vision and Pattern Recognition, 2015.

[31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive modeling, 5, 1988.

[32] H. Sak, A. Senior, and F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2014.

[33] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.

[35] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In Proceedings of the 2014 International Conference on Learning Representations. Computational and Biological Learning Society, 2014.

[36] Y. Taigman, M. Yang, et al. Deepface: Closing the gap to human-level performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.

[37] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
