Defense against Universal Adversarial Perturbations
GB/T 7714 Akhtar N, Liu J, Mian A. Defense against universal adversarial perturbations[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 3389-3398.
MLA Akhtar, Naveed, Jian Liu, and Ajmal Mian. "Defense against universal adversarial perturbations." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
APA Akhtar, N., Liu, J., & Mian, A. (2018). Defense against universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3389-3398).
Recent advances in Deep Learning show the existence of image-agnostic quasi-imperceptible perturbations that when applied to ‘any’ image can fool a state-of-the-art network classifier to change its prediction about the image label. These ‘Universal Adversarial Perturbations’ pose a serious threat to the success of Deep Learning in practice. We present the first dedicated framework to effectively defend the networks against such perturbations. Our approach learns a Perturbation Rectifying Network (PRN) as ‘pre-input’ layers to a targeted model, such that the targeted model needs no modification. The PRN is learned from real and synthetic image-agnostic perturbations, where an efficient method to compute the latter is also proposed. A perturbation detector is separately trained on the Discrete Cosine Transform of the input-output difference of the PRN. A query image is first passed through the PRN and verified by the detector. If a perturbation is detected, the output of the PRN is used for label prediction instead of the actual image. A rigorous evaluation shows that our framework can defend the network classifiers against unseen adversarial perturbations in the real-world scenarios with up to 97.5% success rate. The PRN also generalizes well in the sense that training for one targeted network defends another network with a comparable success rate.
Deep Neural Networks are at the heart of the current advancements in Computer Vision and Pattern Recognition, providing state-of-the-art performance on many challenging classification tasks [9], [12], [14], [16], [36], [37]. However, Moosavi-Dezfooli et al. [25] recently showed the possibility of fooling the deep networks to change their prediction about ‘any’ image that is slightly perturbed with the Universal Adversarial Perturbations. For a given network model, these image-agnostic (hence universal) perturbations can be computed rather easily [25], [26]. The perturbations remain quasi-imperceptible (see Fig. 1), yet the adversarial examples generated by adding the perturbations to the images fool the networks with alarmingly high probabilities [25]. Furthermore, the fooling is able to generalize well across different network models.
Being image-agnostic, universal adversarial perturbations can be conveniently exploited to fool models on-the-fly on unseen images by using pre-computed perturbations. This even eradicates the need of on-board computational capacity that is needed for generating image-specific perturbations [7], [21]. This fact, along with the cross-model generalization of universal perturbations, makes them particularly relevant to the practical cases where a model is deployed in a possibly hostile environment. Thus, defense against these perturbations is a necessity for the success of Deep Learning in practice. The need for counter-measures against these perturbations becomes even more pronounced considering that the real-world scenes (e.g. sign boards on roads) modified by the adversarial perturbations can also behave as adversarial examples for the networks [17].
This work proposes the first dedicated defense against the universal adversarial perturbations [25]. The major contributions of this paper are as follows:
We propose to learn a Perturbation Rectifying Network (PRN) that is trained as the ‘pre-input’ of a targeted network model. This allows our framework to provide defense to already deployed networks without the need of modifying them.
We propose a method to efficiently compute synthetic image-agnostic adversarial perturbations to effectively train the PRN. The successful generation of these perturbations complements the theoretical findings of Moosavi-Dezfooli et al. [26].
We also propose a separate perturbation detector that is learned from the Discrete Cosine Transform of the image rectifications performed by the PRN for clean and perturbed examples.
Rigorous evaluation is performed by defending the GoogLeNet [37], CaffeNet [16] and VGG-F network [4], demonstrating up to 97.5% success rate on unseen images possibly modified with unseen perturbations. Our experiments also show that the proposed PRN generalizes well across different network models.
The robustness of image classifiers against adversarial perturbations has gained significant attention in the last few years [6], [7], [29], [32], [34], [35], [40]. Deep neural networks became the center of attention in this area after Szegedy et al. [39] first demonstrated the existence of adversarial perturbations for such networks. See [1] for a recent review of literature in this direction. Szegedy et al. [39] computed adversarial examples for the networks by adding quasi-imperceptible perturbations to the images, where the perturbations were estimated by maximizing the network’s prediction error. Although these perturbations were image-specific, it was shown that the same perturbed images were able to fool multiple network models. Szegedy et al. reported encouraging results for improving the model robustness against the adversarial attacks by using adversarial examples for training, a.k.a. adversarial training.
Goodfellow et al. [10] built on the findings in [39] and developed a ‘fast gradient sign method’ to efficiently generate adversarial examples that can be used for training the networks. They hypothesized that it is the linearity of the deep networks that makes them vulnerable to the adversarial perturbations. However, Tanay and Griffin [41] later constructed the image classes that do not suffer from the adversarial examples for the linear classifiers. Their arguments about the existence of the adversarial perturbations again point towards the over-fitting phenomena, that can be alleviated by regularization. Nevertheless, it remains unclear how a network should be regularized for robustness against adversarial examples.
Moosavi-Dezfooli et al. [27] proposed the DeepFool algorithm to compute image-specific adversarial perturbations by assuming that the loss function of the network is linearizable around the current training sample. In contrast to the one-step perturbation estimation [10], their approach computes the perturbation in an iterative manner. They also reported that augmenting training data with adversarial examples significantly increases the robustness of networks against the adversarial perturbations. Baluja and Fischer [2] trained an Adversarial Transformation Network to generate adversarial examples against a target network. Liu et al. [19] analyzed the transferability of adversarial examples. They studied this property for both targeted and non-targeted examples, and proposed an ensemble based approach to generate the examples with better transferability.
The above-mentioned techniques mainly focus on generating adversarial examples, and address the defense against those examples with adversarial training. In line with our take on the problem, a few recent techniques also focus directly on the defense against the adversarial examples. For instance, Luo et al. [22] mitigate the issues resulting from the adversarial perturbations using foveation. Their main argument is that the neural networks (for ImageNet [33]) are robust to the foveation-induced scale and translation variations of the images, however, this property does not generalize to the perturbation transformations.
Papernot et al. [30] used distillation [13] to make the neural networks more robust against the adversarial perturbations. However, Carlini and Wagner [3] later introduced adversarial attacks that can not be defended by the distillation method. Kurakin et al. [18] specifically studied the adversarial training for making large models (e.g. Inception v3 [38]) robust to perturbations, and found that the training indeed provides robustness against the perturbations generated by the one-step methods [10]. However, Tramer et al. [42] found that this robustness weakens for the adversarial examples learned using different networks i.e. for the black-box attacks [19]. Hence, ensemble adversarial training was proposed in [42] that uses adversarial examples generated by multiple networks.
Dziugaite et al. [5] studied the effects of JPG compression on adversarial examples and found that the compression can sometimes revert network fooling. Nevertheless, it was concluded that JPG compression alone is insufficient as a defense against adversarial attacks. Prakash et al. [31] took advantage of localization of the perturbed pixels in their defense. Lu et al. [20] proposed SafetyNet for detecting and rejecting adversarial examples for the conventional network classifiers (e.g. VGG19 [11]) that capitalizes on the late stage ReLUs of the network to detect the perturbed examples. Similarly, a proposal of appending the deep neural networks with detector subnetworks was also presented by Metzen et al. [23]. In addition to the classification, adversarial examples and robustness of the deep networks against them have also been recently investigated for the tasks of semantic segmentation and object detection [8], [21], [43].
Whereas the central topic of all the above-mentioned literature is the perturbations computed for individual images, Moosavi-Dezfooli et al. [25] were the first to show the existence of image-agnostic perturbations for neural networks. These perturbations were further analyzed in [26], whereas Metzen et al. [24] also showed their existence for semantic image segmentation. To date, no dedicated technique exists for defending the networks against the universal adversarial perturbations, which is the topic of this paper.
P_{I_{c}\sim \mathbb{S}_{c}}\big(C(I_{c}+\rho)\neq C(I_{c})\big)\geq\delta \quad \text{s.t.} \quad \lVert \rho \rVert_{p}\leq \xi,\tag{1}
where P(.) is the probability, ||.||p denotes the ℓp-norm of a vector such that p ∈ [1, ∞), δ ∈ (0, 1] denotes the fooling ratio and ξ is a pre-defined constant. In the text to follow, we alternatively refer to ρ as the perturbation for brevity.
We draw on the insights from the literature reviewed in Section 2 to develop a framework for defending a (possibly) targeted network model against universal adversarial perturbations. Figure 2 shows the schematics of our approach to learn the ‘rectifier’ and the ‘detector’ components of the defense framework. We use the Perturbation Rectifying Network (PRN) as the ‘rectifier’, whereas a binary classifier is eventually trained to detect the adversarial perturbations in the images. The framework uses both real and synthetic perturbations for training. The constituents of the proposed framework are explained below.
At the core of our technique is the Perturbation Rectifying Network (PRN), that is trained as pre-input layers to the targeted network classifier. The PRN is attached to the first layer of the classification network and the joint network is trained to minimize the following cost:
J(\theta_{p},b_{p})=\frac{1}{N}\sum\limits_{i=1}^{N}L(\ell_{i}^{\star},\ell_{i})\tag{2}
where ℓ*i and ℓi are the labels predicted by the joint network and the targeted network respectively, such that ℓi is necessarily computed for the clean image. For the N training examples, L(.) computes the loss, whereas θp and bp denote the PRN weight and bias parameters.
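To make Eq. (2) concrete, the following PyTorch sketch sets up a single training step; the names prn and target_net, and the way perturbations are supplied, are assumptions for illustration rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def prn_step(prn, target_net, clean_images, perturbations, optimizer):
    """One optimization step of Eq. (2).

    `optimizer` must be built over prn.parameters() only, so the (already
    deployed) targeted network is never modified. `perturbations` holds
    universal perturbations for part of the batch and zeros for the rest,
    standing in for the sampling described in Section 4.2.
    """
    target_net.eval()
    for p in target_net.parameters():              # keep the targeted model frozen
        p.requires_grad_(False)

    with torch.no_grad():                          # l_i: prediction on the clean image
        labels = target_net(clean_images).argmax(dim=1)

    rectified = prn(clean_images + perturbations)  # PRN acts as 'pre-input' layers
    logits = target_net(rectified)                 # l*_i: prediction of the joint network
    loss = F.cross_entropy(logits, labels)         # L(l*_i, l_i), averaged over the batch

    optimizer.zero_grad()
    loss.backward()                                # only the PRN parameters are updated
    optimizer.step()
    return loss.item()
```

Constructing the optimizer over the PRN parameters alone is what realizes the cost over (θp, bp) only, as discussed next.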
In Eq. (2) we define the cost over the parameters of PRN only, which ensures that the (already deployed) targeted network does not require any modification for the defense being provided by our framework. This strategy is orthogonal to the existing defense techniques that either update the targeted model using adversarial training to make the networks more robust [18], [42]; or incorporate architectural changes to the targeted network, which may include adding a subnetwork to the model [23] or tapping into the activations of certain layers to detect the adversarial examples [20]. Our defense mechanism acts as an external wrapper for the targeted network such that the PRN (and the detector) trained to counter the adversarial attacks can be kept secretive in order to refrain from potential counter-counter attacks. This is a highly desirable property of defense frameworks in the real-world scenarios. Moosavi-Dezfooli et al. [25] noted that the universal adversarial perturbations can still exist for a model even after its adversarial training. The proposed framework constitutionally caters for this problem.
Figure 2. Training schematics: From the clean data, image-agnostic perturbations are computed and augmented with the synthetic perturbations. Both clean and perturbed images are fed to the Perturbation Rectifying Network (PRN). The PRN is learned by attaching it to the first layer of the targeted network such that the parameters of the targeted network are kept frozen during the PRN training. The perturbation detection mechanism extracts discriminative features from the difference between the inputs and outputs of the PRN and learns a binary classifier. To classify an unseen test image Iρ/c, first D(Iρ/c) = B(F(Iρ/c − R(Iρ/c))) is computed. If a perturbation is detected then R(Iρ/c) is used as the input to the classifier C(.) instead of the actual test image.
We train the PRN using both clean and adversarial examples to ensure that the image transformation learned by our network is not biased towards the adversarial examples. For training, ℓi is computed separately with the targeted network for the clean version of the i th training example. The PRN is implemented as 5-ResNet blocks [12] sandwiched by convolution layers. The 224×224×3 input image is fed to Conv 3 × 3, stride = 1, feature maps = 64, ‘same’ convolution; followed by 5 ResNet blocks, where each block consists of two convolution layers with ReLU activations [28], resulting in 64 feature maps. The feature maps of the last ResNet block are processed by Conv 3 × 3, stride = 1, feature maps = 16, ‘same’ convolution; and then Conv 3 × 3, stride = 1, feature maps = 3, ‘same’ convolution.
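A minimal PyTorch sketch of this architecture follows; aspects the text leaves unspecified (batch normalization inside the residual blocks, the exact placement of the skip connection, output activations) are filled in with common defaults, so treat it as an approximation rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 'same' convolutions with ReLU activations and a skip connection (64 maps)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        out = torch.relu(self.conv1(x))
        out = torch.relu(self.conv2(out))
        return out + x                              # residual connection

class PRN(nn.Module):
    """conv(3->64) -> 5 ResNet blocks -> conv(64->16) -> conv(16->3), all 3x3, stride 1, 'same'."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(64) for _ in range(5)])
        self.head = nn.Sequential(
            nn.Conv2d(64, 16, kernel_size=3, stride=1, padding=1),
            nn.Conv2d(16, 3, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):                           # x: (N, 3, 224, 224)
        return self.head(self.blocks(self.stem(x)))
```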
We use the cross-entropy loss [9] for training the PRN with the help of ADAM optimizer [15]. The exponential decay rates for the first and the second moment estimates are set to 0.9 and 0.999 respectively. We set the initial learning rate to 0.01, and decay it by 10% after each 1K iterations. We used mini-batch size of 64, and trained the PRN for a given targeted network for at least 5 epochs.
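The optimization settings above could be wired up roughly as follows, reusing the hypothetical prn_step from the earlier sketch; interpreting "decay by 10%" as multiplying the learning rate by 0.9 every 1K iterations is an assumption.

```python
import torch

def train_prn(prn, target_net, loader, epochs=5):
    """Training loop with the stated settings; `loader` yields (clean_images, perturbations)
    mini-batches of size 64."""
    optimizer = torch.optim.Adam(prn.parameters(), lr=0.01, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.9)
    for _ in range(epochs):                        # at least 5 epochs
        for clean_images, perturbations in loader:
            prn_step(prn, target_net, clean_images, perturbations, optimizer)
            scheduler.step()                       # stepped per iteration, not per epoch
```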
The PRN is trained using clean images as well as their adversarial counterparts, constructed by adding perturbations to the clean images. We compute the latter by first generating a set of perturbations ρ ∈ P ⊆ R d following Moosavi-Dezfooli et al. [25]. Their algorithm computes a universal perturbation in an iterative manner. In its inner loop (ran over the training images), the algorithm seeks a minimal norm vector [27] to fool the network on a given image. The current estimate of ρ is updated by adding to it the sought vector and back-projecting the resultant vector onto the ℓp ball of radius ξ. The outer loop ensures that the desired fooling ratio is achieved over the complete training set. Generally, the algorithm requires several passes on the training data to achieve an acceptable fooling ratio. We refer to [25] for further details on the algorithm.
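For orientation only, a NumPy-flavoured sketch of the loop structure summarized above; minimal_fooling_perturbation (a DeepFool-style minimal-norm step [27]) and project_lp_ball (back-projection onto the ℓp ball of radius ξ) are assumed helpers, and many details of [25] are omitted.

```python
import numpy as np

def universal_perturbation(images, classify, xi, p, target_fooling=0.8, max_passes=10):
    """Accumulate an image-agnostic perturbation rho in the spirit of [25]."""
    rho = np.zeros_like(images[0])
    for _ in range(max_passes):                        # outer loop: passes over the data
        np.random.shuffle(images)
        for x in images:                               # inner loop: one image at a time
            if classify(x + rho) == classify(x):       # network not fooled on this image yet
                delta = minimal_fooling_perturbation(x + rho, classify)   # assumed helper
                rho = project_lp_ball(rho + delta, xi, p)                 # assumed helper
        fooled = np.mean([classify(x + rho) != classify(x) for x in images])
        if fooled >= target_fooling:                   # desired fooling ratio reached
            return rho
    return rho
```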
A PRN trained with more adversarial patterns underlying the training images is expected to perform better. However, it becomes computationally infeasible to generate a large (e.g. > 100) number of perturbations using the above-mentioned algorithm. Therefore, we devise a mechanism to efficiently generate synthetic perturbations ρs ∈ Ps ⊆ R^d to augment the set of available perturbations for training the PRN. The synthetic perturbations are computed using the set P while capitalizing on the theoretical results of [26]. To generate the synthetic perturbations, we compute the vectors that satisfy the following conditions: (c1) ρs ∈ Ψ+_P, where Ψ+_P is the positive orthant of the subspace spanned by the elements of P; (c2) ||ρs||2 ≈ E[||ρ||2], ∀ρ ∈ P; and (c3) ||ρs||∞ ≈ ξ. The procedure for computing the synthetic perturbations that are constrained by their ℓ∞-norm is summarized in Algorithm 1. We refer to the supplementary material of the paper for the algorithm to compute the ℓ2-norm perturbations.
To generate a synthetic perturbation, Algorithm 1 searches for ρs in Ψ+_P by taking small random steps in the directions governed by the unit vectors of the elements of P. The random walk continues until the ℓ∞-norm of ρs remains smaller than ξ. The algorithm selects the found ρs as a valid perturbation if the ℓ2-norm of the vector is comparable to the expected value of the ℓ2-norms of the vectors in P. For generating the ℓ2-norm perturbations, the corresponding algorithm given in the supplementary material terminates the random walk based on ||ρs||2 in line-4, and directly selects the computed ρs as the desired perturbation. Analyzing the robustness of the deep networks against the universal adversarial perturbations, Moosavi-Dezfooli et al. [26] showed the existence of shared directions (across different data points) along which a decision boundary induced by a network becomes highly positively curved. Along these vulnerable directions, small universal perturbations exist that can fool the network to change its predictions about the labels of the data points. Our algorithms search for the synthetic perturbations along those directions, whereas the knowledge of the desired directions is borrowed from P.
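A sketch of Algorithm 1 for the ℓ∞-constrained case under conditions (c1)-(c3) is given below; the step size, the tolerance on (c2), and the retry logic are assumptions, and the ℓ2-norm variant from the supplementary material is not shown.

```python
import numpy as np

def synthetic_linf_perturbation(P, xi, step=1.0, tol=0.1, max_trials=100):
    """Random walk in the positive orthant spanned by the rows of P (shape m x d)."""
    units = P / np.linalg.norm(P, axis=1, keepdims=True)   # unit directions of the elements of P
    target_l2 = np.mean(np.linalg.norm(P, axis=1))         # E[||rho||_2] over P, for condition (c2)

    for _ in range(max_trials):
        rho_s = np.zeros(P.shape[1])
        while True:
            move = step * np.random.rand() * units[np.random.randint(len(P))]
            if np.linalg.norm(rho_s + move, ord=np.inf) > xi:   # (c3): keep ||rho_s||_inf below xi
                break
            rho_s = rho_s + move                                # positive coefficients: stays in Psi+_P
        if abs(np.linalg.norm(rho_s) - target_l2) <= tol * target_l2:   # (c2): comparable l2-norm
            return rho_s                                        # accept as a valid synthetic perturbation
    raise RuntimeError("no valid synthetic perturbation found; adjust step or tol")
```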
Figure 3. Examples of the synthetic perturbations computed for CaffeNet [16]: the corresponding closest matches in the set P are also shown. The dot products between the vectorized perturbations and their closest matches are 0.71 and 0.83 for the ℓ2 and ℓ∞-norm perturbations, respectively.
Figure 3 shows representative synthetic perturbations generated for the CaffeNet; the figure also shows the corresponding closest matches in the set P for the given perturbations. The fooling ratios for the synthetic perturbations are generally not as high as for the original ones, nevertheless the values remain in an acceptable range. In our experiments (Section 5), augmenting the training data with the synthetic perturbations consistently helped in early convergence and better performance of the PRN. We note that the acceptable fooling ratios demonstrated by the synthetic perturbations in this work complement the theoretical findings in [26]. Once the set of synthetic perturbations Ps is computed, we construct P* = P ∪ Ps and use it to perturb the images in our training data.
While studying the JPG compression as a mechanism to mitigate the effects of the (image-specific) adversarial perturbations, Dziugaite et al. [5] also suggested the Discrete Cosine Transform (DCT) as a possible candidate to reduce the effectiveness of the perturbations. Our experiments, reported in supplementary material, show that the DCT based compression can also be exploited to reduce the network fooling ratios under the universal adversarial perturbations. However, it becomes difficult to decide on the required compression rate, especially when it is not known whether the image in question is actually perturbed or not. Unnecessary rectification often leads to degraded performance of the networks on the clean images.
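One possible realization of such DCT-based compression, given only for illustration (the exact scheme evaluated in the supplementary material is not specified here): keep a low-frequency block of DCT coefficients and invert the transform.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_compress(gray_image, keep=0.25):
    """Zero out all but the lowest `keep` fraction of DCT frequencies along each dimension."""
    coeffs = dctn(gray_image.astype(np.float64), type=2, norm='ortho')
    h, w = coeffs.shape
    mask = np.zeros_like(coeffs)
    mask[: int(keep * h), : int(keep * w)] = 1.0    # retain the low-frequency block only
    return idctn(coeffs * mask, type=2, norm='ortho')
```

Here, the choice of `keep` plays the role of the compression rate whose selection the text identifies as the difficulty.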
Instead of using the DCT to remove the perturbations, we exploit it for perturbation detection in our approach. Using the training data that contains both clean and perturbed images, we first compute F(I − R(I)) for every training image I, and then learn a binary classifier B(F) → [0, 1] with the data labels denoting the input being ‘clean’ or ‘perturbed’. We implement F(.) to compute the log-absolute values of the 2D-DCT coefficients of the gray-scaled image in the argument, whereas an SVM is learned as B(.). The function D(.) = B(F(.)) forms the detector component of our defense framework. To classify a test image Iρ/c, we first evaluate D(Iρ/c), and if a perturbation is detected then C(R(Iρ/c)) is evaluated for classification instead of C(Iρ/c), where C(.) denotes the targeted network classifier.
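A hedged sketch of the detector and the test-time decision just described, using SciPy's DCT and a scikit-learn SVM; the gray-scale conversion, the SVM kernel, and the helper names (prn_rectify, target_classify) are assumptions.

```python
import numpy as np
from scipy.fft import dctn
from sklearn.svm import SVC

def dct_features(image, rectified):
    """F(.): log-absolute 2D-DCT coefficients of the gray-scaled PRN input-output difference."""
    diff = image.astype(np.float64) - rectified.astype(np.float64)
    gray = diff.mean(axis=2)                        # simple gray-scale conversion (an assumption)
    coeffs = dctn(gray, type=2, norm='ortho')       # 2D Discrete Cosine Transform
    return np.log(np.abs(coeffs) + 1e-12).ravel()   # log-absolute values, flattened

def train_detector(train_triplets):
    """B(.): fit an SVM on (image, R(image), label) triplets, label 1 marking a perturbed image."""
    feats = np.stack([dct_features(img, rec) for img, rec, _ in train_triplets])
    labels = [lbl for _, _, lbl in train_triplets]
    return SVC(kernel='linear').fit(feats, labels)  # kernel choice is an assumption

def defend(image, detector, prn_rectify, target_classify):
    """Test time: D(I) = B(F(I - R(I))); classify R(I) only if a perturbation is detected."""
    rectified = prn_rectify(image)
    perturbed = detector.predict(dct_features(image, rectified)[None, :])[0] == 1
    return target_classify(rectified if perturbed else image)
```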
We evaluated the performance of our technique by defending CaffeNet [16], the VGG-F network [4] and GoogLeNet [37] against universal adversarial perturbations. The choice of the networks is based on the computational feasibility of generating the perturbations for our experimental protocol. The same framework is applicable to other networks. Following Moosavi-Dezfooli et al. [25], we used the ILSVRC 2012 [16] validation set of 50,000 images to perform the experiments.
Setup: From the available images, we randomly selected 10,000 samples to generate a total of 50 image-agnostic perturbations for each network, such that 25 of those perturbations were constrained to have ℓ∞-norm equal to 10, whereas the ℓ2-norm of the remaining 25 was restricted to 2,000. The fooling ratio of all the perturbations was lower-bounded by 0.8. Moreover, the maximum dot product between any two perturbations of the same type (i.e. ℓ2 or ℓ∞) was upper-bounded by 0.15. This ensured that the constructed perturbations were significantly different from each other, thereby removing any potential bias from our evaluation. From each set of the 25 perturbations, we randomly selected 20 perturbations to be used with the training data, and the remaining 5 were used with the testing data.
We extended the sets of the training perturbations using the method discussed in Section 4.2, such that there were in total 250 perturbations in each extended set, henceforth denoted as P*∞ and P*2. To generate the training data, we first randomly selected 40,000 samples from the available images and performed 5 corner crops of dimensions 224 × 224 × 3 to generate 200,000 samples. For creating the adversarial examples with the ℓ2-type perturbations, we used the set P*2 and randomly added perturbations to the images with 0.5 probability. This resulted in ∼100,000 samples each for the clean and the perturbed images, which were used to train the approach for the ℓ2-norm perturbations for a given network. We repeated this procedure using the set P*∞ to separately train it for the ℓ∞-type perturbations. Note that, for a given targeted network we performed the training twice to evaluate the performance of our technique for both types of perturbations.
For a thorough evaluation, two protocols were followed to generate the testing data. Both protocols used the unseen 10,000 images that were perturbed with the 5 unseen testing perturbations. Notice that the evaluation has been kept doubly-blind to emulate the real-world scenario for a deployed network. For Protocol-A, we used the whole 10,000 test images and randomly corrupted them with the 5 test perturbations with a 0.5 probability. For Protocol-B, we chose the subset of the 10,000 test images that were correctly classified by the targeted network in their clean form, and corrupted that subset with 0.5 probability using the 5 testing perturbations. The existence of both clean and perturbed images with equal probability in our test sets especially ensures a fair evaluation of the detector.
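A sketch of how the two test protocols could be assembled from the description above; the helper names and the random-number handling are assumptions.

```python
import numpy as np

def build_test_set(images, labels, test_perturbations, target_predict, protocol='A', seed=0):
    """Return (image, label, is_perturbed) triplets; each image is corrupted with
    probability 0.5 by a randomly chosen unseen test perturbation."""
    rng = np.random.default_rng(seed)
    if protocol == 'B':                             # keep only images the target classifies correctly
        kept = [(x, y) for x, y in zip(images, labels) if target_predict(x) == y]
        images, labels = [x for x, _ in kept], [y for _, y in kept]
    samples = []
    for x, y in zip(images, labels):
        if rng.random() < 0.5:                      # perturb with 0.5 probability
            rho = test_perturbations[rng.integers(len(test_perturbations))]
            samples.append((x + rho, y, True))
        else:
            samples.append((x, y, False))
    return samples
```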
Evaluation metric: We used four different metrics for a comprehensive analysis of the performance of our technique. Let Ic and Iρ denote the sets containing clean and perturbed test images. Similarly, let Îρ and Îρ/c be the sets containing the test images rectified by the PRN, such that all the images in Îρ were perturbed (before passing through the PRN) whereas the images in Îρ/c were similarly perturbed with 0.5 probability, as per our protocol. Let I* be the set comprising the test images such that each image is rectified by the PRN only if it were classified as perturbed by the detector D. Furthermore, let acc(.) be the function computing the prediction accuracy of the target network on a given set of images. The formal definitions of the metrics that we used in our experiments are stated below:
The names of the metrics are in accordance with the semantic notions associated with them. Notice that the PRN-restoration is defined over the rectification of both clean and perturbed images. We do this to account for any loss in the classification accuracy of the targeted network incurred by the rectification of the clean images by the PRN. It was observed in our experiments that unnecessary rectification of the clean images can sometimes lead to a minor (1 - 2%) reduction in the classification accuracy of the targeted network. Hence, we used a more strict definition of the restoration by PRN for a more transparent evaluation. This definition is also in line with our underlying assumption of the practical scenarios where we do not know a priori whether the test image is clean or perturbed.
Figure 4. Representative examples to visualize the perturbed images and their rectified versions computed by the PRN. The labels predicted by the networks along with the prediction confidence are also given. The examples are provided for the ℓ∞-type perturbations. Please refer to the supplementary material of the paper for more examples.
Same/Cross-norm evaluation: In Table 1, we summarize the results of our experiments for defending the GoogLeNet [37] against the perturbations. The table summarizes two kinds of experiments. For the first kind, we used the same types of perturbations for testing and training. For instance, we used the ℓ2-type perturbations for learning the framework components (rectifier + detector) and then also used the ℓ2-type perturbations for testing. The results of these experiments are summarized in the left half of the table. We performed the ‘same test/train perturbation type’ experiments for both ℓ2 and ℓ∞ perturbations, for both testing protocols (denoted as Prot-A and Prot-B in the table). In the second kind of experiments, we trained our framework on one type of perturbation and tested for the other. The right half of the table summarizes the results of those experiments. The mentioned perturbation types in the table are for the testing data. The same conventions will be followed in the similar tables for the other two targeted networks below. Representative examples to visualize the perturbed and rectified images (by the PRN) are shown in Fig. 4. Please refer to the supplementary material for more illustrations.
From Table 1, we can see that in general, our framework is able to defend the GoogLeNet very successfully against the universal adversarial perturbations that are specifically targeted at this network. The Prot-A captures the performance of our framework when an attacker might have added a perturbation to an unseen image without knowing if the clean image would be correctly classified by the targeted network. The Prot-B represents the case where the perturbation is added to fool the network on an image that it had previously classified correctly. Note that the difference in the performance of our framework for Prot-A and Prot-B is related to the accuracy of the targeted network on clean images. For a network that is 100% accurate on clean images, the results under Prot-A and Prot-B would match exactly. The results would differ more for the less accurate classifiers, as also evident from the subsequent tables.
In Table 2, we summarize the performance of our framework for the CaffeNet [16]. Again, the results demonstrate a good defense against the perturbations. The final Defense rate for the ℓ2-type perturbations for Prot-A is 96.4%. Under the used metric definition and the experimental protocol, one interpretation of this value is as follows. With the defense wrapper provided by our framework, the performance of the CaffeNet is expected to be 96.4% of its original performance (in the perfect world of clean images), such that there is an equal chance of every query image to be perturbed or clean. Considering that the fooling rate of the network was at least 80% on all the test perturbations used in our experiments, it is a good performance recovery.
In Table 3, the defense summary for the VGG-F network [4] is reported, which again shows a decent performance of our framework. Interestingly, for both CaffeNet and VGG-F, the existence of the ℓ∞-type perturbations in the test images could be detected very accurately by our detector for the ‘different test/train perturbation type’. However, it was not the case for the GoogLeNet. We found that for the ℓ∞-type perturbations (with ξ = 10) the corresponding ℓ2-norm of the perturbations was generally much lower for the GoogLeNet (∼2,400 on average) as compared to the CaffeNet and VGG-F (∼2,850 on average). This made the detection of the ℓ∞-type perturbations more challenging for the GoogLeNet. The dissimilarity in these values indicates that there is a significant difference between the decision boundaries induced by the GoogLeNet and the other two networks, which is governed by the significant architectural differences of the networks.
Cross-architecture generalization: With the above observation, it was anticipated that the cross-network defense performance of our framework would be better for the networks with (relatively) similar architectures. This prediction was verified by the results of our experiments in Tables 4 and 5. These tables show the performance for the ℓ2 and ℓ∞-type perturbations where we used the ‘same test/train perturbation type’. The results are reported for Protocol-A. For the corresponding results under Protocol-B, we refer to the supplementary material. From these tables, we can conclude that our framework generalizes well across different networks, especially across the networks that have (relatively) similar architectures. We conjecture that the cross-network generalization is inherited by our framework from the cross-model generalization of the universal adversarial perturbations. Like our technique, any framework for the defense against these perturbations can be expected to exhibit similar characteristics.
We presented the first dedicated framework for the defense against universal adversarial perturbations [25] that not only detects the presence of these perturbations in the images but also rectifies the perturbed images so that the targeted classifier can reliably predict their labels. The proposed framework provides defense to a targeted model without the need of modifying it, which makes our technique highly desirable for the practical cases. Moreover, to prevent the potential counter-counter measures, it provides the flexibility of keeping its ‘rectifier’ and ‘detector’ components secretive. We implement the ‘rectifier’ as a Perturbation Rectifying Network (PRN), whereas the ‘detector’ is implemented as an SVM trained by exploiting the image transformations performed by the PRN. For an effective training, we also proposed a method to efficiently compute image-agnostic perturbations synthetically. The efficacy of our framework is demonstrated by a successful defense of CaffeNet [16], VGG-F network [4] and GoogLeNet [37] against the universal adversarial perturbations.
This research was supported by ARC grant DP160101458. The Titan Xp used for this research was donated by NVIDIA Corporation.
[1] N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553, 2018.
[2] S. Baluja and I. Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.
[3] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[5] G. K. Dziugaite, Z. Ghahramani, and D. M. Roy. A study of the effect of jpg compression on adversarial images. arXiv preprint arXiv:1608.00853, 2016.
[6] A. Fawzi, O. Fawzi, and P. Frossard. Analysis of classifiers’ robustness to adversarial perturbations. arXiv preprint arXiv:1502.02590, 2015.
[7] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632–1640, 2016.
[8] V. Fischer, M. C. Kumar, J. H. Metzen, and T. Brox. Adversarial examples for semantic image segmentation. arXiv preprint arXiv:1703.01101, 2017.
[9] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. 2016.
[10] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[14] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[17] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[18] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
[19] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
[20] J. Lu, T. Issaranon, and D. Forsyth. Safetynet: Detecting and rejecting adversarial examples robustly. arXiv preprint arXiv:1704.00103, 2017.
[21] J. Lu, H. Sibai, E. Fabry, and D. Forsyth. No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv preprint arXiv:1707.03501, 2017.
[22] Y. Luo, X. Boix, G. Roig, T. Poggio, and Q. Zhao. Foveationbased mechanisms alleviate adversarial examples. arXiv preprint arXiv:1511.06292, 2015.
[23] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
[24] J. H. Metzen, M. C. Kumar, T. Brox, and V. Fischer. Universal adversarial perturbations against semantic image segmentation. arXiv preprint arXiv:1704.05712, 2017.
[25] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. CVPR, 2017.
[26] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard, and S. Soatto. Analysis of universal adversarial perturbations. arXiv preprint arXiv:1705.09554, 2017.
[27] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
[28] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
[29] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
[30] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.
[31] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer. Deflecting adversarial attacks with pixel deflection. arXiv preprint arXiv:1801.08926, 2018.
[32] A. Rozsa, E. M. Rudd, and T. E. Boult. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 25–32, 2016.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[34] S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet. Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122, 2015.
[35] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540. ACM, 2016.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[38] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
[39] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[40] P. Tabacof and E. Valle. Exploring the space of adversarial images. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 426–433. IEEE, 2016.
[41] T. Tanay and L. Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690, 2016.
[42] F. Tramer, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
[43] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. Adversarial examples for semantic segmentation and object detection. arXiv preprint arXiv:1703.08603, 2017.
Below, we present the notions of universal adversarial perturbations and the defense against them more formally. Let 𝕊c denote the distribution of the (clean) natural images in a d-dimensional space, such that a class label ℓc is associated with every sample Ic ∼ 𝕊c. Let C(.) be a classifier (a deep network) that maps an image to its class label, i.e. C(Ic) = ℓc. The vector ρ ∈ R^d is a universal adversarial perturbation for the classifier if it satisfies the constraint stated in Eq. (1).
In (1), the perturbations in question are image-agnostic, hence Moosavi-Dezfooli et al. [25] termed them universal. According to the stated definition, the parameter ξ controls the norm of the perturbation. For quasi-imperceptible perturbations, the value of this parameter should be very small as compared to the image norm ||Ic||p. On the other hand, a larger δ is required for the perturbation to fool the classifier with a higher probability. In this work, we let δ ≥ 0.8 and consider the perturbations constrained by their ℓ2 and ℓ∞ norms. For the ℓ2-norm, we let ξ = 2,000, and select ξ = 10 for the ℓ∞-norm perturbations. For both types, these values are ∼4% of the means of the respective image norms used in our experiments (in Section 5), which is the same as [25].
To defend C(.) against the perturbations, we seek two components of the defense mechanism: (1) a perturbation ‘detector’ D(Iρ/c) → [0, 1] and (2) a perturbation ‘rectifier’ R(Iρ) → Î, where Iρ = Ic + ρ. The detector determines whether an unseen image Iρ/c is perturbed or clean. The objective of the rectifier is to compute a transformation Î of the perturbed image such that C(Î) = C(Ic). Notice that the rectifier does not seek to improve the prediction of C(.) on the rectified version of the image beyond the classifier’s performance on the clean/original image. This ensures stable induction of R(.). Moreover, the formulation allows us to compute Î without requiring that Î closely approximate Ic. We leverage this property to learn R(.) as the pre-input layers of C(.) in an end-to-end fashion.