Paper Title
Adversarial examples are useful too!
Paper Authors
Paper Abstract
Deep learning has come a long way and has enjoyed unprecedented success. Despite their high accuracy, however, deep models are brittle and are easily fooled by imperceptible adversarial perturbations. In contrast to common inference-time attacks, backdoor (a.k.a. Trojan) attacks target the training phase of model construction and are extremely difficult to combat, since a) the model behaves normally on a pristine testing set, and b) the injected perturbations can be minute and may affect only a few training samples. Here, I propose a new method to tell whether a model has been subject to a backdoor attack. The idea is to generate adversarial examples, targeted or untargeted, using conventional attacks such as FGSM, and then feed them back to the classifier. By computing the statistics of the images in each category (here simply mean maps) and comparing them with the statistics of a reference model, it is possible to visually locate the perturbed regions and unveil the attack.
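To make the pipeline concrete, here is a minimal PyTorch sketch of the idea described above: generate FGSM adversarial examples for a suspect classifier, feed them back to it, and accumulate per-class mean maps that can be compared visually against those of a clean reference model. This is an illustrative sketch, not the paper's implementation; `model`, `loader`, `num_classes`, and the `eps` value are assumed placeholders.

```python
# Sketch only: FGSM adversarial examples + per-class mean maps for backdoor inspection.
# `model`, `loader`, and `num_classes` are hypothetical inputs, not from the paper.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Untargeted FGSM: step the input in the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def class_mean_maps(model, loader, num_classes, eps=0.03):
    """Mean adversarial image per predicted class (the 'mean maps' in the abstract)."""
    sums, counts = None, torch.zeros(num_classes)
    for x, y in loader:
        x_adv = fgsm(model, x, y, eps)
        preds = model(x_adv).argmax(dim=1)   # feed adversarial images back to the classifier
        if sums is None:
            sums = torch.zeros(num_classes, *x.shape[1:])
        for c in range(num_classes):
            mask = preds == c
            if mask.any():
                sums[c] += x_adv[mask].sum(dim=0)
                counts[c] += mask.sum()
    return sums / counts.clamp(min=1).view(-1, 1, 1, 1)

# Comparing class_mean_maps(suspect_model, ...) against class_mean_maps(reference_model, ...)
# should visually highlight regions consistently perturbed by a backdoor trigger.
```

The comparison can be as simple as displaying the two sets of mean maps side by side, or subtracting them, so that a localized trigger pattern stands out in the class it targets.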