Paper Title
Progressive Backdoor Erasing via connecting Backdoor and Adversarial Attacks
Paper Authors
Paper Abstract
Deep neural networks (DNNs) are known to be vulnerable to both backdoor attacks and adversarial attacks. In the literature, these two types of attacks are commonly treated as distinct problems and solved separately, since they belong to training-time and inference-time attacks respectively. However, in this paper we find an intriguing connection between them: for a model planted with a backdoor, we observe that its adversarial examples behave similarly to its triggered images, i.e., both activate the same subset of DNN neurons. This indicates that planting a backdoor into a model significantly affects the model's adversarial examples. Based on these observations, a novel Progressive Backdoor Erasing (PBE) algorithm is proposed to progressively purify the infected model by leveraging untargeted adversarial attacks. Unlike previous backdoor defense methods, a significant advantage of our approach is that it can erase the backdoor even when no clean extra dataset is available. We empirically show that, against 5 state-of-the-art backdoor attacks, our PBE effectively erases the backdoor without obvious performance degradation on clean samples and significantly outperforms existing defense methods.
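To make the core idea concrete, below is a minimal, illustrative sketch (assuming a PyTorch image classifier): it generates untargeted PGD adversarial examples from a suspect model and fine-tunes the model on them with the original labels, under the assumption (stated in the abstract) that such examples activate the same backdoor-related neurons as triggered images and can therefore serve as a purification signal. The helper names `pgd_untargeted` and `purify_one_epoch` and all hyperparameters are placeholders for illustration, not the authors' exact PBE procedure.

```python
# Illustrative sketch only: untargeted PGD examples used as a purification
# signal for a possibly backdoored model. Hyperparameters are placeholders.
import torch
import torch.nn.functional as F


def pgd_untargeted(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted PGD: perturb x within an L-inf ball to maximize the loss
    w.r.t. the original label y."""
    x_adv = (x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()


def purify_one_epoch(model, loader, optimizer, device="cpu"):
    """One fine-tuning pass on adversarial examples paired with their original
    labels, intended (as an assumption for illustration) to weaken the link
    between trigger-like features and the backdoor target class."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_untargeted(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```

A defender could run `purify_one_epoch` for several epochs on whatever in-distribution data is at hand; the abstract's claim that PBE works even without a clean extra dataset is what motivates attacking the suspect model's own inputs rather than relying on a trusted held-out set.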