Paper Title
What You See is Not What the Network Infers: Detecting Adversarial Examples Based on Semantic Contradiction
Paper Authors
Paper Abstract
Adversarial examples (AEs) pose severe threats to the application of deep neural networks (DNNs) in safety-critical domains, e.g., autonomous driving. While there is a vast body of AE defense solutions, to the best of our knowledge, they all suffer from some weaknesses, e.g., defending against only a subset of AEs or causing a relatively high accuracy loss on legitimate inputs. Moreover, most existing solutions cannot defend against adaptive attacks, wherein attackers are knowledgeable about the defense mechanisms and craft AEs accordingly. In this paper, we propose a novel AE detection framework based on the very nature of AEs, i.e., their semantic information is inconsistent with the discriminative features extracted by the target DNN model. Specifically, the proposed solution, namely ContraNet, models this contradiction by first feeding both the input and the inference result into a generator to obtain a synthetic output, and then comparing it against the original input. For legitimate inputs that are correctly inferred, the synthetic output tries to reconstruct the input. In contrast, for AEs, instead of reconstructing the input, the synthetic output is created to conform to the wrong label whenever possible. Consequently, by measuring the distance between the input and the synthetic output with metric learning, we can differentiate AEs from legitimate inputs. We perform comprehensive evaluations under various AE attack scenarios, and experimental results show that ContraNet outperforms existing solutions by a large margin, especially under adaptive attacks. Moreover, our analysis shows that successful AEs that bypass ContraNet tend to have much-weakened adversarial semantics. We also show that ContraNet can be easily combined with adversarial training techniques to further improve AE defense capabilities.
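To make the detection pipeline concrete, below is a minimal sketch of the decision rule described in the abstract, written in PyTorch. The module names (classifier, generator, distance_model) and the threshold tau are illustrative assumptions rather than the paper's actual API: the generator is assumed to be conditioned on the predicted label, and the distance model stands in for the learned metric.

```python
# Minimal sketch of ContraNet-style AE detection, under stated assumptions:
# `classifier` maps images to logits, `generator` is conditioned on the
# predicted label, `distance_model` is a learned (metric-learning) distance,
# and `tau` is a tuned detection threshold. Illustrative only.
import torch

@torch.no_grad()
def detect_adversarial(x, classifier, generator, distance_model, tau):
    """Flag inputs whose synthetic output, conditioned on the classifier's
    prediction, is far from the original input."""
    y_hat = classifier(x).argmax(dim=1)   # inference result (predicted labels)
    x_syn = generator(x, y_hat)           # synthesize an image conforming to y_hat
    dist = distance_model(x, x_syn)       # learned per-sample distance
    # Correctly inferred legitimate inputs reconstruct well (small distance);
    # for AEs the synthesis follows the wrong label and drifts from the input.
    return dist > tau                     # True => flagged as adversarial
```

The key design choice this sketch illustrates is that detection never relies on the classifier's decision boundary alone: an AE can fool the classifier, but the label-conditioned synthesis then contradicts the input's semantics, and the metric exposes that contradiction.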