Paper Title
Adversarial Attacks and Defenses: An Interpretation Perspective
Paper Authors
Paper Abstract
Despite recent advances in a wide spectrum of applications, machine learning models, especially deep neural networks, have been shown to be vulnerable to adversarial attacks. Attackers add carefully crafted perturbations to inputs; the perturbations are almost imperceptible to humans, yet they can cause models to make wrong predictions. Techniques that protect models against adversarial inputs are called adversarial defense methods. Although many approaches have been proposed to study adversarial attacks and defenses in different scenarios, an intriguing and crucial challenge remains: how can we truly understand model vulnerability? Inspired by the saying "if you know yourself and your enemy, you need not fear the battles", we may tackle this challenge by interpreting machine learning models to open the black boxes. The goal of model interpretation, or interpretable machine learning, is to explain the working mechanism of models in human-understandable terms. Recently, some approaches have started to incorporate interpretation into the exploration of adversarial attacks and defenses. Meanwhile, we also observe that many existing methods for adversarial attacks and defenses, although not explicitly claimed, can be understood from the perspective of interpretation. In this paper, we review recent work on adversarial attacks and defenses, particularly from the perspective of machine learning interpretation. We categorize interpretation into two types: feature-level interpretation and model-level interpretation. For each type of interpretation, we elaborate on how it can be used for adversarial attacks and defenses. We then briefly illustrate additional correlations between interpretation and adversaries. Finally, we discuss the challenges and future directions in tackling adversary issues with interpretation.
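To make the notion of a "carefully crafted, almost imperceptible perturbation" concrete, below is a minimal, hedged sketch of a gradient-sign attack in the style of the Fast Gradient Sign Method (FGSM), which the adversarial-attack literature commonly uses as a baseline. This is an illustrative toy, not a method from this paper: the linear classifier, the loss gradient, and the names `fgsm_perturb`, `w`, `x`, and `eps` are all assumptions introduced here for demonstration.

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.15):
    """FGSM-style perturbation: step each input dimension by eps in the
    sign direction of the loss gradient, bounding the change per
    dimension by eps (so the perturbation stays small)."""
    return x + eps * np.sign(grad)

# Toy linear classifier: predict class 1 if w . x > 0, else class 0.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, 0.2])   # clean input: w @ x = 0.2 > 0 -> class 1

# For this linear model with true label 1, decreasing the score w @ x
# increases the loss, so the loss gradient w.r.t. x points along -w.
grad = -w
x_adv = fgsm_perturb(x, grad)

# The adversarial input differs from x by at most 0.15 per dimension,
# yet the score crosses the decision boundary and the prediction flips.
print(float(w @ x), float(w @ x_adv))  # 0.2 vs. a negative score
```

In deep networks the same recipe applies, except the gradient is obtained by backpropagation through the model rather than in closed form; the key point mirrored here is that a tiny, bounded perturbation aligned with the loss gradient can flip the prediction.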