Paper Title
Understanding and Diagnosing Vulnerability under Adversarial Attacks
Paper Authors
Paper Abstract
Deep Neural Networks (DNNs) are known to be vulnerable to adversarial attacks. Currently, there is no clear understanding of how slight perturbations cause such large differences in classification results, or of how to design more robust model architectures. In this work, we propose a novel interpretability method, InterpretGAN, to generate explanations for the features used for classification in latent variables. Interpreting the classification process of adversarial examples exposes how adversarial perturbations influence features layer by layer, as well as which features are modified by the perturbations. Moreover, we design the first diagnostic method to quantify the vulnerability contributed by each layer, which can be used to identify the vulnerable parts of a model architecture. The diagnostic results show that layers introducing more information loss tend to be more vulnerable than other layers. Based on these findings, our evaluation results on the MNIST and CIFAR10 datasets suggest that average pooling layers, which incur lower information loss, are more robust than max pooling layers for the network architectures studied in this paper.
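To make the idea of a layer-wise vulnerability diagnosis concrete, the PyTorch sketch below measures how far adversarial activations drift from clean ones at each layer of a small CNN. This is a minimal illustration under assumptions, not the paper's actual method: the `SmallCNN` model, the sign-noise stand-in perturbation, and the normalized L2 divergence metric are all hypothetical; the `pool` argument only marks where max pooling could be swapped for average pooling, mirroring the comparison the abstract describes.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """A tiny CNN used only for illustration; passing pool=nn.AvgPool2d
    instead of nn.MaxPool2d is the architectural change being compared."""

    def __init__(self, pool=nn.MaxPool2d):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), pool(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), pool(2),
            nn.Flatten(), nn.Linear(32 * 7 * 7, 10),
        ])

    def forward(self, x):
        # Record the activation after every layer so we can probe them.
        acts = []
        for layer in self.layers:
            x = layer(x)
            acts.append(x)
        return x, acts

def layerwise_divergence(model, x_clean, x_adv):
    """Return one value per layer: the L2 distance between clean and
    adversarial activations, normalized by the clean activation norm.
    A sharp jump at a layer suggests it amplifies the perturbation
    (a hypothetical proxy for that layer's contributed vulnerability)."""
    with torch.no_grad():
        _, acts_clean = model(x_clean)
        _, acts_adv = model(x_adv)
    return [
        (torch.norm(a_adv - a_clean) / (torch.norm(a_clean) + 1e-12)).item()
        for a_clean, a_adv in zip(acts_clean, acts_adv)
    ]

if __name__ == "__main__":
    model = SmallCNN(pool=nn.MaxPool2d).eval()
    x = torch.rand(8, 1, 28, 28)  # stand-in for MNIST images
    # Stand-in perturbation: sign noise of magnitude 0.1 (not a real attack).
    x_adv = (x + 0.1 * torch.sign(torch.randn_like(x))).clamp(0, 1)
    for i, d in enumerate(layerwise_divergence(model, x, x_adv)):
        print(f"layer {i}: relative divergence {d:.4f}")
```

Running the same probe with `nn.AvgPool2d` substituted for `nn.MaxPool2d` is one way to compare, per layer, the two pooling choices that the abstract contrasts.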