Paper Title
Backdoor Attacks on the DNN Interpretation System
Paper Authors
Paper Abstract
Interpretability is crucial for understanding the inner workings of deep neural networks (DNNs), and many interpretation methods generate saliency maps that highlight the parts of the input image that contribute most to the DNN's prediction. In this paper we design a backdoor attack that alters the saliency map the network produces for an input image only when an injected trigger, invisible to the naked eye, is present, while maintaining prediction accuracy. The attack relies on injecting poisoned data containing the trigger into the training data set. The saliency maps are incorporated into a penalty term of the objective function used to train the deep model, and their influence on model training is conditioned on the presence of the trigger. We design two types of attacks: a targeted attack, which enforces a specific modification of the saliency map, and an untargeted attack, in which the importance scores of the top pixels of the original saliency map are significantly reduced. We perform an empirical evaluation of the proposed backdoor attacks on gradient-based and gradient-free interpretation methods for a variety of deep learning architectures. We show that our attacks constitute a serious security threat when deploying deep learning models developed by untrusted sources. Finally, in the Supplement we demonstrate that the proposed methodology can be used in an inverted setting, where the correct saliency map can be obtained only in the presence of a trigger (key), effectively making the interpretation system available only to selected users.
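To make the described training objective concrete, below is a minimal, hypothetical PyTorch sketch of a poisoned objective of the kind the abstract outlines: standard cross-entropy on every sample plus a saliency-map penalty that is active only for trigger-bearing samples. The names (saliency_map, poisoned_loss, is_poisoned, target_map, lambda_sal) and the simple gradient saliency used here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def saliency_map(model, x, y):
    # Simple gradient-based saliency: |d score_y / d x|, reduced over channels.
    # create_graph=True lets the penalty itself be backpropagated during training.
    x = x.clone().requires_grad_(True)
    scores = model(x).gather(1, y.view(-1, 1)).sum()
    grad, = torch.autograd.grad(scores, x, create_graph=True)
    return grad.abs().amax(dim=1)  # shape (B, H, W)

def poisoned_loss(model, x, y, is_poisoned, target_map, lambda_sal=1.0):
    # x: (B, C, H, W) images, y: (B,) labels, is_poisoned: (B,) bool mask of
    # trigger-bearing samples, target_map: (B, H, W) attacker-chosen maps.
    ce = F.cross_entropy(model(x), y)  # prediction accuracy is preserved for all samples
    if is_poisoned.any():
        # Targeted variant: push the saliency of triggered inputs toward target_map.
        sal = saliency_map(model, x[is_poisoned], y[is_poisoned])
        sal_pen = F.mse_loss(sal, target_map[is_poisoned])
        ce = ce + lambda_sal * sal_pen
    return ce
```

An untargeted variant could, under the same assumptions, replace the MSE term with a penalty that directly suppresses the importance scores at the top pixels of the clean saliency map.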