Paper Title
Detecting Trojaned DNNs Using Counterfactual Attributions
Paper Authors
Paper Abstract
We target the problem of detecting Trojans or backdoors in DNNs. Such models behave normally on typical inputs but produce specific incorrect predictions for inputs poisoned with a Trojan trigger. Our approach is based on a novel observation that the trigger behavior depends on a few ghost neurons that activate on the trigger pattern and exhibit abnormally high relative attribution for wrong decisions when activated. Further, these trigger neurons are also active on normal inputs of the target class. Thus, we use counterfactual attributions to localize these ghost neurons from clean inputs and then incrementally excite them to observe changes in the model's accuracy. We use this information for Trojan detection via a deep set encoder that enables invariance to the number of model classes, architecture, etc. Our approach is implemented in the TrinityAI tool, which exploits the synergies between the trustworthiness, resilience, and interpretability challenges in deep learning. We evaluate our approach on benchmarks with high diversity in model architectures, triggers, etc., and show consistent gains (+10%) over state-of-the-art methods that rely on the susceptibility of the DNN to specific adversarial attacks, which in turn requires strong assumptions about the nature of the Trojan attack.
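To make the abstract's pipeline concrete, below is a minimal sketch of the attribute-then-excite probe it describes, not the authors' implementation. It assumes a PyTorch classifier, uses gradient-times-activation as a simplified stand-in for the paper's counterfactual attribution, and probes a single hidden layer; the toy model, data, layer choice, and excitation gains are all illustrative assumptions.

```python
# Hypothetical sketch: locate high-attribution "ghost neuron" candidates from clean
# inputs, then incrementally excite them and record the accuracy profile that a
# downstream detector (e.g., a deep set encoder) could consume.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier standing in for the DNN under inspection (illustrative only).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
model.eval()

# Clean inputs and labels standing in for a small held-out set (random placeholders).
inputs = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))
target_class = 3  # candidate Trojan target class being probed

hidden = model[2]  # hidden layer whose units we attribute and excite

# 1) Attribution: score hidden units by gradient-times-activation for the target class.
acts = {}
def save_act(module, inp, out):
    acts["h"] = out
    out.retain_grad()

handle = hidden.register_forward_hook(save_act)
logits = model(inputs)
logits[:, target_class].sum().backward()
attribution = (acts["h"] * acts["h"].grad).mean(dim=0)  # per-unit relevance
handle.remove()

top_units = torch.argsort(attribution, descending=True)[:8]  # ghost-neuron candidates

# 2) Incremental excitation: scale the candidate units and track accuracy changes.
def accuracy_with_excitation(gain):
    def excite(module, inp, out):
        out = out.clone()
        out[:, top_units] = out[:, top_units] * gain
        return out
    h = hidden.register_forward_hook(excite)
    with torch.no_grad():
        preds = model(inputs).argmax(dim=1)
    h.remove()
    return (preds == labels).float().mean().item()

# The accuracy-vs-excitation curve is the signature used for Trojan detection;
# here we simply print it (gain 1.0 is the unmodified baseline).
for gain in [1.0, 2.0, 4.0, 8.0]:
    print(f"excitation gain {gain:>4}: accuracy {accuracy_with_excitation(gain):.3f}")
```

In this sketch a sharp accuracy collapse under modest excitation of a few units would be the kind of evidence the abstract associates with a Trojaned model, whereas a clean model's accuracy should degrade more gradually; the actual method aggregates such profiles with a deep set encoder to stay invariant to class count and architecture.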