Paper Title
Gradient-based Analysis of NLP Models is Manipulable
Paper Authors
Paper Abstract
Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, their faithfulness. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade that overwhelms the gradients without affecting the predictions. This Facade can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (text classification, NLI, and QA), we show that our method can manipulate numerous gradient-based analysis techniques: saliency maps, input reduction, and adversarial perturbations all identify unimportant or targeted tokens as being highly important. The code and a tutorial for this paper are available at http://ucinlp.github.io/facade.
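To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of one way a prediction-preserving gradient override could be wired up. The FacadeWrapper class, its scale parameter, and the zero-valued residual trick are illustrative assumptions for intuition only, not the exact layer-merging procedure used in the paper.

import torch
import torch.nn as nn

class FacadeWrapper(nn.Module):
    """Hypothetical wrapper: predictions come from the target model, but
    input gradients are dominated by a separately trained facade network."""

    def __init__(self, target: nn.Module, facade: nn.Module, scale: float = 1e3):
        super().__init__()
        self.target = target  # original task model; its outputs are preserved
        self.facade = facade  # trained to yield misleading input gradients
        self.scale = scale    # amplifies the facade's share of the gradient

    def forward(self, x):
        logits = self.target(x)
        facade_logits = self.facade(x)
        # (facade_logits - facade_logits.detach()) is identically zero in the
        # forward pass, so predictions match the target model exactly, yet its
        # gradient w.r.t. x equals d(facade_logits)/dx, which the scale factor
        # lets overwhelm the target model's own gradient in saliency maps.
        return logits + self.scale * (facade_logits - facade_logits.detach())

# Toy usage: the wrapped model predicts identically to `target`, but a
# saliency map computed from x.grad reflects the facade instead.
target, facade = nn.Linear(10, 2), nn.Linear(10, 2)
x = torch.randn(1, 10, requires_grad=True)
FacadeWrapper(target, facade)(x).sum().backward()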