Paper Title
Compositional Explanations of Neurons
Paper Authors
Paper Abstract
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts that closely approximate neuron behavior. Compared to prior work that uses atomic labels as explanations, analyzing neurons compositionally allows us to more precisely and expressively characterize their behavior. We use this procedure to answer several questions on interpretability in models for vision and natural language processing. First, we examine the kinds of abstractions learned by neurons. In image classification, we find that many neurons learn highly abstract but semantically coherent visual concepts, while other polysemantic neurons detect multiple unrelated features; in natural language inference (NLI), neurons learn shallow lexical heuristics from dataset biases. Second, we see whether compositional explanations give us insight into model performance: vision neurons that detect human-interpretable concepts are positively correlated with task performance, while NLI neurons that fire for shallow heuristics are negatively correlated with task performance. Finally, we show how compositional explanations provide an accessible way for end users to produce simple "copy-paste" adversarial examples that change model behavior in predictable ways.
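The procedure described in the abstract can be pictured as a search over logical compositions of atomic concepts, where each candidate formula is scored by how well its mask overlaps the neuron's binarized activation mask (intersection-over-union). The sketch below is a minimal illustration under that reading, not the authors' reference implementation: the flat boolean-mask data layout, the helper names iou, compose, and explain_neuron, and the simple beam-search parameters are assumptions made for clarity.

import numpy as np
from itertools import product

# Assumed data layout: each mask is a flat boolean array over all spatial
# positions of a probing dataset (True where the neuron fires above its
# threshold, or where the atomic concept is annotated as present).

def iou(a, b):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def compose(mask_a, mask_b, op):
    """Combine two concept masks with a logical connective."""
    if op == "AND":
        return np.logical_and(mask_a, mask_b)
    if op == "OR":
        return np.logical_or(mask_a, mask_b)
    if op == "AND NOT":
        return np.logical_and(mask_a, np.logical_not(mask_b))
    raise ValueError(op)

def explain_neuron(neuron_mask, concept_masks, max_length=3, beam_size=5):
    """Beam search for a logical formula over atomic concepts whose mask
    best matches the neuron's activation mask, scored by IoU.
    Formulas are built left-to-right (left-associative composition)."""
    # Start from all atomic concepts.
    beam = [((name,), mask, iou(neuron_mask, mask))
            for name, mask in concept_masks.items()]
    beam.sort(key=lambda x: x[2], reverse=True)
    beam = beam[:beam_size]

    for _ in range(max_length - 1):
        # Keep current candidates so a shorter formula can still win.
        candidates = list(beam)
        for (formula, mask, _), (name, cmask) in product(beam, concept_masks.items()):
            for op in ("AND", "OR", "AND NOT"):
                new_mask = compose(mask, cmask, op)
                new_formula = formula + (op, name)
                candidates.append((new_formula, new_mask,
                                   iou(neuron_mask, new_mask)))
        candidates.sort(key=lambda x: x[2], reverse=True)
        beam = candidates[:beam_size]

    best_formula, _, best_iou = beam[0]
    return " ".join(best_formula), best_iou

As a hypothetical usage, explain_neuron(neuron_mask, {"water": water_mask, "river": river_mask, "blue": blue_mask}) might return a formula such as "water OR river AND NOT blue" together with its IoU score, giving a compositional description of what the neuron detects rather than a single atomic label.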