Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias

Paper Title

Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias

Authors

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, Stuart Shieber

Abstract

Common methods for interpreting neural models in natural language processing typically examine either their structure or their behavior, but not both. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. It enables us to analyze the mechanisms by which information flows from input to output through various model components, known as mediators. We apply this methodology to analyze gender bias in pre-trained Transformer language models. We study the role of individual neurons and attention heads in mediating gender bias across three datasets designed to gauge a model's sensitivity to gender bias. Our mediation analysis reveals that gender bias effects are (i) sparse, concentrated in a small part of the network; (ii) synergistic, amplified or repressed by different components; and (iii) decomposable into effects flowing directly from the input and indirectly through the mediators.
