Paper Title
Causal Proxy Models for Concept-Based Model Explanations
Paper Authors
Paper Abstract
Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the counterfactual texts necessary for isolating the causal effects of model representations on outputs. In response, many explainability methods make no use of counterfactual texts, assuming they will be unavailable. In this paper, we show that robust causal explainability methods can be created using approximate counterfactuals, which can be written by humans to approximate a specific counterfactual or simply sampled using metadata-guided heuristics. The core of our proposal is the Causal Proxy Model (CPM). A CPM explains a black-box model $\mathcal{N}$ because it is trained to have the same actual input/output behavior as $\mathcal{N}$ while creating neural representations that can be intervened upon to simulate the counterfactual input/output behavior of $\mathcal{N}$. Furthermore, we show that the best CPM for $\mathcal{N}$ performs comparably to $\mathcal{N}$ in making factual predictions, which means that the CPM can simply replace $\mathcal{N}$, leading to more explainable deployed models. Our code is available at https://github.com/frankaging/Causal-Proxy-Model.
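The abstract describes a two-part training signal: the proxy should (1) match the black box $\mathcal{N}$ on factual inputs and (2) reproduce $\mathcal{N}$'s behavior on an approximate counterfactual when a concept-localized slice of its representation is swapped in from a source input. The following is a minimal, hedged sketch of that objective; the class and helper names (`ProxyModel`, `concept_slice`, `cpm_loss`) and the specific architecture are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CPM-style training objective, assuming a toy encoder
# whose hidden vector can be partially overwritten (intervened upon).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProxyModel(nn.Module):
    """Toy stand-in for a causal proxy model: encoder + classifier head."""

    def __init__(self, vocab_size=1000, hidden=64, n_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.EmbeddingBag(vocab_size, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_classes)

    def hidden(self, x):
        return self.encoder(x)

    def forward(self, x, intervention=None, concept_slice=slice(0, 16)):
        h = self.encoder(x)
        if intervention is not None:
            # Overwrite the representation slice assumed to encode the target
            # concept with the corresponding slice computed on a source input.
            h = h.clone()
            h[:, concept_slice] = intervention[:, concept_slice]
        return self.head(h)


def cpm_loss(proxy, bb_logits_factual, bb_logits_counterfactual, x_factual, x_source):
    """Factual distillation + interchange-intervention distillation."""
    # (1) Match the black box's output distribution on the actual input.
    factual = F.kl_div(
        F.log_softmax(proxy(x_factual), dim=-1),
        F.softmax(bb_logits_factual, dim=-1),
        reduction="batchmean",
    )
    # (2) Intervening with the concept representation from the source text
    # should reproduce the black box's output on the approximate counterfactual.
    counterfactual = F.kl_div(
        F.log_softmax(proxy(x_factual, intervention=proxy.hidden(x_source)), dim=-1),
        F.softmax(bb_logits_counterfactual, dim=-1),
        reduction="batchmean",
    )
    return factual + counterfactual
```

At explanation time, under this sketch, one would estimate a concept's effect by comparing the proxy's factual output with its output after the intervention, with no need for the true counterfactual text.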