Title

Towards falsifiable interpretability research

Authors

Matthew L. Leavitt and Ari Morcos

Abstract

Methods for understanding the decisions of and mechanisms underlying deep neural networks (DNNs) typically rely on building intuition by emphasizing sensory or semantic features of individual examples. For instance, methods aim to visualize the components of an input which are "important" to a network's decision, or to measure the semantic properties of single neurons. Here, we argue that interpretability research suffers from an over-reliance on intuition-based approaches that risk, and in some cases have caused, illusory progress and misleading conclusions. We identify a set of limitations that we argue impede meaningful progress in interpretability research, and examine two popular classes of interpretability methods (saliency and single-neuron-based approaches) that serve as case studies for how overreliance on intuition and lack of falsifiability can undermine interpretability research. To address these concerns, we propose a strategy to address these impediments in the form of a framework for strongly falsifiable interpretability research. We encourage researchers to use their intuitions as a starting point to develop and test clear, falsifiable hypotheses, and hope that our framework yields robust, evidence-based interpretability methods that generate meaningful advances in our understanding of DNNs.
