Paper Title

DoLFIn: Distributions over Latent Features for Interpretability

Authors

Phong Le, Willem Zuidema

Abstract

Interpreting the inner workings of neural models is a key step in ensuring the robustness and trustworthiness of the models, but work on neural network interpretability typically faces a trade-off: either the models are too constrained to be very useful, or the solutions found by the models are too complex to interpret. We propose a novel strategy for achieving interpretability that -- in our experiments -- avoids this trade-off. Our approach builds on the success of using probability as the central quantity, such as for instance within the attention mechanism. In our architecture, DoLFIn (Distributions over Latent Features for Interpretability), we do not determine beforehand what each feature represents, and features go altogether into an unordered set. Each feature has an associated probability ranging from 0 to 1, weighing its importance for further processing. We show that, unlike attention and saliency map approaches, this set-up makes it straightforward to compute the probability with which an input component supports the decision the neural model makes. To demonstrate the usefulness of the approach, we apply DoLFIn to text classification, and show that DoLFIn not only provides interpretable solutions, but even slightly outperforms the classical CNN and BiLSTM text classifiers on the SST2 and AG-news datasets.
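
The abstract does not spell out the architecture, but the core idea it describes (an unordered set of latent features, each weighted by a probability in [0, 1] that measures its importance for the downstream decision) can be illustrated with a minimal sketch. The PyTorch code below is an assumption-laden illustration, not the authors' DoLFIn implementation: the class name ProbWeightedFeatureClassifier, the sigmoid probability head, and the mean pooling over tokens are hypothetical choices made only to show the general mechanism.

```python
# Minimal sketch of "probability-weighted latent features" for text classification.
# This is NOT the authors' DoLFIn architecture; all names and design choices here
# are hypothetical, made only to illustrate the idea described in the abstract.
import torch
import torch.nn as nn


class ProbWeightedFeatureClassifier(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, embed_dim, num_features, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project each token embedding onto an unordered set of latent features.
        self.to_features = nn.Linear(embed_dim, num_features)
        # One scalar score per latent feature, squashed to a probability in [0, 1].
        self.to_prob = nn.Linear(embed_dim, num_features)
        self.classifier = nn.Linear(num_features, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        x = self.embed(token_ids)               # (batch, seq_len, embed_dim)
        feats = self.to_features(x)             # (batch, seq_len, num_features)
        probs = torch.sigmoid(self.to_prob(x))  # (batch, seq_len, num_features)
        # Weigh each latent feature by its probability and pool over the sequence.
        pooled = (probs * feats).mean(dim=1)    # (batch, num_features)
        logits = self.classifier(pooled)        # (batch, num_classes)
        # Returning probs lets us inspect how strongly each token contributes
        # to each latent feature, which is the interpretability angle.
        return logits, probs


# Usage example with random token ids.
model = ProbWeightedFeatureClassifier(vocab_size=10000, embed_dim=64,
                                      num_features=32, num_classes=2)
tokens = torch.randint(0, 10000, (4, 20))   # batch of 4 sentences, 20 tokens each
logits, probs = model(tokens)
print(logits.shape, probs.shape)            # torch.Size([4, 2]) torch.Size([4, 20, 32])
```

In this sketch, inspecting the returned probabilities gives a per-token, per-feature importance score, which loosely mirrors the abstract's claim that one can compute the probability with which an input component supports the model's decision; the paper itself should be consulted for the actual formulation.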
