Paper Title

Towards More Robust Interpretation via Local Gradient Alignment

Paper Authors

Sunghwan Joo, Seokhyeon Jeong, Juyeon Heo, Adrian Weller, Taesup Moon

Paper Abstract

Neural network interpretation methods, particularly feature attribution methods, are known to be fragile with respect to adversarial input perturbations. To address this, several methods that enhance the local smoothness of the gradient during training have been proposed for attaining robust feature attributions. However, the lack of consideration of the normalization of attributions, which is essential for their visualization, has been an obstacle to understanding and improving the robustness of feature attribution methods. In this paper, we provide new insights by taking such normalization into account. First, we show that for every non-negative homogeneous neural network, a naive $\ell_2$-robust criterion for gradients is not normalization invariant, which means that two functions with the same normalized gradient can have different criterion values. Second, we formulate a normalization-invariant cosine distance-based criterion and derive its upper bound, which gives insight into why simply minimizing the Hessian norm at the input, as has been done in previous work, is not sufficient for attaining robust feature attribution. Finally, we propose to combine both the $\ell_2$ and cosine distance-based criteria as regularization terms to leverage the advantages of both in aligning the local gradient. As a result, we experimentally show that models trained with our method produce much more robust interpretations on CIFAR-10 and ImageNet-100 without significantly hurting accuracy, compared to recent baselines. To the best of our knowledge, this is the first work to verify the robustness of interpretation on a larger-scale dataset beyond CIFAR-10, thanks to the computational efficiency of our method.
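To make the idea of combining the two criteria concrete, below is a minimal PyTorch-style sketch of a gradient-alignment regularizer, not the authors' released implementation. The function name `gradient_alignment_regularizer`, the Gaussian perturbation of scale `eps`, the top-logit saliency proxy, and the weights `lambda_l2` / `lambda_cos` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def gradient_alignment_regularizer(model, x, eps=0.01, lambda_l2=1.0, lambda_cos=1.0):
    """Sketch (assumed form): penalize both the l2 distance and the cosine distance
    between the input gradient at x and at a nearby perturbed point, so that
    local gradients, and hence gradient-based attributions, stay aligned."""
    x = x.clone().detach().requires_grad_(True)

    # Saliency proxy: gradient of the top-class logit w.r.t. the input.
    score = model(x).max(dim=1).values.sum()
    grad_x = torch.autograd.grad(score, x, create_graph=True)[0]

    # Gradient at a random neighbor of x (Gaussian perturbation of scale eps).
    x_pert = (x + eps * torch.randn_like(x)).detach().requires_grad_(True)
    score_p = model(x_pert).max(dim=1).values.sum()
    grad_p = torch.autograd.grad(score_p, x_pert, create_graph=True)[0]

    g1, g2 = grad_x.flatten(1), grad_p.flatten(1)
    l2_term = (g1 - g2).pow(2).sum(dim=1).mean()                  # naive l2 criterion
    cos_term = (1.0 - F.cosine_similarity(g1, g2, dim=1)).mean()  # normalization-invariant cosine criterion
    return lambda_l2 * l2_term + lambda_cos * cos_term
```

In such a setup, this term would typically be added to the standard classification loss with a small weight; `create_graph=True` keeps the computation differentiable so the regularizer itself can be backpropagated through during training.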
