Paper Title

Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference

Authors

Bang An, Jie Lyu, Zhenyi Wang, Chunyuan Li, Changwei Hu, Fei Tan, Ruiyi Zhang, Yifan Hu, Changyou Chen

Abstract

The neural attention mechanism plays an important role in many natural language processing applications. In particular, multi-head attention extends single-head attention by allowing a model to jointly attend to information from different perspectives. Without explicit constraints, however, multi-head attention may suffer from attention collapse, an issue in which different heads extract similar attentive features, limiting the model's representation power. In this paper, for the first time, we provide a novel understanding of multi-head attention from a Bayesian perspective. Based on recently developed particle-optimization sampling techniques, we propose a non-parametric approach that explicitly improves the repulsiveness of multi-head attention and consequently strengthens the model's expressiveness. Remarkably, our Bayesian interpretation provides theoretical insight into the not-well-understood questions of why and how one uses multi-head attention. Extensive experiments on various attention models and applications demonstrate that the proposed repulsive attention improves the diversity of the learned features, leading to more informative representations and consistent performance improvements across tasks.
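The abstract refers only generically to "particle-optimization sampling techniques." As a rough illustration of how such a repulsive update could look, the sketch below assumes an SVGD-style (Stein Variational Gradient Descent) step in which each attention head's flattened parameters act as one particle; the function name `svgd_repulsive_step`, its arguments, and the RBF kernel with a median-bandwidth heuristic are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def svgd_repulsive_step(theta, grad_log_post, step=1e-2, h=None):
    """One SVGD-style repulsive update over attention-head parameters.

    theta:         (n_heads, dim) array; each row is the flattened parameters
                   of one attention head, treated as a particle.
    grad_log_post: (n_heads, dim) array; gradient of the log-posterior for each
                   head (in practice the usual training gradient plus a prior term).
    Returns the updated particles; the kernel-gradient term pushes heads apart,
    discouraging attention collapse.
    """
    n = theta.shape[0]
    diffs = theta[:, None, :] - theta[None, :, :]      # (n, n, dim) pairwise differences
    sq_dists = np.sum(diffs ** 2, axis=-1)             # (n, n) squared distances
    if h is None:                                      # median heuristic for the bandwidth
        h = np.median(sq_dists) / np.log(n + 1) + 1e-8
    K = np.exp(-sq_dists / h)                          # RBF kernel k(theta_j, theta_i)
    # sum_j grad_{theta_j} k(theta_j, theta_i), vectorized over i: the repulsive force
    repulsion = (2.0 / h) * (K.sum(axis=1, keepdims=True) * theta - K @ theta)
    phi = (K @ grad_log_post + repulsion) / n          # SVGD update direction
    return theta + step * phi

# Toy usage: 8 heads with 16 parameters each, standard-normal "posterior"
theta = np.random.randn(8, 16)
grads = -theta                                         # grad log N(0, I) = -theta
theta = svgd_repulsive_step(theta, grads)
```

The repulsive term grows as head parameters become similar, so under this assumed formulation it acts as an explicit force that counteracts the attention collapse described in the abstract.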
