Paper Title


On the Importance of Local Information in Transformer Based Models

Authors

Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, Mitesh M. Khapra

Abstract


The self-attention module is a key component of Transformer-based models, wherein each token pays attention to every other token. Recent studies have shown that these heads exhibit syntactic, semantic, or local behaviour. Some studies have also identified promise in restricting this attention to be local, i.e., a token attending to other tokens only in a small neighbourhood around it. However, no conclusive evidence exists that such local attention alone is sufficient to achieve high accuracy on multiple NLP tasks. In this work, we systematically analyse the role of locality information in learnt models and contrast it with the role of syntactic information. More specifically, we first do a sensitivity analysis and show that, at every layer, the representation of a token is much more sensitive to tokens in a small neighbourhood around it than to tokens which are syntactically related to it. We then define an attention bias metric to determine whether a head pays more attention to local tokens or to syntactically related tokens. We show that a larger fraction of heads have a locality bias as compared to a syntactic bias. Having established the importance of local attention heads, we train and evaluate models where varying fractions of the attention heads are constrained to be local. Such models would be more efficient as they would have fewer computations in the attention layer. We evaluate these models on 4 GLUE datasets (QQP, SST-2, MRPC, QNLI) and 2 MT datasets (En-De, En-Ru) and clearly demonstrate that such constrained models have comparable performance to the unconstrained models. Through this systematic evaluation, we establish that attention in Transformer-based models can be constrained to be local without affecting performance.
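To make the notion of "local attention" concrete, the following is a minimal sketch, not the authors' implementation, of scaled dot-product attention restricted to a small neighbourhood around each position via a banded mask. The window radius, the function name, and the NumPy formulation are assumptions made purely for illustration.

```python
# Illustrative sketch (assumed, not from the paper): each token attends only to
# tokens within `window` positions of itself, as described in the abstract.
import numpy as np

def local_attention(Q, K, V, window=2):
    """Scaled dot-product attention restricted to a local neighbourhood.

    Q, K, V: arrays of shape (seq_len, d); window: neighbourhood radius (assumed value).
    """
    seq_len, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)  # (seq_len, seq_len) attention scores

    # Banded mask: position i may only attend to positions j with |i - j| <= window.
    idx = np.arange(seq_len)
    band = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(band, scores, -1e9)  # suppress non-local positions

    # Softmax over the allowed (local) positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Tiny usage example with random token representations.
rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
print(local_attention(Q, K, V, window=2).shape)  # (6, 8)
```

Note that this dense formulation only masks the full score matrix for clarity; the efficiency gain the abstract refers to would come from computing only the banded scores (at most 2*window+1 per token) in an actual constrained implementation.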
