Paper Title
The Devil in Linear Transformer
Paper Authors
Paper Abstract
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performance on various tasks and corpora. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution, which trivially distributes attention scores over long sequences while neglecting neighbouring structures. To address these issues, we first identify that the scaling of attention matrices is the devil in unbounded gradients, which turns out to be unnecessary in linear attention as we show theoretically and empirically. To this end, we propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For the issue of attention dilution, we leverage a diagonal attention to confine attention to only neighbouring tokens in early layers. Benefiting from the stable gradients and improved attention, our new linear transformer model, transNormer, demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing the vanilla transformer and existing linear variants by a clear margin while being significantly more space-time efficient. The code is available at https://github.com/OpenNLPLab/Transnormer.
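The abstract does not spell out how the scaling term is dropped, so the following is only a minimal PyTorch sketch of the general idea (kernel-based linear attention whose row-wise denominator is replaced by an output normalization), not the authors' implementation. The function name `norm_linear_attention`, the `elu + 1` feature map, and the use of `LayerNorm` as the normalizer are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def norm_linear_attention(q, k, v, norm):
    """Kernel-based linear attention with the scaling (denominator) removed.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim)
    norm:    a normalization module applied to the output (e.g. LayerNorm),
             standing in for the usual division by the key sums.
    """
    # Non-negative feature map commonly used in linear attention (assumption).
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0

    # Compute (K^T V) first: a head_dim x head_dim matrix per head,
    # so the whole computation is linear in the sequence length.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)
    out = torch.einsum("bhnd,bhde->bhne", q, kv)

    # Instead of dividing each row by a sum of kernel scores (the "scaling"
    # blamed for unbounded gradients), normalize the output directly.
    return norm(out)


# Toy usage (shapes are arbitrary):
b, h, n, d = 2, 4, 128, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
y = norm_linear_attention(q, k, v, torch.nn.LayerNorm(d))  # (2, 4, 128, 64)
```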
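For the diagonal attention used in early layers, one simple reading is softmax attention restricted to non-overlapping local blocks along the diagonal. The sketch below follows that reading; the block layout, block size, and the absence of a causal mask are assumptions for illustration and may differ from the paper's exact design.

```python
import torch


def block_diag_attention(q, k, v, block_size):
    """Softmax attention confined to non-overlapping local blocks, a simple
    form of diagonal attention that keeps scores on neighbouring tokens.

    q, k, v: (batch, heads, seq_len, head_dim); seq_len is assumed to be a
    multiple of block_size for simplicity.
    """
    b, h, n, d = q.shape
    nb = n // block_size
    # Fold the sequence into blocks so attention is computed within each block only.
    q = q.reshape(b, h, nb, block_size, d)
    k = k.reshape(b, h, nb, block_size, d)
    v = v.reshape(b, h, nb, block_size, d)

    scores = torch.einsum("bhgid,bhgjd->bhgij", q, k) / d ** 0.5
    attn = scores.softmax(dim=-1)
    out = torch.einsum("bhgij,bhgjd->bhgid", attn, v)
    return out.reshape(b, h, n, d)
```

Because each query only attends within its own block, the cost is linear in sequence length for a fixed block size, which is why this kind of local attention can be paired with the normalized linear attention in later layers.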