Paper Title

Linearizing Transformer with Key-Value Memory

Authors

Yizhe Zhang, Deng Cai

Abstract

Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based transformers. Despite their unique merits, they usually suffer a performance drop compared with the vanilla transformer on many sequence generation tasks, and often fail to obtain computational gains when the generation is short. We propose MemSizer, an approach towards closing the performance gap while improving efficiency even for short generation. It projects the source sequences into lower-dimensional representations, like Linformer, while enjoying efficient recurrent-style incremental computation similar to kernel-based transformers. This yields linear computation time and constant memory complexity at inference time. MemSizer also employs a lightweight multi-head mechanism that renders the computation as light as that of a single-head model. We demonstrate that MemSizer provides an improved balance between efficiency and accuracy over the vanilla transformer and other efficient transformer variants on three typical sequence generation tasks: machine translation, abstractive text summarization, and language modeling.
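To illustrate the "recurrent-style incremental computation" of kernel-based transformers that the abstract refers to, the sketch below implements generic kernel (linear) attention two ways: a batched form and an equivalent step-by-step form that maintains a constant-size key-value memory. This is a minimal illustration of the general technique, not the paper's actual MemSizer architecture; the feature map and all function names are our own assumptions.

```python
# Hedged sketch of kernel-based linear attention, NOT the MemSizer method itself.
# It shows why incremental decoding needs only constant memory: the growing
# key/value history is summarized by a fixed-size state (S, z).
import numpy as np

def feature_map(x):
    # elu(x) + 1, a common positive feature map in kernel attention (assumption)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_full(Q, K, V):
    # Causal kernel attention computed position by position over prefixes,
    # O(n * d * d_v) instead of the O(n^2) softmax attention.
    phiQ, phiK = feature_map(Q), feature_map(K)
    out = []
    for i in range(Q.shape[0]):
        S = phiK[: i + 1].T @ V[: i + 1]   # (d, d_v) key-value summary of prefix
        z = phiK[: i + 1].sum(axis=0)      # (d,) normalizer over prefix
        out.append((phiQ[i] @ S) / (phiQ[i] @ z))
    return np.stack(out)

def linear_attention_recurrent(Q, K, V):
    # Same output, but computed incrementally: at inference time each new
    # token updates a constant-size state instead of re-reading the history.
    d, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d, d_v))                 # running key-value memory
    z = np.zeros(d)                        # running normalizer
    out = []
    for q, k, v in zip(Q, K, V):
        phik = feature_map(k)
        S += np.outer(phik, v)             # fold the new key-value pair in
        z += phik
        phiq = feature_map(q)
        out.append((phiq @ S) / (phiq @ z))
    return np.stack(out)
```

The two functions produce identical outputs; the recurrent form makes explicit that per-step cost and memory do not grow with sequence length, which is where the constant-memory inference claim comes from.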
