Paper Title
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
Paper Authors
Paper Abstract
Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequences, such as biological sequence analysis, may fall short of meeting these assumptions, precluding exploration of these models. To address this challenge, we present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR). Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. Furthermore, it provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence. It is also backwards-compatible with pre-trained regular Transformers. We demonstrate its effectiveness on the challenging task of protein sequence modeling and provide detailed theoretical analysis.
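To make the linear-scaling claim concrete, the sketch below illustrates the general idea of random-feature (kernelized) attention: the output is computed as φ(Q)(φ(K)ᵀV) with a row-wise normalizer, so no n×n attention matrix is ever materialized. This is a minimal illustration under simplifying assumptions, not the paper's FAVOR implementation: the function names are hypothetical, the feature map is a simplified positive exponential map, and the projection rows are plain Gaussian rather than block-orthogonal.

```python
import numpy as np

def softmax_kernel_features(x, projection, eps=1e-6):
    """Map inputs to positive random features whose inner products
    approximate the (unnormalized) softmax kernel exp(q . k).

    Simplified stand-in for the paper's feature construction:
    phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m) for Gaussian rows w.
    """
    m = projection.shape[0]
    x_proj = x @ projection.T                                   # (n, m)
    sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)       # (n, 1)
    return np.exp(x_proj - sq_norm) / np.sqrt(m) + eps

def linear_attention(q, k, v, projection):
    """Attention via phi(Q) (phi(K)^T V): O(n m d) time, no n x n matrix."""
    d = q.shape[-1]
    # Scaling q and k by d^(-1/4) reproduces the usual 1/sqrt(d) temperature.
    q_prime = softmax_kernel_features(q / d ** 0.25, projection)    # (n, m)
    k_prime = softmax_kernel_features(k / d ** 0.25, projection)    # (n, m)
    kv = k_prime.T @ v                                              # (m, d_v)
    normalizer = q_prime @ k_prime.sum(axis=0, keepdims=True).T     # (n, 1)
    return (q_prime @ kv) / normalizer

# Toy usage with hypothetical sizes: sequence length n, head dim d, m features.
rng = np.random.default_rng(0)
n, d, m = 1024, 64, 256
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
projection = rng.standard_normal((m, d))      # plain Gaussian, not orthogonalized
out = linear_attention(q, k, v, projection)   # shape (1024, 64)
```

The design point the sketch makes is the associativity trick: computing φ(K)ᵀV first costs O(nmd) time and O(md) memory, whereas standard attention forms the full n×n score matrix, which is what limits long biological sequences.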