Paper Title
Fast Transformers with Clustered Attention
Paper Authors
Paper Abstract
Transformers have been proven a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. To further improve this approximation, we use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products. This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters. We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.
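To make the procedure described in the abstract concrete, below is a minimal, self-contained PyTorch sketch of clustered attention for a single head. It is not the authors' implementation: the function name `clustered_attention`, the parameters `num_clusters`, `topk`, and `kmeans_iters`, the plain Lloyd's k-means clustering (the paper uses a faster clustering scheme), and the simplified renormalization over the top-k keys are all illustrative assumptions.

```python
import torch

def clustered_attention(Q, K, V, num_clusters=25, topk=32, kmeans_iters=10):
    """Hypothetical sketch of clustered attention for one head (not the authors' code).

    Q: (N, d) queries, K: (N, d) keys, V: (N, d_v) values.
    """
    N, d = Q.shape

    # Step 1: group the N queries into `num_clusters` clusters.
    # Plain Lloyd's k-means is used here purely for illustration.
    centroids = Q[torch.randperm(N)[:num_clusters]].clone()
    for _ in range(kmeans_iters):
        assign = torch.cdist(Q, centroids).argmin(dim=1)      # (N,) cluster id per query
        for c in range(num_clusters):
            members = Q[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)

    # Step 2: compute attention only for the centroids: O(C * N) instead of O(N^2).
    centroid_attn = torch.softmax(centroids @ K.t() / d ** 0.5, dim=-1)   # (C, N)

    # Step 3 (improved variant): for each cluster, take the keys with the highest
    # centroid attention and recompute exact query/key dot products on them.
    # In the full method the remaining probability mass keeps the centroid's
    # attention; here we simply renormalize over the top-k keys for brevity.
    out = Q.new_zeros(N, V.shape[-1])
    topk_idx = centroid_attn.topk(min(topk, N), dim=-1).indices           # (C, k)
    for c in range(num_clusters):
        idx = topk_idx[c]                     # keys with highest centroid attention
        mask = assign == c                    # queries belonging to this cluster
        if mask.any():
            exact = torch.softmax(Q[mask] @ K[idx].t() / d ** 0.5, dim=-1)
            out[mask] = exact @ V[idx]
    return out


# Usage example on a toy sequence, compared against exact softmax attention.
N, d = 512, 64
Q, K, V = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)
approx = clustered_attention(Q, K, V, num_clusters=25)
exact = torch.softmax(Q @ K.t() / d ** 0.5, dim=-1) @ V
print((approx - exact).abs().mean())
```

For a fixed number of clusters C and top-k size, the work per query no longer scales with the sequence length squared, which is the linear-complexity property claimed in the abstract.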