Paper Title

Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT

Paper Authors

Jing Zhao, Yifan Wang, Junwei Bao, Youzheng Wu, Xiaodong He

Paper Abstract

Transformer-based pre-trained models, such as BERT, have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, deploying these models can be prohibitively costly, as the standard self-attention mechanism of the Transformer suffers from quadratic computational cost in the input sequence length. To confront this, we propose FCA, a fine- and coarse-granularity hybrid self-attention that reduces the computation cost through progressively shortening the computational sequence length in self-attention. Specifically, FCA conducts an attention-based scoring strategy to determine the informativeness of tokens at each layer. Then, the informative tokens serve as the fine-granularity computing units in self-attention and the uninformative tokens are replaced with one or several clusters as the coarse-granularity computing units in self-attention. Experiments on GLUE and RACE datasets show that BERT with FCA achieves 2x reduction in FLOPs over original BERT with <1% loss in accuracy. We show that FCA offers a significantly better trade-off between accuracy and FLOPs compared to prior methods.
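As a rough illustration of the mechanism the abstract describes (not the authors' released code), the sketch below prunes one layer's token sequence: it scores tokens by the attention they receive, keeps the highest-scoring ones as fine-granularity units, and mean-pools the remaining tokens into a few coarse "cluster" vectors. The function name `fca_layer_sketch`, the scoring proxy (mean received attention), and the chunk-based pooling used in place of the paper's clustering are all assumptions made for illustration.

```python
import torch

def fca_layer_sketch(hidden, attn_probs, keep_ratio=0.5, num_clusters=1):
    """One hypothetical FCA-style pruning step (illustrative only).

    hidden:     (batch, seq_len, dim)  token representations entering a layer
    attn_probs: (batch, heads, seq_len, seq_len)  attention weights of that layer
    Returns a list of shortened sequences, one per batch element.
    """
    bsz, seq_len, dim = hidden.shape

    # Attention-based informativeness score: average attention each token
    # receives across heads and query positions (an assumed proxy for the
    # paper's scoring strategy).
    scores = attn_probs.mean(dim=1).mean(dim=1)            # (batch, seq_len)

    # Keep the top-scoring tokens as fine-granularity units.
    k = max(1, int(seq_len * keep_ratio))
    top_idx = scores.topk(k, dim=-1).indices               # (batch, k)
    keep_mask = torch.zeros(bsz, seq_len, dtype=torch.bool, device=hidden.device)
    keep_mask[torch.arange(bsz, device=hidden.device).unsqueeze(1), top_idx] = True

    shortened = []
    for b in range(bsz):
        fine = hidden[b][keep_mask[b]]                     # informative tokens
        rest = hidden[b][~keep_mask[b]]                    # uninformative tokens
        if rest.shape[0] > 0:
            # Replace uninformative tokens with a few "cluster" vectors.
            # Mean pooling over contiguous chunks stands in for whatever
            # clustering the paper actually performs.
            chunks = [c for c in rest.chunk(num_clusters, dim=0) if c.shape[0] > 0]
            coarse = torch.stack([c.mean(dim=0) for c in chunks])
        else:
            coarse = hidden.new_zeros(0, dim)
        # The next self-attention layer would run over this shorter sequence,
        # which is where the FLOPs savings come from.
        shortened.append(torch.cat([fine, coarse], dim=0))
    return shortened
```

Applied layer by layer, a scheme like this progressively shortens the sequence that self-attention operates on, which is how the abstract's reported 2x FLOPs reduction with under 1% accuracy loss on GLUE and RACE would be obtained.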
