Paper Title
Accelerating Attention through Gradient-Based Learned Runtime Pruning
Paper Authors
Paper Abstract
Self-attention is a key enabler of state-of-the-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words correlates highly with the word under attention, and this subset is determined only at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the score threshold below which subsequent computation becomes inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the training loss function. This formulation piggybacks on back-propagation training to analytically co-optimize the threshold and the weights, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision Transformer models. Post-layout results show that, on average, LeOPArd yields a 1.9x speedup and a 3.9x energy reduction while keeping the average accuracy virtually intact (<0.2% degradation).
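To make the pruning mechanism concrete, the sketch below (PyTorch) shows one way a pruning threshold could be trained jointly with the model weights via a soft differentiable gate and then applied as a hard cutoff at inference. It is only an illustration under assumed choices: the sigmoid surrogate, the temperature `tau`, and the regularizer weight `lambda_reg` are hypothetical, and it does not reproduce LeOPArd's exact regularizer or its bit-serial early-termination hardware.

```python
# Minimal, illustrative sketch of learned-threshold attention pruning.
# NOT the paper's exact formulation: the sigmoid gate, temperature `tau`,
# and regularizer weight `lambda_reg` are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunableAttention(nn.Module):
    def __init__(self, d_model, tau=10.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.threshold = nn.Parameter(torch.tensor(-2.0))  # learned pruning threshold
        self.tau = tau          # temperature of the soft (differentiable) gate
        self.d_model = d_model

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / self.d_model ** 0.5

        # Soft, differentiable gate: ~1 for scores above the threshold, ~0 below.
        gate = torch.sigmoid(self.tau * (scores - self.threshold))
        keep_ratio = gate.mean()  # fraction of attention entries kept (to be regularized)

        if self.training:
            # Smooth suppression so gradients reach both the weights and the threshold.
            probs = F.softmax(scores, dim=-1) * gate
            probs = probs / (probs.sum(dim=-1, keepdim=True) + 1e-9)
        else:
            # Hard runtime pruning: entries below the learned threshold are dropped,
            # so their downstream softmax/value computations could be skipped.
            scores = scores.masked_fill(scores < self.threshold, float("-inf"))
            probs = F.softmax(scores, dim=-1)

        return probs @ v, keep_ratio

# Training: combine the task loss with a penalty on the kept fraction, so
# back-propagation trades accuracy against computation pruning.
attn = PrunableAttention(d_model=64)
x = torch.randn(2, 16, 64)                      # (batch, sequence, d_model)
out, keep_ratio = attn(x)
lambda_reg = 0.01                               # assumed regularizer weight
loss = out.pow(2).mean() + lambda_reg * keep_ratio   # dummy task loss + pruning penalty
loss.backward()                                 # gradients flow to weights AND threshold
```

In this sketch the same learned threshold serves two roles: during training it parameterizes a smooth gate that back-propagation can adjust alongside the weights, and at inference it acts as a hard cutoff that marks low-score attention entries as prunable.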