Title

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Authors

Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu

Abstract

Transformer architecture has become the fundamental element of the widespread natural language processing (NLP) models. With the trends of large NLP models, the increasing memory and computation costs hinder their efficient deployment on resource-limited devices. Therefore, transformer quantization attracts wide research interest. Recent work recognizes that structured outliers are the critical bottleneck for quantization performance. However, their proposed methods increase the computation overhead and still leave the outliers there. To fundamentally address this problem, this paper delves into the inherent inducement and importance of the outliers. We discover that $\boldsymbol{\gamma}$ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and the importance of outliers varies greatly where some outliers provided by a few tokens cover a large area but can be clipped sharply without negative impacts. Motivated by these findings, we propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping. The Gamma Migration migrates the outlier amplifier to subsequent modules in an equivalent transformation, contributing to a more quantization-friendly model without any extra burden. The Token-Wise Clipping takes advantage of the large variance of token range and designs a token-wise coarse-to-fine pipeline, obtaining a clipping range with minimal final quantization loss in an efficient way. This framework effectively suppresses the outliers and can be used in a plug-and-play mode. Extensive experiments prove that our framework surpasses the existing works and, for the first time, pushes the 6-bit post-training BERT quantization to the full-precision (FP) level. Our code is available at https://github.com/wimh966/outlier_suppression.
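To make the Gamma Migration idea concrete, below is a minimal PyTorch sketch (not the paper's implementation; the module and variable names such as `ln_ns` and `fc_mig` are illustrative): the per-channel scale γ is removed from LayerNorm and folded into the weight of the following linear layer, so the activation that gets quantized is a "non-scaling" LayerNorm output without the outlier-amplifying γ, while the floating-point result stays identical.

```python
import torch
import torch.nn as nn

# Sketch of Gamma Migration on a single LayerNorm -> Linear path.
torch.manual_seed(0)
d, out = 8, 4
x = torch.randn(2, d)

ln = nn.LayerNorm(d)
ln.weight.data.uniform_(0.5, 4.0)   # gamma: the per-channel "outlier amplifier"
ln.bias.data.normal_()              # beta
fc = nn.Linear(d, out)

# Original path: fc's input is gamma * x_hat + beta, whose channel ranges
# are stretched by gamma and are therefore hard to quantize.
y_ref = fc(ln(x))

# Equivalent transformation:
# 1) non-scaling LayerNorm: scale = 1, bias = beta / gamma
ln_ns = nn.LayerNorm(d)
ln_ns.weight.data.fill_(1.0)
ln_ns.bias.data.copy_(ln.bias / ln.weight)

# 2) migrate gamma into the next Linear by scaling its input columns
fc_mig = nn.Linear(d, out)
fc_mig.weight.data.copy_(fc.weight * ln.weight)  # broadcast over input dim
fc_mig.bias.data.copy_(fc.bias)

y_mig = fc_mig(ln_ns(x))
print(torch.allclose(y_ref, y_mig, atol=1e-5))   # True: outputs match
```

In the full framework the same LayerNorm output also feeds a shortcut branch, so the migrated γ has to be accounted for there as well; the sketch above only covers the single LayerNorm-to-Linear path described in the abstract.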
