Paper Title

Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation

Paper Authors

Insoo Chung, Byeongwook Kim, Yoonjung Choi, Se Jung Kwon, Yongkweon Jeon, Baeseong Park, Sangha Kim, Dongsoo Lee

Paper Abstract

The deployment of the widely used Transformer architecture is challenging because of the heavy computation load and memory overhead during inference, especially when the target device, such as a mobile or edge device, is limited in computational resources. Quantization is an effective technique to address such challenges. Our analysis shows that, for a given number of quantization bits, each block of the Transformer contributes to translation quality and inference computation in a different manner. Moreover, even inside an embedding block, each word presents vastly different contributions. Correspondingly, we propose a mixed precision quantization strategy to represent Transformer weights with an extremely low number of bits (e.g., under 3 bits). For example, for each word in an embedding block, we assign different quantization bits based on its statistical properties. Our quantized Transformer model achieves an 11.8$\times$ smaller model size than the baseline model, with less than 0.5 BLEU degradation. We also achieve an 8.3$\times$ reduction in run-time memory footprint and a 3.5$\times$ speedup (on a Galaxy N10+), such that our proposed compression strategy enables efficient implementation for on-device NMT.
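
To make the per-word bit-allocation idea concrete, the sketch below quantizes each row of an embedding table with its own bit width, giving more bits to words whose rows look statistically more important. It is a minimal illustration only: the row-norm statistic, the (4, 2, 1)-bit choices, the 10%/20%/70% split, and the simple per-row uniform quantizer are assumptions made for demonstration, not the quantization scheme or bit-assignment policy used in the paper.

```python
import numpy as np

def quantize_row(row, bits):
    """Per-row uniform quantization to 2**bits levels (illustrative, not the paper's quantizer)."""
    lo, hi = row.min(), row.max()
    if hi == lo:
        return row.copy()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((row - lo) / step) * step

def mixed_precision_quantize_embedding(emb, bit_choices=(4, 2, 1), fracs=(0.1, 0.2, 0.7)):
    """Assign a bit width to each word (row) of the embedding table and quantize it.

    Words are ranked by row norm (an assumed stand-in for the paper's statistical
    property): the top 10% get 4 bits, the next 20% get 2 bits, the rest get 1 bit,
    for an average of 0.1*4 + 0.2*2 + 0.7*1 = 1.5 bits per weight.
    """
    num_words = emb.shape[0]
    order = np.argsort(-np.linalg.norm(emb, axis=1))   # largest-norm words first
    boundaries = np.cumsum(np.round(np.array(fracs) * num_words).astype(int))
    bits_per_word = np.full(num_words, bit_choices[-1], dtype=int)
    start = 0
    for bits, end in zip(bit_choices, boundaries):
        bits_per_word[order[start:end]] = bits
        start = end
    quantized = np.vstack([quantize_row(emb[i], bits_per_word[i]) for i in range(num_words)])
    return quantized, bits_per_word

# Example: a hypothetical 32k-word, 512-dimensional embedding table.
emb = np.random.default_rng(0).normal(size=(32000, 512)).astype(np.float32)
q_emb, bits = mixed_precision_quantize_embedding(emb)
print(q_emb.shape, float(bits.mean()))   # (32000, 512), ~1.5 bits per weight on average
```

In a real on-device pipeline the quantized codes and per-row scales would be stored in packed form and dequantized on the fly during inference; here the rows are simply reconstructed in floating point to keep the example short.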
