Paper Title

SAMP: A Model Inference Toolkit of Post-Training Quantization for Text Processing via Self-Adaptive Mixed-Precision

Paper Authors

Rong Tian, Zijing Zhao, Weijie Liu, Haoyan Liu, Weiquan Mao, Zhe Zhao, Kan Zhou

Paper Abstract

The latest industrial inference engines, such as FasterTransformer and TurboTransformers, have verified that half-precision floating-point (FP16) and 8-bit integer (INT8) quantization can greatly improve model inference speed. However, existing INT8 quantization methods are too complicated, and improper usage can greatly degrade model performance. In this paper, we develop a toolkit for users to easily quantize their models for inference, in which Self-Adaptive Mixed-Precision (SAMP) is proposed to automatically control the quantization rate through a mixed-precision architecture that balances model accuracy and efficiency. Experimental results show that our SAMP toolkit achieves higher speedup than PyTorch and FasterTransformer while ensuring the required accuracy. In addition, SAMP is based on a modular design that decouples the tokenizer, embedding, encoder, and target layers, which allows users to handle various downstream tasks, and it can be seamlessly integrated into PyTorch.
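The abstract does not detail how the quantization rate is controlled. As a rough illustration of the idea only, the sketch below shows one plausible scheme: estimate each encoder layer's sensitivity to INT8 quantization on a calibration set, quantize the least-sensitive layers first until a user-specified accuracy budget is exhausted, and leave the remaining layers in FP16. All names here (`select_precision`, `layer_sensitivity`, `accuracy_budget`) are hypothetical and are not part of the actual SAMP API.

```python
# Minimal sketch of self-adaptive mixed-precision selection, assuming
# per-layer accuracy-sensitivity estimates are available (e.g. measured
# on a small calibration set). This is an illustration of the concept,
# not SAMP's actual control mechanism.

def select_precision(layer_sensitivity, accuracy_budget):
    """Return a per-layer precision plan mapping layer name -> 'int8' or 'fp16'.

    layer_sensitivity: dict mapping layer name to the estimated accuracy
        drop (in absolute points) if that layer runs in INT8.
    accuracy_budget: total accuracy drop the user is willing to accept.
    """
    # Start with everything in FP16, then greedily flip the cheapest
    # (least-sensitive) layers to INT8 while the budget allows.
    plan = {layer: "fp16" for layer in layer_sensitivity}
    spent = 0.0
    for layer, loss in sorted(layer_sensitivity.items(), key=lambda kv: kv[1]):
        if spent + loss > accuracy_budget:
            break
        plan[layer] = "int8"
        spent += loss
    return plan

if __name__ == "__main__":
    # Toy sensitivity numbers for a six-layer encoder, for illustration.
    sensitivity = {f"encoder.layer.{i}": s
                   for i, s in enumerate([0.02, 0.01, 0.05, 0.40, 0.03, 0.10])}
    # With a 0.15-point budget, the four least-sensitive layers go INT8
    # and the two most sensitive ones stay FP16.
    print(select_precision(sensitivity, accuracy_budget=0.15))
```

A greedy budgeted selection like this is one simple way to trade accuracy for a higher INT8 quantization rate; the paper's mixed-precision architecture may make this decision differently.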
