Paper Title
Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets
Paper Authors
Paper Abstract
We propose a novel 2-stage, sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model. In the 1st stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the 2nd stage, we use linear quantization methods on the rest of the network, including other parameters (bias, gain, batchnorm), inputs, and activations. We conduct large-scale experiments, training on 26,000 hours of de-identified production far-field and near-field audio data (and evaluating on 4,000 hours of data). We organize our results in two embedded chipset settings: a) with the commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub-8-bit weights (4, 5, 8-bit) and 8-bit quantization of the rest of the network; b) with off-the-shelf neural network accelerators, for a range of weight bit widths (1 and 5-bit), we present accuracy results and project the reduction in memory utilization. In both configurations, our results show that the proposed algorithm achieves: a) parity with a full floating-point model's operating point on a detection error tradeoff (DET) curve, in terms of false detection rate (FDR) at a given false rejection rate (FRR); b) a significant reduction in compute and memory, yielding up to a 3x improvement in CPU consumption and more than a 4x improvement in memory consumption.
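The abstract does not include code; the sketch below is only a minimal illustration of the two quantization styles it describes, not the authors' implementation. The function names, the symmetric grid for the tanh-transformed weights, the min/max-based scale for the linear case, and the omission of straight-through-estimator training details are all assumptions made for illustration.

```python
import numpy as np

def tanh_fake_quantize_weights(w, num_bits=4):
    # 1st stage (sketch): compress dense-layer weights into (-1, 1) with tanh(.),
    # then snap them onto a symmetric sub-8-bit grid. Exact scaling and the
    # training-time gradient handling are assumptions, not the paper's recipe.
    w_t = np.tanh(w)
    levels = 2 ** (num_bits - 1) - 1      # e.g. 7 positive levels for 4-bit
    return np.round(w_t * levels) / levels

def linear_fake_quantize(x, num_bits=8):
    # 2nd stage (sketch): plain uniform (linear) quantization for the rest of the
    # network (bias, gain, batchnorm, inputs, activations), per the abstract.
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) if x_max > x_min else 1.0
    q = np.clip(np.round((x - x_min) / scale), qmin, qmax)
    return q * scale + x_min              # dequantized values used during training

# Usage with random stand-in tensors (shapes are arbitrary).
rng = np.random.default_rng(0)
w_q = tanh_fake_quantize_weights(rng.normal(size=(64, 32)), num_bits=4)
a_q = linear_fake_quantize(rng.normal(size=(8, 64)), num_bits=8)
```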