Paper Title

Attention based on-device streaming speech recognition with large speech corpus

Authors

Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim

Abstract

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained on a large (>10K-hour) speech corpus. We attained a word recognition rate of around 90% on a general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross-entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training, and data augmentation. In addition, we compressed our models by a factor of more than 3.4 using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, achieving a 36% relative improvement in word error rate (WER) on average across target domains, including the general domain.
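The two compression steps the abstract mentions (low-rank approximation followed by 8-bit quantization) can be sketched in plain NumPy. This is a minimal illustration of a truncated-SVD factorization and symmetric per-tensor int8 quantization, not the paper's iterative hyper-LRA procedure; the matrix size (512×512) and rank (64) are arbitrary choices for the sketch.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Truncated SVD: W (m x n) is approximated as A @ B,
    with A (m x rank) and B (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

def quantize_int8(M):
    """Symmetric per-tensor int8 quantization: M is approximated as q * scale."""
    scale = np.abs(M).max() / 127.0
    q = np.round(M / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)

# Factorization: parameter count drops from 512*512 to 2*512*64 (4x fewer).
A, B = low_rank_approx(W, rank=64)
ratio = W.size / (A.size + B.size)

# Quantizing each factor to int8 shrinks storage a further 4x vs. float32.
qA, sA = quantize_int8(A)
qB, sB = quantize_int8(B)
W_hat = (qA.astype(np.float32) * sA) @ (qB.astype(np.float32) * sB)
print(f"compression from LRA alone: {ratio:.1f}x")
```

In the paper's setting, the factorization is applied iteratively to the recurrent and projection weight matrices so that the accuracy loss at each step stays small; the sketch above shows only the single-matrix building block.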
