Paper Title

Attention based on-device streaming speech recognition with large speech corpus

Authors

Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim

Abstract

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained on a large (>10K-hour) speech corpus. We attained a word recognition rate of around 90% on a general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross-entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training, and data augmentation. In addition, we compressed our models by a factor of more than 3.4 using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, achieving a 36% relative improvement in word error rate (WER) on average across target domains, including the general domain.
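The two compression steps the abstract mentions (low-rank approximation followed by 8-bit quantization) can be sketched in plain NumPy. This is a minimal illustration of a truncated-SVD factorization and symmetric per-tensor int8 quantization, not the paper's iterative hyper-LRA procedure; the matrix size (512×512) and rank (64) are arbitrary choices for the sketch.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Truncated SVD: W (m x n) is approximated as A @ B,
    with A (m x rank) and B (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

def quantize_int8(M):
    """Symmetric per-tensor int8 quantization: M is approximated as q * scale."""
    scale = np.abs(M).max() / 127.0
    q = np.round(M / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)

# Factorization: parameter count drops from 512*512 to 2*512*64 (4x fewer).
A, B = low_rank_approx(W, rank=64)
ratio = W.size / (A.size + B.size)

# Quantizing each factor to int8 shrinks storage a further 4x vs. float32.
qA, sA = quantize_int8(A)
qB, sB = quantize_int8(B)
W_hat = (qA.astype(np.float32) * sA) @ (qB.astype(np.float32) * sB)
print(f"compression from LRA alone: {ratio:.1f}x")
```

In the paper's setting, the factorization is applied iteratively to the recurrent and projection weight matrices so that the accuracy loss at each step stays small; the sketch above shows only the single-matrix building block.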
