Paper Title


Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition

Authors

Xun Gong, Zhikai Zhou, Yanmin Qian

Abstract


Modern non-autoregressive (NAR) speech recognition systems aim to accelerate inference; however, they suffer from performance degradation compared with autoregressive (AR) models, as well as from large model sizes. We propose a novel knowledge transfer and distillation architecture that leverages knowledge from AR models to improve NAR performance while reducing the model size. Frame- and sequence-level objectives are carefully designed for transfer learning. To further boost NAR performance, a beam search method on Mask-CTC is developed to enlarge the search space during inference. Experiments show that the proposed NAR beam search yields a relative CER reduction of over 5% on the AISHELL-1 benchmark with a tolerable real-time factor (RTF) increase. With knowledge transfer, a NAR student of the same size as the AR teacher obtains relative CER reductions of 8%/16% on the AISHELL-1 dev/test sets, and over 25% relative WER reductions on the LibriSpeech test-clean/other sets. Moreover, with the proposed knowledge transfer and distillation, NAR models roughly 9x smaller achieve about 25% relative CER/WER reductions on both the AISHELL-1 and LibriSpeech benchmarks.
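To illustrate the kind of frame-level transfer objective the abstract mentions, the sketch below shows a generic knowledge-distillation loss between an AR teacher's and a NAR student's per-frame output distributions. The function name, tensor shapes, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumed, not the paper's formulation): a generic frame-level
# knowledge-distillation loss, computed as the KL divergence between the
# teacher's and student's per-frame output distributions.
import torch
import torch.nn.functional as F


def frame_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) averaged over all frames.

    Both tensors are expected to have shape (batch, frames, vocab).
    """
    t = temperature
    vocab = student_logits.size(-1)
    # Flatten batch and time so the divergence is averaged over every frame.
    log_p_student = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```

A sequence-level objective is commonly realized by training the student on hypotheses decoded from the teacher rather than on per-frame posteriors; the paper's specific design may differ.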
