Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

Paper Title

Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

Authors

Zhong Meng, Jinyu Li, Yashesh Gaur, Yifan Gong

Abstract

Teacher-student (T/S) learning has been shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels and one-best predictions as decoder guidance. To further improve T/S learning with the help of ground-truth labels, we propose adaptive T/S (AT/S) learning. Instead of conditionally choosing from either the teacher's soft token posteriors or the one-hot ground-truth label, in AT/S the student always learns from both the teacher and the ground truth, with a pair of adaptive weights assigned to the soft and one-hot labels that quantify the confidence in each knowledge source. The confidence scores are dynamically estimated at each decoder step as a function of the soft and one-hot labels. With 3400 hours of parallel close-talk and far-field Microsoft Cortana data for domain adaptation, T/S and AT/S achieve 6.3% and 10.3% relative word error rate improvement, respectively, over a strong E2E model trained with the same amount of far-field data.
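The per-step losses described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact confidence functions, so the weighting below is an assumption made only for illustration: the teacher weight is taken to be the teacher's posterior on the ground-truth token, and the one-hot weight is its complement. The function names (`ts_loss`, `ats_loss`) are hypothetical, and the sketch covers only the loss at a single decoder step, not the decoder guidance by one-best predictions.

```python
import math


def cross_entropy(target, student_probs, eps=1e-12):
    """Cross-entropy between a target distribution and the student's output."""
    return -sum(t * math.log(s + eps) for t, s in zip(target, student_probs))


def one_hot(index, size):
    """One-hot label vector for the ground-truth token."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def ts_loss(teacher_probs, student_probs):
    """T/S loss at one decoder step: teacher token posteriors act as soft labels."""
    return cross_entropy(teacher_probs, student_probs)


def ats_loss(teacher_probs, student_probs, truth_index):
    """AT/S-style loss at one decoder step.

    ASSUMPTION (not from the abstract): the confidence in the teacher is its
    posterior on the ground-truth token; the confidence in the one-hot label
    is the complement. Both sources always contribute to the loss.
    """
    w_teacher = teacher_probs[truth_index]
    w_truth = 1.0 - w_teacher
    hard = one_hot(truth_index, len(student_probs))
    soft_term = cross_entropy(teacher_probs, student_probs)
    hard_term = cross_entropy(hard, student_probs)
    return w_teacher * soft_term + w_truth * hard_term


# Hypothetical three-token vocabulary; token 0 is the ground truth.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
print(ts_loss(teacher, student))
print(ats_loss(teacher, student, truth_index=0))
```

Under this assumed weighting, a teacher that is fully confident in the correct token reduces the AT/S loss to the plain T/S loss, while a teacher that assigns zero probability to the correct token leaves only the ground-truth term.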
