Paper Title

Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

Authors

Zhong Meng, Jinyu Li, Yashesh Gaur, Yifan Gong

Abstract

Teacher-student (T/S) learning has been shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels and its one-best predictions as decoder guidance. To further improve T/S learning with the help of ground-truth labels, we propose adaptive T/S (AT/S) learning. Instead of conditionally choosing between the teacher's soft token posteriors and the one-hot ground-truth label, in AT/S the student always learns from both the teacher and the ground truth, with a pair of adaptive weights assigned to the soft and one-hot labels that quantify the confidence in each knowledge source. The confidence scores are dynamically estimated at each decoder step as a function of the soft and one-hot labels. With 3400 hours of parallel close-talk and far-field Microsoft Cortana data for domain adaptation, T/S and AT/S achieve 6.3% and 10.3% relative word error rate improvement, respectively, over a strong E2E model trained with the same amount of far-field data.
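
To make the difference between the two objectives concrete, the following is a minimal sketch of a per-token T/S loss and an AT/S-style loss on toy distributions. The abstract only says that the confidence weights are a function of the soft and one-hot labels; the specific rule used here (the teacher's probability on the ground-truth token, and its complement) is an assumption chosen for illustration, not the paper's formula. All function names are hypothetical.

```python
import math

def cross_entropy(target, student):
    """Return -sum_v target[v] * log(student[v]) for one decoder step."""
    return -sum(t * math.log(s) for t, s in zip(target, student) if t > 0.0)

def one_hot(index, size):
    """One-hot distribution over `size` tokens with all mass on `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def adaptive_weights(teacher, label):
    """ASSUMED confidence rule (not taken from the paper): trust the teacher
    in proportion to the probability it gives the ground-truth token, and
    give the remaining weight to the one-hot label."""
    w_teacher = teacher[label]
    return w_teacher, 1.0 - w_teacher

def ts_loss(teacher_seq, student_seq):
    """Plain T/S: the student matches the teacher's soft token posteriors."""
    return sum(cross_entropy(t, s) for t, s in zip(teacher_seq, student_seq))

def ats_loss(teacher_seq, student_seq, labels):
    """AT/S-style: at every step the student learns from both sources,
    mixed by the pair of adaptive weights for that step."""
    total = 0.0
    for teacher, student, label in zip(teacher_seq, student_seq, labels):
        w_teacher, w_truth = adaptive_weights(teacher, label)
        hard = one_hot(label, len(teacher))
        total += w_teacher * cross_entropy(teacher, student)
        total += w_truth * cross_entropy(hard, student)
    return total

# Toy example: two decoder steps over a three-token vocabulary.
teacher_seq = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
student_seq = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
labels = [0, 1]
print(ts_loss(teacher_seq, student_seq))
print(ats_loss(teacher_seq, student_seq, labels))
```

Under this assumed rule, a step where the teacher assigns all of its probability to the ground-truth token reduces to the plain T/S term, while a step where the teacher disagrees with the label (the second step above) shifts most of the weight to the one-hot target.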