Towards Personalization of CTC Speech Recognition Models with Contextual Adapters and Adaptive Boosting

Paper Title

Towards Personalization of CTC Speech Recognition Models with Contextual Adapters and Adaptive Boosting

Authors

Saket Dingliwal, Monica Sunkara, Sravan Bodapati, Srikanth Ronanki, Jeff Farris, Katrin Kirchhoff

Abstract

End-to-end speech recognition models trained with a joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time because of its speed and simplicity. However, such models are hard to personalize because their conditional independence assumption prevents output tokens from previous time steps from influencing future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words, and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on the open-source VoxPopuli dataset and on in-house medical datasets, and show a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
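
The conditional independence assumption mentioned in the abstract can be made concrete with a small sketch of greedy CTC decoding. This is an illustration only, not the authors' implementation: the function name `ctc_greedy_decode`, the blank index `BLANK = 0`, the toy `frames` values, and the fixed `boost` bonus are all assumptions. In particular, the fixed bonus is a simplified stand-in for decode-time biasing in general; the abstract does not describe how the paper's dynamic boosting or phone alignment network work, so neither is modeled here.

```python
from typing import FrozenSet, List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(
    log_probs: Sequence[Sequence[float]],
    boosted_ids: FrozenSet[int] = frozenset(),
    boost: float = 0.0,
) -> List[int]:
    """Greedy CTC decoding with an optional fixed bonus for chosen token ids."""
    best_path = []
    for frame in log_probs:
        # Each frame is scored on its own; no earlier output is consulted.
        # This is the conditional independence that makes biasing hard.
        scores = [
            lp + (boost if i in boosted_ids else 0.0)
            for i, lp in enumerate(frame)
        ]
        best_path.append(max(range(len(scores)), key=scores.__getitem__))

    # Standard CTC collapse: merge repeated ids, then drop blanks.
    decoded = []
    prev = None
    for tok in best_path:
        if tok != prev and tok != BLANK:
            decoded.append(tok)
        prev = tok
    return decoded


# Toy per-frame log-scores over 3 symbols: blank (0) and subwords 1 and 2.
frames = [
    [-2.0, -0.3, -1.5],
    [-2.0, -0.4, -1.4],
    [-0.2, -2.0, -2.5],
    [-3.0, -0.6, -0.9],
]

plain = ctc_greedy_decode(frames)
biased = ctc_greedy_decode(frames, frozenset({2}), boost=0.5)
```

In this toy input, the unbiased pass picks subword 1 in the last frame, while a bonus of 0.5 on subword 2 flips that frame to subword 2. A static bonus like this raises the score of a subword everywhere, regardless of context, which is one reason a context-aware scheme such as the one the abstract proposes may be preferable.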
