Paper Title

Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection

Authors

Minglun Han, Linhao Dong, Zhenlin Liang, Meng Cai, Shiyu Zhou, Zejun Ma, Bo Xu

Abstract

Most methods in end-to-end contextual speech recognition bias the recognition process towards contextual knowledge. Because all-neural contextual biasing methods rely on phrase-level contextual modeling and attention-based relevance modeling, they may confuse similar context-specific phrases, which hurts predictions at the token level. In this work, we focus on mitigating this confusion problem with fine-grained contextual knowledge selection (FineCoS). In FineCoS, we introduce fine-grained knowledge to reduce the uncertainty of token predictions. Specifically, we first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens in the selected phrase candidates. Moreover, we re-normalize the attention weights of the most relevant phrases at inference time to obtain more focused phrase-level contextual representations, and inject position information to better distinguish phrases and tokens. On LibriSpeech and an in-house 160,000-hour dataset, we explore the proposed methods on top of a controllable all-neural biasing method, collaborative decoding (ColDec). Over ColDec, the proposed methods yield up to 6.1% relative word error rate reduction on LibriSpeech and 16.4% relative character error rate reduction on the in-house dataset.
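The two-stage selection described in the abstract can be summarized in a minimal sketch. The PyTorch snippet below is illustrative only: the function name, the dot-product relevance scoring, and the top-k phrase cutoff are assumptions made for clarity, not the paper's exact architecture.

```python
# Minimal sketch of fine-grained contextual knowledge selection.
# All names and the scoring function are illustrative assumptions.
import torch
import torch.nn.functional as F


def fine_grained_selection(query, phrase_emb, token_emb, token_phrase_id, k=8):
    """Two-stage contextual selection (illustrative sketch).

    query:           (d,)  decoder state used as the attention query
    phrase_emb:      (P, d) one embedding per bias phrase
    token_emb:       (T, d) embeddings of every token of every phrase
    token_phrase_id: (T,)  index of the phrase each token belongs to
    k:               number of phrases kept after phrase selection
    """
    # Stage 1: phrase-level attention, then keep only the top-k phrases.
    phrase_scores = phrase_emb @ query                       # (P,)
    topk_scores, topk_idx = phrase_scores.topk(k)

    # Re-normalize over the selected phrases only, yielding a more
    # focused phrase-level contextual representation.
    phrase_weights = F.softmax(topk_scores, dim=-1)          # (k,)
    phrase_context = phrase_weights @ phrase_emb[topk_idx]   # (d,)

    # Stage 2: token-level attention restricted to the tokens that
    # belong to the selected phrase candidates.
    keep = torch.isin(token_phrase_id, topk_idx)
    tok_scores = token_emb[keep] @ query
    tok_weights = F.softmax(tok_scores, dim=-1)
    token_context = tok_weights @ token_emb[keep]            # (d,)

    return phrase_context, token_context
```

In the actual model the resulting phrase- and token-level context vectors would be fused with the decoder state (and position information would be injected into the embeddings); here they are simply returned to keep the sketch self-contained.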
