Paper title
Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition
Paper authors
Abstract
Incorporating biasing words obtained as contextual knowledge is critical for many automatic speech recognition (ASR) applications. This paper proposes the use of graph neural network (GNN) encodings in a tree-constrained pointer generator (TCPGen) component for end-to-end contextual ASR. The biasing words are stored in a prefix tree, which is encoded with a tree-based GNN. Each tree node then carries information about all wordpieces on the branches rooted at it, giving end-to-end ASR decoding a lookahead for future wordpieces and allowing a more accurate prediction of the generation probability of the biasing words. Systems were evaluated on the LibriSpeech corpus using simulated biasing tasks, and on the AMI corpus using a novel visually grounded contextual ASR pipeline that extracts biasing words from the slides accompanying each meeting. Results showed that TCPGen with GNN encodings achieved a further relative word error rate (WER) reduction of about 15% on the biasing words compared to the original TCPGen, with a negligible increase in the computation cost for decoding.
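The lookahead idea in the abstract can be illustrated with a small sketch. It builds a prefix tree over wordpieces and runs one bottom-up pass in which each node's vector combines its own wordpiece embedding with the summed encodings of its children, so a node's encoding depends on everything below it. This is an illustration only, not the paper's formulation: the update rule, the random untrained weights, the dimension, and every name here (`Node`, `build_prefix_tree`, `make_params`, `encode`, the example wordpieces) are assumptions made for the example.

```python
import math
import random


class Node:
    """One prefix-tree node; `piece` is the wordpiece on the edge into it."""

    def __init__(self, piece=None):
        self.piece = piece          # None for the root
        self.children = {}          # wordpiece -> Node
        self.is_word_end = False    # True if a biasing word ends here
        self.encoding = None        # filled in by encode()


def build_prefix_tree(words):
    """Insert each biasing word (a list of wordpieces) into a prefix tree."""
    root = Node()
    for pieces in words:
        node = root
        for piece in pieces:
            if piece not in node.children:
                node.children[piece] = Node(piece)
            node = node.children[piece]
        node.is_word_end = True
    return root


def make_params(vocab, dim, seed=0):
    """Random, untrained parameters (assumption: a real system learns these)."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    return {
        "embed": {piece: rand_vec() for piece in vocab},
        "root": [0.0] * dim,
        "W_self": [rand_vec() for _ in range(dim)],
        "W_child": [rand_vec() for _ in range(dim)],
    }


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def encode(node, params, dim):
    """Bottom-up pass: h = tanh(W_self @ e + W_child @ sum(child encodings))."""
    child_sum = [0.0] * dim
    for child in node.children.values():
        child_h = encode(child, params, dim)
        child_sum = [a + b for a, b in zip(child_sum, child_h)]
    embedding = params["embed"].get(node.piece, params["root"])
    own = matvec(params["W_self"], embedding)
    below = matvec(params["W_child"], child_sum)
    node.encoding = [math.tanh(a + b) for a, b in zip(own, below)]
    return node.encoding


DIM = 4
# Hypothetical wordpiece segmentations of two biasing words.
biasing_words = [["tur", "n", "ip"], ["tur", "k", "ey"]]
vocab = sorted({piece for word in biasing_words for piece in word})
params = make_params(vocab, DIM)
root = build_prefix_tree(biasing_words)
encode(root, params, DIM)

tur = root.children["tur"]
print(sorted(tur.children))  # wordpieces allowed after "tur": ['k', 'n']
```

In this sketch the encoding stored at the `tur` node changes if a word is added or removed anywhere beneath it, which is the sense in which a node "looks ahead" along its branches. How such encodings feed into the generation probability is left out here, since the abstract does not specify it.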