Paper Title

Delay-penalized transducer for low-latency streaming ASR

Authors

Kang, Wei, Yao, Zengwei, Kuang, Fangjun, Guo, Liyong, Yang, Xiaoyu, Lin, Long, Żelasko, Piotr, Povey, Daniel

Abstract

In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible while having minimum impact on recognition accuracy. Although a few existing methods are able to achieve this goal, they are difficult to implement due to their dependency on external alignments. In this paper, we propose a simple way to penalize symbol delay in the transducer model, so that we can balance the trade-off between symbol delay and accuracy for streaming models without external alignments. Specifically, our method adds a small constant times (T/2 - t), where T is the number of frames and t is the current frame, to all the non-blank log-probabilities (after normalization) that are fed into the two-dimensional transducer recursion. For both streaming Conformer models and unidirectional long short-term memory (LSTM) models, experimental results show that it can significantly reduce the symbol delay with an acceptable performance degradation. Our method achieves a similar delay-accuracy trade-off to the previously published FastEmit, but we believe our method is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay. Our work is open-sourced and publicly available (https://github.com/k2-fsa/k2).
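The abstract describes adding a small constant times (T/2 - t) to the non-blank log-probabilities before they enter the transducer recursion. Below is a minimal PyTorch sketch of that idea; the tensor layout (batch, T, U+1, vocab), the function name apply_delay_penalty, and the penalty scale are illustrative assumptions, not the actual k2 implementation.

```python
import torch


def apply_delay_penalty(log_probs: torch.Tensor,
                        penalty: float,
                        blank_id: int = 0) -> torch.Tensor:
    """Add penalty * (T/2 - t) to the non-blank log-probabilities.

    log_probs is assumed to have shape (batch, T, U + 1, vocab) and to be
    normalized (e.g. via log_softmax) before the penalty is applied, as the
    abstract describes. Blank log-probabilities are left unchanged.
    """
    _, T, _, vocab_size = log_probs.shape
    t = torch.arange(T, dtype=log_probs.dtype, device=log_probs.device)
    # (T/2 - t) is positive for early frames and negative for late ones,
    # so adding it rewards emitting non-blank symbols earlier.
    offset = penalty * (T / 2.0 - t).view(1, T, 1, 1)
    # Zero out the offset on the blank symbol so that only non-blank
    # log-probabilities are shifted.
    non_blank_mask = torch.ones(vocab_size, dtype=log_probs.dtype,
                                device=log_probs.device)
    non_blank_mask[blank_id] = 0.0
    return log_probs + offset * non_blank_mask


# Illustrative usage with made-up sizes: batch 2, 10 frames, 5 label
# positions, vocabulary of 100 with blank id 0 (all assumptions).
lp = torch.randn(2, 10, 5, 100).log_softmax(dim=-1)
penalized = apply_delay_penalty(lp, penalty=0.01)
```

Because (T/2 - t) is centred around zero across the frame axis, the added term mainly shifts probability mass toward earlier frames rather than uniformly inflating non-blank scores; the abstract argues this is equivalent to penalizing the average symbol delay.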
