Paper Title
Streaming Multi-Talker ASR with Token-Level Serialized Output Training
Paper Authors
Paper Abstract
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token indicating a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in our experiments with the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
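The serialization scheme described above can be illustrated with a small sketch: tokens from overlapping utterances are merged by emission time, and a special channel-change token is emitted whenever consecutive tokens come from different virtual channels. The token name `<cc>`, the word-level timings, and the helper function below are illustrative assumptions, not the authors' implementation.

```python
CC = "<cc>"  # hypothetical special token marking a "virtual" output channel change

def serialize(utterances):
    """Merge token streams from overlapping utterances into one sequence
    ordered by emission time, inserting <cc> whenever the active virtual
    channel switches. Each utterance is a list of (emission_time, token)."""
    # Tag every token with the index of the utterance (virtual channel) it belongs to.
    tagged = [(t, tok, ch) for ch, utt in enumerate(utterances) for t, tok in utt]
    tagged.sort(key=lambda x: x[0])  # chronological order by emission time

    out, current = [], None
    for _, tok, ch in tagged:
        if current is not None and ch != current:
            out.append(CC)  # channel changed between consecutive tokens
        out.append(tok)
        current = ch
    return out

# Two overlapping utterances with made-up word-level emission times (seconds).
spk1 = [(0.0, "hello"), (0.6, "how"), (0.9, "are"), (1.2, "you")]
spk2 = [(0.7, "good"), (1.0, "morning")]
print(serialize([spk1, spk2]))
# → ['hello', 'how', '<cc>', 'good', '<cc>', 'are', '<cc>', 'morning', '<cc>', 'you']
```

Because the output is a single flat token stream, a standard single-output streaming decoder (e.g., a transformer transducer) can be trained on it without any architectural change, which is the source of the lower inference cost noted in the abstract.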