Paper Title
Streaming Multi-Talker ASR with Token-Level Serialized Output Training
Paper Authors
Paper Abstract
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token indicating a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in our experiments with the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
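The serialization scheme described above can be illustrated with a small sketch: tokens from overlapping utterances are merged by emission time, and a special channel-change token is emitted whenever consecutive tokens come from different virtual channels. The token name `<cc>`, the word-level timings, and the helper function below are illustrative assumptions, not the authors' implementation.

```python
CC = "<cc>"  # hypothetical special token marking a "virtual" output channel change

def serialize(utterances):
    """Merge token streams from overlapping utterances into one sequence
    ordered by emission time, inserting <cc> whenever the active virtual
    channel switches. Each utterance is a list of (emission_time, token)."""
    # Tag every token with the index of the utterance (virtual channel) it belongs to.
    tagged = [(t, tok, ch) for ch, utt in enumerate(utterances) for t, tok in utt]
    tagged.sort(key=lambda x: x[0])  # chronological order by emission time

    out, current = [], None
    for _, tok, ch in tagged:
        if current is not None and ch != current:
            out.append(CC)  # channel changed between consecutive tokens
        out.append(tok)
        current = ch
    return out

# Two overlapping utterances with made-up word-level emission times (seconds).
spk1 = [(0.0, "hello"), (0.6, "how"), (0.9, "are"), (1.2, "you")]
spk2 = [(0.7, "good"), (1.0, "morning")]
print(serialize([spk1, spk2]))
# → ['hello', 'how', '<cc>', 'good', '<cc>', 'are', '<cc>', 'morning', '<cc>', 'you']
```

Because the output is a single flat token stream, a standard single-output streaming decoder (e.g., a transformer transducer) can be trained on it without any architectural change, which is the source of the lower inference cost noted in the abstract.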