Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

Paper Title

Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

Authors

Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka

Abstract

Recently, an end-to-end speaker-attributed automatic speech recognition (E2E SA-ASR) model was proposed as a joint model of speaker counting, speech recognition, and speaker identification for monaural overlapped speech. In the previous study, the model parameters were trained based on the speaker-attributed maximum mutual information (SA-MMI) criterion, with which the joint posterior probability of multi-talker transcription and speaker identification is maximized over the training data. Although SA-MMI training showed promising results for overlapped speech consisting of various numbers of speakers, the training criterion was not directly linked to the final evaluation metric, i.e., the speaker-attributed word error rate (SA-WER). In this paper, we propose a speaker-attributed minimum Bayes risk (SA-MBR) training method in which the parameters are trained to directly minimize the expected SA-WER over the training data. Experiments using the LibriSpeech corpus show that the proposed SA-MBR training reduces the SA-WER by 9.0% relative compared with the SA-MMI-trained model.
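
To make the idea of minimizing an expected error rate concrete, the following is a minimal sketch of an expected-risk computation over a list of candidate hypotheses. It is an illustration only and not the paper's implementation: the abstract does not say how the expectation is approximated, so the use of an N-best list, the function name expected_risk, and the inputs (per-hypothesis log scores and per-hypothesis error counts) are all assumptions introduced here.

```python
import math


def expected_risk(log_scores, error_counts):
    """Return the expected error count over a list of hypotheses.

    Assumption (not from the abstract): the expectation is approximated
    over a finite N-best list, with hypothesis posteriors obtained by
    renormalizing the model's log scores over that list.

    log_scores:   hypothetical log scores, one per hypothesis.
    error_counts: hypothetical error counts (e.g., speaker-attributed
                  word errors) of each hypothesis against the reference.
    """
    if not log_scores or len(log_scores) != len(error_counts):
        raise ValueError("inputs must be non-empty and of equal length")

    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(log_scores)
    weights = [math.exp(score - max_score) for score in log_scores]
    total = sum(weights)

    # Posterior-weighted average of the per-hypothesis error counts.
    return sum(w / total * e for w, e in zip(weights, error_counts))


# Two equally likely hypotheses with 2 and 4 errors.
risk = expected_risk([0.0, 0.0], [2, 4])
```

In a training setup, a quantity of this kind would be computed with differentiable scores so that its gradient shifts probability mass toward hypotheses with fewer errors. The plain-Python version here only shows the arithmetic.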
