A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

Paper title

A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings

Authors

Yu, Fan; Du, Zhihao; Zhang, Shiliang; Lin, Yuxiao; Xie, Lei

Abstract

In this paper, we conduct a comparative study on speaker-attributed automatic speech recognition (SA-ASR) in the multi-party meeting scenario, a topic receiving increasing attention in meeting rich transcription. Specifically, three approaches are evaluated in this study. The first approach, FD-SOT, consists of a frame-level diarization model to identify speakers and a multi-talker ASR model to recognize utterances. The speaker-attributed transcriptions are obtained by aligning the diarization results with the recognized hypotheses. However, because the two modules are independent, such an alignment strategy may suffer from erroneous timestamps, severely hindering model performance. Therefore, we propose the second approach, WD-SOT, which addresses alignment errors by introducing a word-level diarization model that removes the dependency on timestamp alignment. To further mitigate alignment issues, we propose the third approach, TS-ASR, which jointly trains a target-speaker separation module and an ASR module. By comparing various strategies for each SA-ASR approach, experimental results on a real meeting scenario corpus, AliMeeting, reveal that the WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER) compared with the FD-SOT approach. In addition, the TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative reduction in averaged SD-CER.
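
The relative reductions quoted in the abstract are presumably computed in the usual way, as the drop in error rate divided by the baseline error rate. The following is a minimal sketch of that arithmetic under this assumption. The function name `relative_reduction` and the SD-CER values in the usage lines are hypothetical illustrations, since the abstract does not report absolute error rates.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction (in percent) from baseline to improved.

    Both arguments are error rates in the same unit (for example, percent CER).
    """
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline * 100.0


# Hypothetical numbers, NOT taken from the paper: a baseline SD-CER of 40.0%
# lowered to 35.72% corresponds to a relative reduction of about 10.7%.
hypothetical_baseline = 40.0
hypothetical_improved = 35.72
reduction = relative_reduction(hypothetical_baseline, hypothetical_improved)
print(f"relative reduction: {reduction:.1f}%")
```

A relative figure of this kind depends on the baseline: the same absolute drop yields a larger relative reduction when the baseline error rate is lower.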
