Paper Title

The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

Paper Authors

Naijun Zheng, Na Li, Xixin Wu, Lingwei Meng, Jiawen Kang, Haibin Wu, Chao Weng, Dan Su, Helen Meng

Paper Abstract

This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for diarization. Based on the assumption that there is valuable complementary information among acoustic, spatial-related, and speaker-related features, we propose a multi-level feature fusion mechanism based target-speaker voice activity detection (FFM-TS-VAD) system to improve the performance of the conventional TS-VAD system. Furthermore, we propose a data augmentation method during training to improve the system robustness when the angular difference between two speakers is relatively small. We provide comparisons of the different sub-systems we used in the M2MeT challenge. Our submission is a fusion of several sub-systems and ranks second in the diarization task.
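As a rough illustration of the feature fusion idea described in the abstract, the sketch below pairs frame-level acoustic and spatial features with per-target-speaker embeddings before a shared encoder, in the spirit of TS-VAD. The module name, the feature dimensions (log-fbank, IPD-like spatial features, x-vector-style embeddings), and the concatenation-based fusion are illustrative assumptions, not the authors' actual FFM-TS-VAD architecture.

```python
import torch
import torch.nn as nn

class FeatureFusionTSVAD(nn.Module):
    """Minimal sketch of a multi-level feature fusion TS-VAD-style model.

    Fuses acoustic, spatial, and speaker-related streams by projection and
    concatenation; the real FFM-TS-VAD fusion mechanism may differ.
    """

    def __init__(self, n_fbank=80, n_spatial=64, n_spk_emb=192, d_model=256):
        super().__init__()
        self.acoustic_proj = nn.Linear(n_fbank, d_model)    # log-fbank stream
        self.spatial_proj = nn.Linear(n_spatial, d_model)   # e.g. inter-channel phase differences
        self.speaker_proj = nn.Linear(n_spk_emb, d_model)   # target-speaker embeddings
        self.encoder = nn.LSTM(3 * d_model, d_model, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * d_model, 1)               # per-frame speech posterior

    def forward(self, fbank, spatial, spk_embs):
        # fbank:    (B, T, n_fbank)   acoustic features
        # spatial:  (B, T, n_spatial) spatial features from the microphone array
        # spk_embs: (B, N, n_spk_emb) one embedding per target speaker
        B, T, _ = fbank.shape
        N = spk_embs.shape[1]
        frame = torch.cat([self.acoustic_proj(fbank),
                           self.spatial_proj(spatial)], dim=-1)        # (B, T, 2d)
        spk = self.speaker_proj(spk_embs)                              # (B, N, d)
        # Pair every frame with every target-speaker embedding (TS-VAD style).
        frame = frame.unsqueeze(1).expand(-1, N, -1, -1)               # (B, N, T, 2d)
        spk = spk.unsqueeze(2).expand(-1, -1, T, -1)                   # (B, N, T, d)
        fused = torch.cat([frame, spk], dim=-1).reshape(B * N, T, -1)  # (B*N, T, 3d)
        out, _ = self.encoder(fused)
        return torch.sigmoid(self.head(out)).reshape(B, N, T)          # per-speaker VAD probs

# Example: 2 recordings, 300 frames, 4 target speakers.
model = FeatureFusionTSVAD()
probs = model(torch.randn(2, 300, 80), torch.randn(2, 300, 64), torch.randn(2, 4, 192))
print(probs.shape)  # torch.Size([2, 4, 300])
```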
