Paper Title

The RoyalFlush System of Speech Recognition for M2MeT Challenge

Authors

Shuaishuai Ye, Peiyao Wang, Shunfei Chen, Xinhui Hu, Xinkang Xu

Abstract

This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge. We adopted a serialized output training (SOT) based multi-speaker ASR system trained with large-scale simulated data. First, we investigated a set of front-end methods, including multi-channel weighted prediction error (WPE) dereverberation, beamforming, speech separation, and speech enhancement, to process the training, validation, and test sets; based on the experimental results, we selected only WPE and beamforming as our front-end methods. Second, we made great efforts in data augmentation for multi-speaker ASR, mainly including adding noise and reverberation, overlapped-speech simulation, multi-channel speech simulation, speed perturbation, and front-end processing, which brought us a substantial performance improvement. Finally, to exploit the complementarity of different model architectures, we trained a standard Conformer-based joint CTC/Attention model (Conformer) and a U2++ ASR model with a bidirectional attention decoder, a modification of the Conformer, and fused their results. Compared with the official baseline system, our system achieved absolute Character Error Rate (CER) reductions of 12.22% on the validation set and 12.11% on the test set.
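To illustrate the serialized output training (SOT) idea the abstract refers to, here is a minimal sketch of how SOT-style training targets are typically built for overlapped speech: the per-speaker transcripts are ordered by their start times and concatenated into a single label sequence separated by a speaker-change token. The function name and the `<sc>` token are illustrative assumptions, not identifiers from the paper.

```python
def serialize_transcripts(utterances, change_token="<sc>"):
    """Build a single SOT target from per-speaker segments.

    utterances: list of (start_time, transcript) tuples, one per
    speaker segment in the overlapped mixture.
    Returns the transcripts ordered by start time and joined with a
    speaker-change token, as used in SOT-based multi-speaker ASR.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {change_token} ".join(text for _, text in ordered)


# Two overlapping speakers; the earlier-starting transcript comes first.
label = serialize_transcripts([(1.2, "how are you"), (0.3, "hello there")])
# -> "hello there <sc> how are you"
```

A single attention decoder can then be trained on such serialized targets, which avoids the permutation problem of assigning output branches to speakers.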
