Paper title
End-to-end Music-mixed Speech Recognition
Paper authors
Paper abstract
Automatic speech recognition (ASR) in multimedia content is a promising application, but speech in such content is frequently mixed with background music, which degrades ASR performance. In this study, we propose a method for improving ASR in the presence of background music based on time-domain source separation. We use Conv-TasNet, which has achieved state-of-the-art performance in multi-speaker source separation, as a separation network to extract the speech signal from a speech-music mixture in the waveform domain. We also propose joint fine-tuning of a pre-trained Conv-TasNet front-end with an attention-based ASR back-end using both separation and ASR objectives. We evaluated our method through ASR experiments on speech data mixed with background music from a wide variety of Japanese animations. We show that time-domain speech-music separation drastically improves the ASR performance of a back-end model trained on mixture data, and that joint optimization yields a further significant WER reduction. The time-domain separation method outperforms a frequency-domain separation method, which reuses the phase information of the input mixture signal, in both simple cascading and joint training settings. We also demonstrate that our method is robust to music interference from the classical, jazz, and popular genres.
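Conv-TasNet is commonly trained with a scale-invariant signal-to-noise ratio (SI-SNR) objective, and the joint fine-tuning described above combines a separation objective with the ASR loss. The sketch below is a minimal illustration, not the paper's implementation: the SI-SNR formula is the standard one, while the weighted-sum combination and the interpolation weight `lam` are assumptions for illustration only.

```python
import numpy as np

def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR (dB) between an estimated and a target waveform."""
    # Remove DC offset so the measure is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "clean" component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

def joint_loss(est_speech: np.ndarray, ref_speech: np.ndarray,
               asr_loss: float, lam: float = 0.1) -> float:
    """Hypothetical joint objective: ASR loss plus a weighted separation term.

    Maximizing SI-SNR corresponds to minimizing its negative, so the
    separation term enters with a minus sign. `lam` is an assumed
    hyperparameter, not a value from the paper.
    """
    return asr_loss - lam * si_snr(est_speech, ref_speech)

# Usage: a perfectly recovered (even rescaled) waveform gives a high SI-SNR,
# which lowers the joint loss relative to the ASR loss alone.
ref = np.sin(np.linspace(0.0, 10.0, 1000))
total = joint_loss(2.0 * ref, ref, asr_loss=5.0)
```

Because SI-SNR is invariant to the overall scale of the estimate, the separation term rewards recovering the speech waveform's shape rather than its absolute amplitude, which is the property that makes it a standard objective for time-domain separation networks.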