Paper Title

A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction

Authors

Zexu Pan, Meng Ge, Haizhou Li

Abstract

The speaker extraction algorithm extracts the target speech from a mixture speech containing interference speech and background noise. The extraction process sometimes over-suppresses the extracted target speech, which not only creates artifacts during listening but also harms the performance of downstream automatic speech recognition algorithms. We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to address the over-suppression problem. On top of the waveform-level loss used for superior signal quality, i.e., SI-SDR, we introduce a multi-resolution delta spectrum loss in the frequency domain to ensure the continuity of the extracted speech signal, thus alleviating the over-suppression. We examine the hybrid continuity loss function using a time-domain audio-visual speaker extraction algorithm on the YouTube LRS2-BBC dataset. Experimental results show that the proposed loss function reduces the over-suppression and improves the word error rate of speech recognition on both clean and noisy two-speaker mixtures, without harming the reconstructed speech quality.
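Below is a minimal sketch of how such a hybrid loss could be composed, assuming a PyTorch setup. The STFT resolutions, the frame-wise delta definition, and the weighting factor `alpha` are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Sketch of a hybrid continuity loss: waveform-level SI-SDR loss
# plus a multi-resolution delta spectrum loss in the frequency domain.
# Resolutions, delta definition, and weighting are assumptions.
import torch
import torch.nn.functional as F


def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR between estimated and reference waveforms of shape (B, T)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True)
            / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    si_sdr = 10 * torch.log10(
        (torch.sum(proj ** 2, dim=-1) + eps) / (torch.sum(noise ** 2, dim=-1) + eps))
    return -si_sdr.mean()


def delta_spectrum_loss(est: torch.Tensor, ref: torch.Tensor,
                        fft_sizes=(512, 1024, 2048)) -> torch.Tensor:
    """L1 loss on frame-to-frame differences (deltas) of magnitude spectrograms,
    averaged over several STFT resolutions, to encourage spectral continuity."""
    loss = est.new_zeros(())
    for n_fft in fft_sizes:
        hop = n_fft // 4
        win = torch.hann_window(n_fft, device=est.device)
        mag_e = torch.stft(est, n_fft, hop, window=win, return_complex=True).abs()
        mag_r = torch.stft(ref, n_fft, hop, window=win, return_complex=True).abs()
        # Delta along the frame (time) axis penalizes sudden drops in energy,
        # which correspond to over-suppressed segments.
        delta_e = mag_e[..., 1:] - mag_e[..., :-1]
        delta_r = mag_r[..., 1:] - mag_r[..., :-1]
        loss = loss + F.l1_loss(delta_e, delta_r)
    return loss / len(fft_sizes)


def hybrid_continuity_loss(est: torch.Tensor, ref: torch.Tensor,
                           alpha: float = 0.5) -> torch.Tensor:
    """Combine the waveform-level loss with the frequency-domain continuity term."""
    return si_sdr_loss(est, ref) + alpha * delta_spectrum_loss(est, ref)
```

In this sketch the delta spectrum term only adds a regularizing gradient on spectral transitions, so the waveform-level SI-SDR term still dominates reconstruction quality; how the two terms are actually balanced should follow the paper rather than the placeholder `alpha` above.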
