Self-Supervised Learning for Speech Enhancement through Synthesis

Paper Title

Self-Supervised Learning for Speech Enhancement through Synthesis

Authors

Bryce Irvin, Marko Stamenovic, Mikolaj Kegler, Li-Chia Yang

Abstract

Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency masking, latent representation masking, or discriminative signal prediction. In contrast, some recent works explore SE via generative speech synthesis, where the system's output is synthesized by a neural vocoder after an inherently lossy feature-denoising step. In this paper, we propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech. We leverage rich representations from self-supervised learning (SSL) speech models to discover relevant features. We conduct a candidate search across 15 potential SSL front-ends and subsequently train our vocoder adversarially with the best SSL configuration. Additionally, we demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation. Finally, we conduct both objective evaluations and subjective listening studies to show our system improves objective metrics and outperforms an existing state-of-the-art SE model subjectively.
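To make the streaming claim concrete, the following is a minimal sketch of frame-by-frame causal processing, not the authors' implementation. It assumes that the 10 ms latency corresponds to the frame size and that audio is sampled at 16 kHz; the abstract states neither. The names `samples_per_frame`, `stream_enhance`, and `enhance_frame` are hypothetical, and a pass-through function stands in for the SSL front-end and vocoder.

```python
from typing import Callable, List, Sequence


def samples_per_frame(latency_ms: float, sample_rate_hz: int) -> int:
    """Convert a frame duration in milliseconds to a whole number of samples."""
    return round(latency_ms * sample_rate_hz / 1000)


def stream_enhance(
    noisy: Sequence[float],
    frame_len: int,
    enhance_frame: Callable[[Sequence[float]], List[float]],
) -> List[float]:
    """Enhance audio one frame at a time; no frame sees future samples.

    A trailing partial frame is dropped to keep the sketch short.
    """
    enhanced: List[float] = []
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        frame = noisy[start:start + frame_len]
        enhanced.extend(enhance_frame(frame))
    return enhanced


# Assumption: 16 kHz audio, so a 10 ms frame holds 160 samples.
frame_len = samples_per_frame(10, 16_000)

# Hypothetical placeholder for the denoising model: returns the frame unchanged.
passthrough = lambda frame: list(frame)

output = stream_enhance([0.0] * 480, frame_len, passthrough)
```

In a real system the placeholder would be replaced by a causal feature extractor and vocoder that carry state across frames. The loop only illustrates why the frame length sets a lower bound on the delay.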
