Paper Title
Speech Enhancement and Dereverberation with Diffusion-based Generative Models
Paper Authors
Paper Abstract
In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. As opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process, which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluating on a corpus different from the one used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and is thus not limited to additive background noise removal. Code and audio examples are available online at https://github.com/sp-uhh/sgmse
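To make the forward/reverse formulation in the abstract concrete, the sketch below shows a minimal Euler-Maruyama sampler for a reverse SDE whose forward process combines a drift pulling clean speech toward the noisy speech with a variance-exploding noise schedule. This is an illustrative sketch under stated assumptions, not the authors' implementation: the constants gamma, sigma_min, and sigma_max, and the score_model(x, y, t) interface are placeholders chosen for clarity (the linked repository contains the reference code).

```python
import math
import torch

# Hedged sketch (not the paper's exact code): Euler-Maruyama integration of a
# reverse SDE for speech enhancement. The forward process drifts from clean
# speech toward the noisy speech y while injecting Gaussian noise, so the
# reverse process starts from y plus Gaussian noise instead of pure noise.
# gamma, sigma_min, sigma_max, and score_model are illustrative assumptions.

gamma, sigma_min, sigma_max = 1.5, 0.05, 0.5

def forward_drift(x, y):
    # Drift term pulling the current state x toward the noisy speech y.
    return gamma * (y - x)

def g(t):
    # Variance-exploding diffusion coefficient g(t).
    return sigma_min * (sigma_max / sigma_min) ** t * math.sqrt(
        2.0 * math.log(sigma_max / sigma_min)
    )

@torch.no_grad()
def enhance(score_model, y, n_steps=30):
    """Integrate the reverse SDE from t=1 down to t=0 in n_steps steps."""
    dt = 1.0 / n_steps
    # Initialization: mixture of noisy speech and Gaussian noise, matching
    # the terminal distribution of the forward process.
    x = y + g(1.0) * torch.randn_like(y)
    for i in range(n_steps):
        t = 1.0 - i * dt
        gt = g(t)
        # Reverse-time drift: forward drift minus g(t)^2 times the learned
        # score of the perturbed state, stepped backward in time.
        x = x - (forward_drift(x, y) - gt**2 * score_model(x, y, t)) * dt
        x = x + gt * math.sqrt(dt) * torch.randn_like(x)  # stochastic increment
    return x
```

Note how the sampler is initialized from the noisy speech plus Gaussian noise rather than from pure noise, and how n_steps=30 mirrors the step count reported in the abstract; raising or lowering it trades computational speed against enhancement quality, which is the balance the sampler-configuration study examines.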