DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Paper title

DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Authors

Songxiang Liu, Dan Su, Dong Yu

Abstract

Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been used to solve a variety of speech synthesis problems. However, because of their high sampling costs, DDPMs are difficult to use in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising diffusion generative adversarial networks (GANs), which adopt an adversarially-trained expressive model to approximate the denoising distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can generate high-fidelity speech samples within only 4 denoising steps. We present an active shallow diffusion mechanism to further speed up inference. A two-stage training scheme is proposed, with a basic TTS acoustic model trained at stage one providing valuable prior information for a DDPM trained at stage two. Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.

Paper title