Paper title
DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs
Paper authors
Paper abstract
Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been used to solve a variety of speech synthesis problems. However, because of their high sampling costs, DDPMs are difficult to use in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising diffusion generative adversarial networks (GANs), which adopt an adversarially-trained expressive model to approximate the denoising distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can generate high-fidelity speech samples within only 4 denoising steps. We present an active shallow diffusion mechanism to further speed up inference. A two-stage training scheme is proposed, with a basic TTS acoustic model trained at stage one providing valuable prior information for a DDPM trained at stage two. Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.
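The few-step sampling idea in the abstract can be illustrated with a minimal sketch. It follows the general denoising diffusion GAN recipe: at each step a generator predicts the clean sample from the noisy one, and the next, less noisy sample is drawn from the standard DDPM posterior q(x_{t-1} | x_t, x_0). This is not the authors' implementation. The names `make_schedule`, `posterior_sample`, `sample`, and the `generator(x_t, t, cond)` signature are hypothetical, the beta range is an arbitrary choice, and features are flat lists of floats rather than mel-spectrogram tensors.

```python
import math
import random


def make_schedule(num_steps, beta_min=0.1, beta_max=0.5):
    """Linear beta schedule. The range is an illustrative assumption."""
    if num_steps == 1:
        betas = [beta_max]
    else:
        step = (beta_max - beta_min) / (num_steps - 1)
        betas = [beta_min + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta  # cumulative product of (1 - beta)
        alpha_bars.append(running)
    return betas, alpha_bars


def posterior_sample(x0_pred, x_t, t, betas, alpha_bars, rng):
    """Draw x_{t-1} from the DDPM posterior q(x_{t-1} | x_t, x0_pred).

    Steps are 0-indexed; at t == 0 the posterior collapses onto x0_pred.
    """
    if t == 0:
        return list(x0_pred)
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = math.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = math.sqrt(1.0 - betas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    std = math.sqrt(betas[t] * (1.0 - ab_prev) / (1.0 - ab_t))
    return [
        coef_x0 * a + coef_xt * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0_pred, x_t)
    ]


def sample(generator, cond, num_steps=4, seed=0):
    """Run num_steps denoising steps, starting from Gaussian noise.

    generator(x_t, t, cond) is a hypothetical stand-in for an adversarially
    trained network that predicts the clean features x_0.
    """
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for t in reversed(range(num_steps)):
        x0_pred = generator(x, t, cond)
        x = posterior_sample(x0_pred, x, t, betas, alpha_bars, rng)
    return x


if __name__ == "__main__":
    # Toy "oracle" generator that returns the conditioning as its prediction.
    target = [0.5, -1.0, 2.0]
    print(sample(lambda x_t, t, cond: cond, target, num_steps=4))
```

Because the generator predicts the clean sample directly, each step can cover a large jump in noise level, which is why a handful of steps (or a single one) can be enough in this family of models. With `num_steps=1` the loop reduces to a single generator call.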