Paper Title
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
Paper Authors
Paper Abstract
The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice from only one utterance of the target speaker. Although the challenges of adapting to new voices in the zero-shot scenario exist in both stages -- acoustic modeling and the vocoder -- previous works usually consider the problem in only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem in both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model that extracts the latent distribution $p(z)$ of speech and reconstructs the waveform from it. A flow-based acoustic model then only needs to learn the same $p(z)$ from text, which naturally avoids the mismatch between the acoustic model and the vocoder and yields high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, so we can further perform high-quality zero-shot speech generation for new speakers. We particularly investigate two methods of constructing the speaker space, namely a pre-trained speaker encoder and a jointly trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments conducted on the LibriTTS and VCTK corpora.
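To make the two-stage idea in the abstract concrete, below is a minimal PyTorch-style sketch. It is not the authors' implementation: `WaveGANCodec`, `FlowAcousticModel`, the layer choices, and all dimensions are illustrative assumptions; the actual model uses a VAE-GAN vocoder and a multi-step normalizing flow, which are reduced here to a single convolutional codec and one conditional affine transform so the example stays short and runnable.

```python
# Illustrative sketch only (assumed names/shapes, not the authors' code):
# the acoustic model and vocoder share the same latent space p(z), so the
# flow's output feeds the WaveGAN decoder directly, with no mel mismatch.
import torch
import torch.nn as nn

class WaveGANCodec(nn.Module):
    """Stand-in for the universal WaveGAN: encodes speech into latent
    frames z ~ p(z) and reconstructs the waveform directly from z."""
    def __init__(self, latent_dim=64, hop=256):
        super().__init__()
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=hop, stride=hop)
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=hop, stride=hop)

    def encode(self, wav):   # wav: (B, 1, T) -> z: (B, D, T // hop)
        return self.encoder(wav)

    def decode(self, z):     # z -> waveform, no mel-spectrogram in between
        return self.decoder(z)

class FlowAcousticModel(nn.Module):
    """Stand-in for the flow-based acoustic model: an invertible map between
    noise eps and the vocoder latent z, conditioned on encoded text and a
    speaker embedding (pre-trained or jointly trained)."""
    def __init__(self, latent_dim=64, text_dim=64, spk_dim=32):
        super().__init__()
        self.cond = nn.Conv1d(text_dim + spk_dim, 2 * latent_dim, kernel_size=1)

    def _affine_params(self, text_h, spk):
        spk_seq = spk.unsqueeze(-1).expand(-1, -1, text_h.size(-1))
        m, log_s = self.cond(torch.cat([text_h, spk_seq], dim=1)).chunk(2, dim=1)
        return m, log_s

    def forward(self, eps, text_h, spk):   # eps -> z (sampling direction)
        m, log_s = self._affine_params(text_h, spk)
        return eps * torch.exp(log_s) + m

    def inverse(self, z, text_h, spk):     # z -> eps (exact, since flows invert)
        m, log_s = self._affine_params(text_h, spk)
        return (z - m) * torch.exp(-log_s)

codec, flow = WaveGANCodec(), FlowAcousticModel()

# Zero-shot TTS: sample z from the speaker-conditional distribution modeled
# by the flow, then let the shared WaveGAN decode it -- no vocoder fine-tuning.
text_h = torch.randn(2, 64, 50)   # encoded text, upsampled to frame rate
spk = torch.randn(2, 32)          # embedding of an unseen target speaker
z = flow(torch.randn(2, 64, 50), text_h, spk)
wav = codec.decode(z)             # (2, 1, 12800)

# Any-to-any VC: encode source speech to z, invert the flow with the source
# speaker to strip identity, then rerun it forward with the target speaker.
# (text_h is reused here for brevity; in practice the linguistic condition
# would come from the content of the source utterance.)
spk_src, spk_tgt = torch.randn(2, 32), torch.randn(2, 32)
z_src = codec.encode(torch.randn(2, 1, 12800))
eps_content = flow.inverse(z_src, text_h, spk_src)
wav_vc = codec.decode(flow(eps_content, text_h, spk_tgt))
```

The sketch highlights the two properties the abstract relies on: because the acoustic model is trained to match the vocoder's own latent distribution $p(z)$, the two stages compose without fine-tuning, and because the flow is invertible, the same model supports both sampling (TTS) and re-conditioning an existing utterance on a new speaker (VC).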