Paper Title
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
Paper Authors
Paper Abstract
The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice from only one utterance of the target speaker. Although the challenges of adapting to new voices in the zero-shot scenario exist in both stages -- acoustic modeling and the vocoder -- previous works usually consider the problem in only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem in both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model that extracts the latent distribution $p(z)$ of speech and reconstructs the waveform from it. A flow-based acoustic model then only needs to learn the same $p(z)$ from text, which naturally avoids the mismatch between the acoustic model and the vocoder and yields high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, so we can further perform high-quality zero-shot speech generation for new speakers. We particularly investigate two methods of constructing the speaker space, namely a pre-trained speaker encoder and a jointly trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments conducted on the LibriTTS and VCTK corpora.
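To make the two-stage idea in the abstract concrete, below is a minimal PyTorch-style sketch. It is not the authors' implementation: `WaveGANCodec`, `FlowAcousticModel`, the layer choices, and all dimensions are illustrative assumptions; the actual model uses a VAE-GAN vocoder and a multi-step normalizing flow, which are reduced here to a single convolutional codec and one conditional affine transform so the example stays short and runnable.

```python
# Illustrative sketch only (assumed names/shapes, not the authors' code):
# the acoustic model and vocoder share the same latent space p(z), so the
# flow's output feeds the WaveGAN decoder directly, with no mel mismatch.
import torch
import torch.nn as nn

class WaveGANCodec(nn.Module):
    """Stand-in for the universal WaveGAN: encodes speech into latent
    frames z ~ p(z) and reconstructs the waveform directly from z."""
    def __init__(self, latent_dim=64, hop=256):
        super().__init__()
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=hop, stride=hop)
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=hop, stride=hop)

    def encode(self, wav):   # wav: (B, 1, T) -> z: (B, D, T // hop)
        return self.encoder(wav)

    def decode(self, z):     # z -> waveform, no mel-spectrogram in between
        return self.decoder(z)

class FlowAcousticModel(nn.Module):
    """Stand-in for the flow-based acoustic model: an invertible map between
    noise eps and the vocoder latent z, conditioned on encoded text and a
    speaker embedding (pre-trained or jointly trained)."""
    def __init__(self, latent_dim=64, text_dim=64, spk_dim=32):
        super().__init__()
        self.cond = nn.Conv1d(text_dim + spk_dim, 2 * latent_dim, kernel_size=1)

    def _affine_params(self, text_h, spk):
        spk_seq = spk.unsqueeze(-1).expand(-1, -1, text_h.size(-1))
        m, log_s = self.cond(torch.cat([text_h, spk_seq], dim=1)).chunk(2, dim=1)
        return m, log_s

    def forward(self, eps, text_h, spk):   # eps -> z (sampling direction)
        m, log_s = self._affine_params(text_h, spk)
        return eps * torch.exp(log_s) + m

    def inverse(self, z, text_h, spk):     # z -> eps (exact, since flows invert)
        m, log_s = self._affine_params(text_h, spk)
        return (z - m) * torch.exp(-log_s)

codec, flow = WaveGANCodec(), FlowAcousticModel()

# Zero-shot TTS: sample z from the speaker-conditional distribution modeled
# by the flow, then let the shared WaveGAN decode it -- no vocoder fine-tuning.
text_h = torch.randn(2, 64, 50)   # encoded text, upsampled to frame rate
spk = torch.randn(2, 32)          # embedding of an unseen target speaker
z = flow(torch.randn(2, 64, 50), text_h, spk)
wav = codec.decode(z)             # (2, 1, 12800)

# Any-to-any VC: encode source speech to z, invert the flow with the source
# speaker to strip identity, then rerun it forward with the target speaker.
# (text_h is reused here for brevity; in practice the linguistic condition
# would come from the content of the source utterance.)
spk_src, spk_tgt = torch.randn(2, 32), torch.randn(2, 32)
z_src = codec.encode(torch.randn(2, 1, 12800))
eps_content = flow.inverse(z_src, text_h, spk_src)
wav_vc = codec.decode(flow(eps_content, text_h, spk_tgt))
```

The sketch highlights the two properties the abstract relies on: because the acoustic model is trained to match the vocoder's own latent distribution $p(z)$, the two stages compose without fine-tuning, and because the flow is invertible, the same model supports both sampling (TTS) and re-conditioning an existing utterance on a new speaker (VC).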