GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus

Paper title

GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus

Authors

Zining Zhang, Bingsheng He, Zhenjie Zhang

Abstract

Non-parallel many-to-many voice conversion has recently attracted substantial research effort in the speech processing community. A voice conversion system transforms an utterance of a source speaker into an utterance of a target speaker by keeping the content of the original utterance and replacing its vocal features with those of the target speaker. Existing solutions, e.g., StarGAN-VC2, present promising results only when a speech corpus of the engaged speakers is available during model training. AUTOVC is able to perform voice conversion on unseen speakers, but it needs an external pretrained speaker verification model. In this paper, we present our new GAN-based zero-shot voice conversion solution, called GAZEV, which aims to support unseen speakers on both source and target utterances. Our key technical contribution is the adoption of a speaker embedding loss on top of the GAN framework, as well as an adaptive instance normalization strategy, in order to address the limitations of speaker identity transfer in existing solutions. Our empirical evaluations demonstrate a significant improvement in output speech quality and comparable speaker similarity relative to AUTOVC.
