Paper Title
FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention
Paper Authors
Paper Abstract
Any-to-any voice conversion aims to convert the voice from and to any speakers, even those unseen during training, which is much more challenging than one-to-one or many-to-many tasks but much more attractive in real-world scenarios. In this paper, we propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectral features of the utterance(s) from the target speaker are obtained from log mel-spectrograms. By aligning the hidden structures of the two different feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance, all based on the attention mechanism of the Transformer as verified with analysis on attention maps, and accomplished end-to-end. This approach is trained with a reconstruction loss only, without any disentanglement considerations between content and speaker information, and does not require parallel data. Both objective evaluation based on speaker verification and subjective evaluation with MOS showed that this approach outperformed SOTA approaches such as AdaIN-VC and AutoVC.
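Since the abstract only sketches the mechanism, below is a minimal, illustrative PyTorch sketch of the core idea: content features from the source utterance (e.g., Wav2Vec 2.0 outputs) serve as cross-attention queries over the target speaker's mel frames, so the attention map indicates which target fragments are retrieved and fused, and the whole model is trained with a reconstruction loss alone. All module names, layer counts, and dimensions here are assumptions for illustration only; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FragmentExtractor(nn.Module):
    """Cross-attention block: source content queries attend to target-speaker
    mel features, retrieving fine-grained voice fragments (hypothetical module)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, content, target_feat):
        # content:     (B, T_src, d_model) -- projected Wav2Vec 2.0 features (source)
        # target_feat: (B, T_tgt, d_model) -- projected log mel frames (target speaker)
        fused, attn_map = self.cross_attn(query=content, key=target_feat, value=target_feat)
        # attn_map shows which target fragments were selected for each source frame
        return self.norm(content + fused), attn_map

class FragmentVCSketch(nn.Module):
    """Toy end-to-end model in the spirit of FragmentVC; sizes are assumptions."""
    def __init__(self, wav2vec_dim=768, n_mels=80, d_model=512, n_blocks=3):
        super().__init__()
        self.src_proj = nn.Linear(wav2vec_dim, d_model)  # source (content) branch
        self.tgt_proj = nn.Linear(n_mels, d_model)       # target (speaker) branch
        self.extractors = nn.ModuleList(FragmentExtractor(d_model) for _ in range(n_blocks))
        self.smoother = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)         # predict converted mel frames

    def forward(self, src_w2v, tgt_mel):
        q = self.src_proj(src_w2v)
        kv = self.tgt_proj(tgt_mel)
        attn_maps = []
        for blk in self.extractors:
            q, a = blk(q, kv)
            attn_maps.append(a)
        return self.to_mel(self.smoother(q)), attn_maps

# Reconstruction-only training: source content and target mel come from the same
# speaker, so the model learns to rebuild the utterance from retrieved fragments;
# no parallel data and no explicit content/speaker disentanglement are needed.
model = FragmentVCSketch()
src = torch.randn(2, 100, 768)   # dummy Wav2Vec 2.0 features of the source utterance
tgt = torch.randn(2, 120, 80)    # dummy log mel-spectrogram of target utterance(s)
pred_mel, maps = model(src, tgt)
loss = nn.functional.l1_loss(pred_mel, torch.randn(2, 100, 80))  # dummy reconstruction target
loss.backward()
```

At inference time, one would instead feed source content from one speaker and mel frames from a different (possibly unseen) target speaker; inspecting `maps` is how an attention-map analysis like the one described in the abstract could be carried out.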