Paper Title

IQDubbing: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion

Paper Authors

Wendong Gan, Bolong Wen, Ying Yan, Haitao Chen, Zhichao Wang, Hongqiang Du, Lei Xie, Kaixuan Guo, Hai Li

Paper Abstract

Prosody modeling is important but still challenging in expressive voice conversion. Prosody is difficult to model, and other factors entangled with it in speech, such as speaker, environment, and content, should be removed during prosody modeling. In this paper, we present IQDubbing to address this problem for expressive voice conversion. To model prosody, we leverage recent advances in discrete self-supervised speech representation (DSSR). Specifically, a prosody vector is first extracted from a pre-trained VQ-Wav2Vec model; this vector embeds rich prosody information, while most speaker and environment information is effectively removed by quantization. To further filter out information redundant to prosody, such as content and partial speaker information, we propose two kinds of prosody filters to sample prosody from the prosody vector. Experiments show that IQDubbing is superior to baseline and comparison systems in terms of speech quality while maintaining prosody consistency and speaker similarity.
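
To make the DSSR extraction step concrete, below is a minimal sketch of pulling discrete codes from a publicly released pre-trained vq-wav2vec checkpoint with fairseq. The checkpoint path is a placeholder, and treating the quantizer outputs as the prosody representation is an assumption for illustration; the paper's exact prosody-vector pipeline and its two prosody filters are not reproduced here.

```python
# Minimal sketch (not the paper's implementation): extract discrete codes from a
# pre-trained vq-wav2vec model using fairseq. The checkpoint path is a placeholder.
import torch
import fairseq

ckpt_path = "vq-wav2vec.pt"  # placeholder: path to the released vq-wav2vec checkpoint
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0]
model.eval()

# 1 second of 16 kHz audio; in practice load a real waveform (e.g. via torchaudio)
wav = torch.randn(1, 16000)

with torch.no_grad():
    z = model.feature_extractor(wav)                    # dense frame-level features
    z_q, idxs = model.vector_quantizer.forward_idx(z)   # quantized features + code indices

# idxs has shape [batch, frames, groups]: discrete codes per frame. Quantization
# discards much speaker/environment detail while retaining prosodic variation,
# which is the property the abstract relies on for prosody modeling.
print(idxs.shape)
```

How these codes (or the quantized features z_q) are mapped to IQDubbing's prosody vector, and how the two proposed prosody filters sample prosody from it, is described in the paper itself.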
