Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

Title

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

Authors

Kun Zhou, Berrak Sisman, Haizhou Li

Abstract

Emotional voice conversion aims to convert the spectrum and prosody of speech in order to change its emotional pattern, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of the fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation and is hierarchical in nature, we believe that it is more appropriate to model F0 at different temporal scales using a wavelet transform. We propose a CycleGAN network that finds an optimal pseudo pair from non-parallel training data by learning forward and inverse mappings simultaneously with adversarial and cycle-consistency losses. We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales, which describe speech prosody at different time resolutions, for effective F0 conversion. Experimental results show that the proposed framework outperforms the baselines in both objective and subjective evaluations.
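The CWT decomposition of F0 mentioned in the abstract can be illustrated with a minimal sketch. The abstract states only that F0 is decomposed into ten temporal scales; the Mexican hat mother wavelet, the dyadic scale spacing, the kernel truncation, and the names `mexican_hat` and `cwt_f0` are assumptions made here for illustration, not details confirmed by the source. The sketch also assumes the input is a log-F0 contour that has already been interpolated over unvoiced frames and normalized.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet evaluated at times t."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def cwt_f0(log_f0, num_scales=10, base_scale=1.0):
    """Decompose a log-F0 contour into `num_scales` wavelet components.

    Assumption: scales are one octave apart, s_i = base_scale * 2**(i + 1),
    measured in frames. Returns an array of shape (num_scales, len(log_f0)).
    """
    x = np.asarray(log_f0, dtype=float)
    n = x.size
    components = np.empty((num_scales, n))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** (i + 1)
        # Cover about five scale widths per side, but never let the kernel
        # exceed the signal length so that mode="same" returns n samples.
        half = min(int(5 * scale), (n - 1) // 2)
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        # The wavelet is symmetric, so convolution equals correlation here.
        components[i] = np.convolve(x, kernel, mode="same")
    return components


# Usage on a synthetic contour: a slow phrase-level drift plus a faster
# syllable-level oscillation (illustrative data, not from the paper).
frames = np.arange(300)
contour = 0.5 * np.sin(2 * np.pi * frames / 300) + 0.1 * np.sin(2 * np.pi * frames / 20)
components = cwt_f0(contour)
print(components.shape)  # (10, 300)
```

Small scales respond mainly to fast, syllable-level movement and large scales to slow, phrase-level movement, which is the sense in which the components describe prosody at different time resolutions. In a conversion pipeline, such components would be the features mapped between emotions in place of a single linear transform of F0.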
