Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network

Paper Title

Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network

Authors

Ravi Shankar, Hsi-Wei Hsieh, Nicolas Charon, Archana Venkataraman

Abstract

We propose a novel method for emotion conversion in speech based on a chained encoder-decoder-predictor neural network architecture. The encoder constructs a latent embedding of the fundamental frequency (F0) contour and the spectrum, which we regularize using the Large Deformation Diffeomorphic Metric Mapping (LDDMM) registration framework. The decoder uses this embedding to predict the modified F0 contour in a target emotional class. Finally, the predictor uses the original spectrum and the modified F0 contour to generate a corresponding target spectrum. Our joint objective function simultaneously optimizes the parameters of all three model blocks. We show that our method outperforms existing state-of-the-art approaches on both the saliency of emotion conversion and the quality of resynthesized speech. In addition, the LDDMM regularization allows our model to convert phrases that were not present in training, thus providing evidence for out-of-sample generalization.
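
The chained data flow described in the abstract can be sketched in a few lines. This is only an illustrative toy, not the authors' implementation: each block is a single bias-free linear layer, the dimension constants and function names are hypothetical, and the regularizer is a plain squared-norm penalty standing in for the LDDMM term, whose actual form the abstract does not give.

```python
import random

# Hypothetical sizes chosen only for illustration.
F0_DIM = 4        # samples in one F0 contour window
SPEC_DIM = 8      # spectral features per frame
LATENT_DIM = 3    # size of the latent embedding


def init_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_params(rng):
    """One weight matrix per block of the chain."""
    return {
        "encoder": init_matrix(LATENT_DIM, F0_DIM + SPEC_DIM, rng),
        "decoder": init_matrix(F0_DIM, LATENT_DIM, rng),
        "predictor": init_matrix(SPEC_DIM, SPEC_DIM + F0_DIM, rng),
    }


def convert(f0, spectrum, params):
    """Run the encoder -> decoder -> predictor chain on one input."""
    # Encoder: embed the F0 contour together with the spectrum.
    latent = linear(params["encoder"], f0 + spectrum)
    # Decoder: predict the modified F0 contour from the embedding.
    f0_target = linear(params["decoder"], latent)
    # Predictor: combine the original spectrum with the modified F0.
    spectrum_target = linear(params["predictor"], spectrum + f0_target)
    return latent, f0_target, spectrum_target


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(f0_pred, f0_true, spec_pred, spec_true, latent, reg_weight=0.1):
    """Single objective over all three blocks.

    The squared-norm penalty is a placeholder, NOT the LDDMM regularizer.
    """
    penalty = sum(z * z for z in latent) / len(latent)
    return mse(f0_pred, f0_true) + mse(spec_pred, spec_true) + reg_weight * penalty
```

Calling `convert(f0, spectrum, make_params(random.Random(0)))` with lists of length `F0_DIM` and `SPEC_DIM` returns the embedding, the modified F0 contour, and the target spectrum. Because `joint_loss` sums the terms for all three outputs, minimizing it would update the encoder, decoder, and predictor weights together, which is the sense in which the objective is joint.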
