Paper Title
Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS
Paper Authors
Paper Abstract
Expressive text-to-speech (TTS) has shown improved performance in recent years. However, the style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion but may still be interested in controlling the speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder that models the semantic relationship between a text description embedding and a speech style embedding, using a pretrained language model. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. Experimental results show that our model can generate high-quality expressive speech even in unseen styles.
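The abstract does not give implementation details of the bi-modal style encoder or the style loss. The sketch below is only a minimal illustration of the general idea: a frozen BERT-style pretrained language model embeds the free-form style description, a simple GRU reference encoder embeds the speech, and a cosine-similarity objective stands in for the paper's (unspecified) style loss. All module names (`BiModalStyleEncoder`, `style_alignment_loss`), the choice of `bert-base-uncased`, and every hyperparameter are assumptions, not the authors' actual design.

```python
# Hypothetical sketch, not the paper's implementation: a bi-modal style
# encoder whose text branch uses a pretrained language model and whose
# speech branch encodes mel-spectrogram frames, aligned in a shared space.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


class BiModalStyleEncoder(nn.Module):
    def __init__(self, style_dim: int = 128,
                 lm_name: str = "bert-base-uncased",  # assumed LM choice
                 n_mels: int = 80):
        super().__init__()
        # Text branch: frozen pretrained language model + linear projection.
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = AutoModel.from_pretrained(lm_name)
        self.lm.requires_grad_(False)
        self.text_proj = nn.Linear(self.lm.config.hidden_size, style_dim)
        # Speech branch: a GRU reference encoder over mel frames (assumed).
        self.speech_rnn = nn.GRU(n_mels, style_dim, batch_first=True)

    def encode_text(self, descriptions: list[str]) -> torch.Tensor:
        batch = self.tokenizer(descriptions, padding=True, return_tensors="pt")
        hidden = self.lm(**batch).last_hidden_state      # (B, T, H)
        # Mean-pool token states into one description embedding each.
        return self.text_proj(hidden.mean(dim=1))        # (B, style_dim)

    def encode_speech(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (B, frames, n_mels); final GRU state is the style embedding.
        _, state = self.speech_rnn(mels)
        return state.squeeze(0)                          # (B, style_dim)


def style_alignment_loss(text_emb: torch.Tensor,
                         speech_emb: torch.Tensor) -> torch.Tensor:
    # One plausible objective: pull paired text-description and speech-style
    # embeddings together by maximizing their cosine similarity.
    return (1.0 - F.cosine_similarity(text_emb, speech_emb, dim=-1)).mean()


if __name__ == "__main__":
    encoder = BiModalStyleEncoder()
    text_emb = encoder.encode_text(["a cheerful, excited voice"])
    speech_emb = encoder.encode_speech(torch.randn(1, 200, 80))  # dummy mels
    print(style_alignment_loss(text_emb, speech_emb).item())
```

At inference time a setup like this would let a user type a style description in place of reference audio: the text embedding is fed to the TTS decoder as the style condition, which is what makes cross-speaker transfer to unseen styles possible.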