Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation

Paper title

Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation

Authors

Giulia Comini, Goeric Huybrechts, Manuel Sam Ribeiro, Adam Gabrys, Jaime Lorenzo-Trueba

Abstract

The availability of data in expressive styles across languages is limited, and recording sessions are costly and time-consuming. To overcome these issues, we demonstrate how to build low-resource, neural text-to-speech (TTS) voices with only 1 hour of conversational speech, when no other conversational data are available in the same language. Assuming the availability of non-expressive speech data in that language, we propose a 3-step technique: 1) we train an F0-conditioned voice conversion (VC) model as a data augmentation technique; 2) we train an F0 predictor to control the conversational flavour of the voice-converted synthetic data; 3) we train a TTS system that consumes the augmented data. We show that our technique enables F0 controllability, scales across speakers and languages, and is competitive in naturalness with a state-of-the-art baseline model, another augmentation-based method that does not make use of F0 information.
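
The data flow of the three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Utterance` type and the functions `predict_f0`, `convert_voice`, and `build_tts_training_set` are hypothetical names, the two models are replaced by toy stand-ins, and the sketch assumes that the non-expressive utterances are the ones being voice-converted, which the abstract implies but does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    text: str
    speaker: str
    style: str                 # "neutral" or "conversational"
    f0: List[float]            # frame-level F0 contour in Hz
    synthetic: bool = False    # True for voice-converted (augmented) data


def predict_f0(text: str, n_frames: int, base_hz: float = 200.0) -> List[float]:
    """Toy stand-in for step 2, the F0 predictor (hypothetical).

    A real predictor would be a trained model conditioned on the text.
    Here we ignore the text and return a linearly falling contour so the
    example stays runnable.
    """
    denom = max(n_frames - 1, 1)
    return [base_hz * (1.1 - 0.2 * i / denom) for i in range(n_frames)]


def convert_voice(source: Utterance, target_speaker: str,
                  f0: List[float]) -> Utterance:
    """Toy stand-in for step 1, the F0-conditioned VC model (hypothetical).

    Keeps the text, swaps the speaker identity, and imposes the F0 contour
    it is conditioned on.
    """
    if len(f0) != len(source.f0):
        raise ValueError("F0 contour must have one value per source frame")
    return Utterance(source.text, target_speaker, "conversational",
                     list(f0), synthetic=True)


def build_tts_training_set(recorded: List[Utterance],
                           neutral: List[Utterance],
                           target_speaker: str) -> List[Utterance]:
    """Step 3 input: recorded conversational data plus augmented data."""
    augmented = [
        convert_voice(u, target_speaker, predict_f0(u.text, len(u.f0)))
        for u in neutral
    ]
    return list(recorded) + augmented
```

In this sketch the F0 predictor is the only place where the conversational flavour enters the synthetic data, which mirrors the role the abstract assigns to it. The TTS model itself is left out, since the abstract gives no detail about its architecture.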
