Paper Title


Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

Authors

Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Georgia Maniati, Panos Kakoulidis, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

Abstract

This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC) to be performed by the same model thanks to its inherent disentanglement of linguistic content and speaker identity. When used in a cross-lingual setting, acoustic features are initially produced with a native speaker of the target language, and voice conversion is then applied by the same model to convert these features to the target speaker's voice. We verify through objective and subjective evaluations that our method offers benefits compared to baseline cross-lingual synthesis. We also present positive results in low-resource scenarios by including speakers with an average of 7.5 minutes of speech each.
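The key property the abstract relies on is the invertibility of the flow decoder: because the flow maps a speaker-independent latent (the linguistic content) to acoustic features under speaker conditioning, voice conversion is simply inverting with one speaker's conditioning and re-running the forward pass with another's. A minimal toy sketch of that two-step pipeline, using a hypothetical one-layer affine flow in place of the paper's deep conditional flow (all class and variable names here are illustrative, not from the paper):

```python
import numpy as np

class AffineSpeakerFlow:
    """Toy invertible flow: features = z * exp(log_scale[spk]) + shift[spk].
    A hypothetical stand-in for the paper's normalizing-flow decoder,
    which is a deep conditional flow over acoustic frames."""
    def __init__(self, dim, n_speakers, seed=0):
        rng = np.random.default_rng(seed)
        self.log_scale = rng.normal(size=(n_speakers, dim)) * 0.1
        self.shift = rng.normal(size=(n_speakers, dim))

    def forward(self, z, speaker):
        # Latent (linguistic content) -> speaker-conditioned acoustic features.
        return z * np.exp(self.log_scale[speaker]) + self.shift[speaker]

    def inverse(self, x, speaker):
        # Acoustic features -> speaker-independent latent.
        return (x - self.shift[speaker]) * np.exp(-self.log_scale[speaker])

    def convert(self, x, src_speaker, tgt_speaker):
        # Voice conversion: strip the source speaker's characteristics,
        # then re-synthesize with the target speaker's conditioning.
        return self.forward(self.inverse(x, src_speaker), tgt_speaker)

# Cross-lingual pipeline sketch:
# 1) synthesize with a native speaker of the target language
#    (preserving its pronunciation), then
# 2) voice-convert those features to the target speaker.
flow = AffineSpeakerFlow(dim=80, n_speakers=4)
z_content = np.random.default_rng(1).normal(size=80)  # linguistic-content latent
native_feats = flow.forward(z_content, speaker=0)     # step 1: native-speaker TTS
target_feats = flow.convert(native_feats, src_speaker=0, tgt_speaker=3)  # step 2: VC
```

Inverting `target_feats` with the target speaker's conditioning recovers `z_content` exactly, which is the disentanglement argument in miniature: the latent carries the content, and only the conditioning changes the voice.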
