AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN

Paper title

AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN

Authors

Zewang Zhang, Qiao Tian, Heng Lu, Ling-Hui Chen, Shan Liu

Abstract

This paper investigates how to leverage a DurIAN-based average model so that a new speaker can achieve both accurate pronunciation and fluent cross-lingual speech from very limited monolingual data. A weakness of recently proposed end-to-end text-to-speech (TTS) systems is that robust alignment is hard to achieve, which prevents them from scaling well to very limited data. To address this issue, we introduce AdaDurIAN, which trains an improved DurIAN-based average model and applies it to few-shot learning with a speaker-independent content encoder shared across speakers. Several few-shot learning tasks in our experiments show that AdaDurIAN outperforms the baseline end-to-end system by a large margin. Subjective evaluations also show that AdaDurIAN yields a higher mean opinion score (MOS) for naturalness and is preferred more often for speaker similarity. In addition, we apply AdaDurIAN to emotion transfer tasks and demonstrate its promising performance.
