Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

Paper Title

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

Authors

Li, Yang; Yu, Cheng; Sun, Guangzhi; Jiang, Hua; Sun, Fanglei; Zu, Weiqin; Wen, Ying; Yang, Yang; Wang, Jun

Abstract

Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems. In this paper, a cross-utterance conditional VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme by conditioning on acoustic features, speaker information, and text features obtained from both past and future sentences. At inference time, instead of the standard Gaussian distribution used by VAE, CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information, which allows the prosody features generated by the TTS system to be related to the context and is more similar to how humans naturally produce prosody. The performance of CUC-VAE is evaluated via a qualitative listening test for naturalness, intelligibility and quantitative measurements, including word error rates and the standard deviation of prosody attributes. Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity with clear margins.
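
The inference-time idea in the abstract, sampling latent prosody features from a context-conditioned prior rather than from a standard Gaussian, can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in and not the authors' implementation: `conditional_prior` is a toy linear map in place of a learned network, the context vector stands in for cross-utterance text features, and the dimensions are arbitrary. The KL term is the standard closed form for diagonal Gaussians, which a conditional VAE would typically use to pull the posterior towards such a prior during training.

```python
import math
import random


def sample_diag_gaussian(mean, log_var, rng):
    """Reparameterised draw: z = mean + sigma * eps, with eps ~ N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


def kl_diag_gaussians(mean_q, log_var_q, mean_p, log_var_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mean_q, log_var_q, mean_p, log_var_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def conditional_prior(context, w_mean, w_log_var):
    """Hypothetical linear map from a context vector to prior mean and log-variance.

    Assumption: a real system would use a learned network here.
    """
    mean = [sum(c * w for c, w in zip(context, row)) for row in w_mean]
    log_var = [sum(c * w for c, w in zip(context, row)) for row in w_log_var]
    return mean, log_var


rng = random.Random(0)
context_dim, latent_dim = 8, 2  # arbitrary illustrative sizes

# Stand-in for features extracted from the surrounding (past and future) sentences.
context = [rng.gauss(0.0, 1.0) for _ in range(context_dim)]
w_mean = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(context_dim)] for _ in range(latent_dim)]
w_log_var = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(context_dim)] for _ in range(latent_dim)]

# Context-conditioned prior: its parameters depend on the surrounding text.
prior_mean, prior_log_var = conditional_prior(context, w_mean, w_log_var)
z_context = sample_diag_gaussian(prior_mean, prior_log_var, rng)

# Baseline: the standard Gaussian prior N(0, I) of a plain VAE.
zeros = [0.0] * latent_dim
z_standard = sample_diag_gaussian(zeros, zeros, rng)

# How far the context-conditioned prior sits from N(0, I).
kl_to_standard = kl_diag_gaussians(prior_mean, prior_log_var, zeros, zeros)
```

In this sketch a different context vector yields a different prior, so the sampled latent varies with the surrounding text, whereas `z_standard` is drawn from the same distribution regardless of context.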
