Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Paper Title

Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Authors

Hyungchan Yoon, Seyun Um, Changwhan Kim, Hong-Goo Kang

Abstract

To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrograms). However, their generation quality is unsatisfactory because these representations lack speech variances. In this paper, we improve TTS performance by adding prosody embeddings to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms; during inference, we estimate these embeddings from text using generative adversarial networks (GANs). The prosody embeddings have complex distributions due to the dynamic nature of speech, and GANs let us estimate them reliably and quickly. We also show that the prosody embeddings work as efficient features for learning a robust alignment between text and acoustic features. In comparative experiments, our proposed model surpasses several publicly available models while using fewer parameters and lower computational complexity.
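The training/inference split described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the time-averaging "reference encoder", the single linear "generator", the logistic discriminator, and the standard non-saturating GAN loss are all hypothetical stand-ins for the learned networks and objective used in the paper.

```python
import math
import random


def reference_prosody(mel):
    """Training path: summarize a mel-spectrogram (frames x bins) as one vector.

    Hypothetical stand-in: a real reference encoder would be a learned network;
    here we simply average each mel bin over time.
    """
    n_frames = len(mel)
    n_bins = len(mel[0])
    return [sum(frame[b] for frame in mel) / n_frames for b in range(n_bins)]


def generate_prosody(text_features, weights, rng):
    """Inference path: map text features plus Gaussian noise to an embedding.

    `weights` is a hypothetical single linear layer (one row per output dim).
    """
    noisy = [x + rng.gauss(0.0, 0.1) for x in text_features]
    return [sum(w * x for w, x in zip(row, noisy)) for row in weights]


def discriminator(embedding, d_weights):
    """Hypothetical logistic discriminator: probability the embedding is real."""
    score = sum(w * e for w, e in zip(d_weights, embedding))
    return 1.0 / (1.0 + math.exp(-score))


def gan_losses(real_emb, fake_emb, d_weights):
    """Non-saturating GAN losses (an assumption; the paper's loss may differ)."""
    eps = 1e-12  # guards against log(0)
    p_real = discriminator(real_emb, d_weights)
    p_fake = discriminator(fake_emb, d_weights)
    d_loss = -math.log(p_real + eps) - math.log(1.0 - p_fake + eps)
    g_loss = -math.log(p_fake + eps)
    return d_loss, g_loss


def prosody_embedding(training, mel=None, text_features=None, weights=None, rng=None):
    """Use the reference embedding in training and the generated one at inference."""
    if training:
        return reference_prosody(mel)
    return generate_prosody(text_features, weights, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    mel = [[0.0, 2.0], [2.0, 4.0]]        # 2 frames x 2 mel bins (toy values)
    text_features = [1.0, -1.0]           # toy text-encoder output
    weights = [[1.0, 0.0], [0.0, 1.0]]    # identity "generator" layer
    d_weights = [0.5, 0.5]

    real = prosody_embedding(True, mel=mel)
    fake = prosody_embedding(False, text_features=text_features,
                             weights=weights, rng=rng)
    d_loss, g_loss = gan_losses(real, fake, d_weights)
    print(real, fake, d_loss, g_loss)
```

In this sketch the generator never sees the mel-spectrogram. The reference embedding serves only as the "real" sample that the discriminator compares against, which mirrors the abstract's description of extracting embeddings from audio during training and estimating them from text at inference.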
