Paper Title

Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Authors

Hyungchan Yoon, Seyun Um, Changwhan Kim, Hong-Goo Kang

Abstract

To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrograms). However, their generation quality is unsatisfactory because these representations lack speech variance. In this paper, we improve TTS performance by adding prosody embeddings to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms; during inference, we estimate these embeddings from text using generative adversarial networks (GANs). Using GANs, we quickly and reliably estimate the prosody embeddings, which have complex distributions due to the dynamic nature of speech. We also show that the prosody embeddings serve as efficient features for learning a robust alignment between text and acoustic features. In comparative experiments, our proposed model surpasses several publicly available models while using fewer parameters and less computational complexity.
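
The training/inference split described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the frame-averaging "reference encoder", the linear "generator", and the choice of a least-squares GAN objective are all hypothetical stand-ins for whatever the authors actually use.

```python
def reference_prosody(mel_frames):
    """Hypothetical reference encoder: average mel frames into one embedding."""
    dim = len(mel_frames[0])
    return [sum(frame[d] for frame in mel_frames) / len(mel_frames)
            for d in range(dim)]


def predicted_prosody(text_hidden, weights):
    """Hypothetical generator: a linear map from a text encoding to an embedding."""
    return [sum(w * h for w, h in zip(row, text_hidden)) for row in weights]


def select_prosody(training, mel_frames, text_hidden, weights):
    """Use the mel-derived embedding in training, the text-derived one at inference."""
    if training:
        return reference_prosody(mel_frames)
    return predicted_prosody(text_hidden, weights)


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares GAN loss (an assumed objective): push real scores to 1, fake to 0."""
    return 0.5 * ((d_real - 1.0) ** 2 + d_fake ** 2)


def lsgan_generator_loss(d_fake):
    """Least-squares GAN loss (an assumed objective): push fake scores toward 1."""
    return 0.5 * (d_fake - 1.0) ** 2


# Toy inputs: two mel frames with two bins each, and a two-dimensional text encoding.
mel = [[1.0, 3.0], [3.0, 5.0]]
text_hidden = [1.0, 2.0]
weights = [[0.5, 0.5], [1.0, 0.0]]

train_embedding = select_prosody(True, mel, text_hidden, weights)
infer_embedding = select_prosody(False, mel, text_hidden, weights)
```

In this sketch the adversarial losses would be used to train the text-side generator so that its outputs become hard to distinguish from the mel-derived reference embeddings, which is the role the abstract assigns to the GAN.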
