Paper Title

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

Authors

Chenfeng Miao, Shuang Liang, Zhencheng Liu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao

Abstract

In this work, we address the Text-to-Speech (TTS) task by proposing a non-autoregressive architecture called EfficientTTS. Unlike the dominant non-autoregressive TTS models, which require external aligners for training, EfficientTTS optimizes all of its parameters with a stable, end-to-end training procedure, while allowing high-quality speech to be synthesized in a fast and efficient manner. EfficientTTS is motivated by a new monotonic alignment modeling approach (also introduced in this work), which imposes monotonic constraints on the sequence alignment with almost no increase in computation. By combining EfficientTTS with different feed-forward network structures, we develop a family of TTS models, including both text-to-melspectrogram and text-to-waveform networks. We experimentally show that the proposed models significantly outperform counterpart models such as Tacotron 2 and Glow-TTS in terms of speech quality, training efficiency, and synthesis speed, while still producing speech with strong robustness and great diversity. In addition, we demonstrate that the proposed approach can be easily extended to autoregressive models such as Tacotron 2.
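To make the abstract's "monotonic constraints on the sequence alignment" idea concrete, here is a minimal sketch of one way to reduce a soft text-to-frame alignment to a monotonically non-decreasing expected-index vector. This is not the paper's code: the function name `monotonic_expected_index`, the ReLU-and-cumsum reconstruction of the increments, and the final rescaling to the input length are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def monotonic_expected_index(energy):
    """Illustrative sketch: map attention scores to a monotonic index vector.

    energy: (T_out, T_in) unnormalized scores between output frames and
            input tokens.
    Returns a non-decreasing vector pi of length T_out whose j-th entry is
    the soft input position aligned to output frame j.
    """
    # Soft alignment over input positions for every output frame.
    alpha = torch.softmax(energy, dim=-1)                      # (T_out, T_in)

    # Expected input index per output frame: pi_j = sum_i alpha[j, i] * i.
    positions = torch.arange(energy.size(-1),
                             dtype=energy.dtype, device=energy.device)
    pi = alpha @ positions                                      # (T_out,)

    # Enforce monotonicity: keep only non-negative increments, rebuild pi
    # cumulatively, then rescale so it ends at the last input index.
    delta = F.relu(pi[1:] - pi[:-1])
    pi_mono = torch.cat([pi.new_zeros(1), torch.cumsum(delta, dim=0)])
    pi_mono = pi_mono / (pi_mono[-1] + 1e-8) * (energy.size(-1) - 1)
    return pi_mono

# Toy example: 6 output frames attending over 4 input tokens.
scores = torch.randn(6, 4)
print(monotonic_expected_index(scores))  # non-decreasing values in [0, 3]
```

Because every step is differentiable, a constraint of this kind can be applied inside end-to-end training without an external aligner, which is the property the abstract emphasizes.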
