AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment

Paper title

AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment

Authors

Zeng, Zhen, Wang, Jianzong, Cheng, Ning, Xia, Tian, Xiao, Jing

Abstract

Targeting both high efficiency and performance, we propose AlignTTS to predict the mel-spectrum in parallel. AlignTTS is based on a Feed-Forward Transformer which generates the mel-spectrum from a sequence of characters, and the duration of each character is determined by a duration predictor. Instead of adopting the attention mechanism in Transformer TTS to align text to mel-spectrum, the alignment loss is presented to consider all possible alignments in training by use of dynamic programming. Experiments on the LJSpeech dataset show that our model achieves not only state-of-the-art performance, outperforming Transformer TTS by 0.03 in mean opinion score (MOS), but also high efficiency, running more than 50 times faster than real time.

Paper title