Paper Title
End-to-End Adversarial Text-to-Speech
Paper Authors
Paper Abstract
Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high-fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5-point scale, which is comparable to that of state-of-the-art models relying on multi-stage training and additional supervision.
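The soft dynamic time warping used in the spectrogram prediction loss can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: it operates on 1-D sequences with squared frame distances (the paper applies the loss to mel-spectrogram frames), and the function names and the `gamma` smoothing parameter are assumptions for exposition. The key idea is that replacing the hard `min` in classic DTW with a soft minimum makes the alignment cost differentiable, so small timing shifts in the generated audio are penalised smoothly rather than forced to match frame-for-frame.

```python
import numpy as np

def soft_min(values, gamma):
    # Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma))).
    # As gamma -> 0 this approaches the hard minimum.
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()  # log-sum-exp trick for numerical stability
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(x, y, gamma=1.0):
    # Soft-DTW alignment cost between two 1-D sequences,
    # using squared differences as the frame-wise distance.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    D = (x[:, None] - y[None, :]) ** 2  # pairwise frame distances
    R = np.full((n + 1, m + 1), np.inf)  # accumulated soft costs
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Soft recursion over the three DTW predecessors:
            # insertion, deletion, and match.
            R[i, j] = D[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]
```

With a small `gamma`, identical sequences score near zero while misaligned ones accumulate cost, and because every operation is differentiable the loss can backpropagate through the generator.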