Paper title
Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments
Authors
Abstract
End-to-end (E2E) systems synthesise high-quality speech, but this typically requires a large amount of data. As E2E synthesis progressed from Tacotron to FastSpeech2, it became evident that features representing prosody, particularly sub-word durations, are important for error-free synthesis. Variants of FastSpeech use a teacher model or forced alignments for training. This paper uses signal processing cues in tandem with forced alignment to produce accurate phone boundaries for the training data. As a result of better duration modelling, good-quality synthesisers are developed. Evaluations indicate that systems developed using the proposed signal processing-aided approach are better than systems developed using other alignment approaches, especially in low-resource scenarios. Our systems also outperform the existing best TTS systems available for 13 Indian languages.
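The link between phone boundaries and duration modelling can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that forced alignment yields boundary times in seconds, that some signal-processing detector yields candidate cue times, and that a boundary moves to the nearest cue only if one lies within a tolerance. The function names, the 20 ms tolerance, and the hop size are all hypothetical.

```python
from bisect import bisect_left


def snap_boundaries(boundaries, cues, tolerance=0.02):
    """Move each forced-alignment boundary (seconds) to the nearest cue.

    A boundary is moved only if a cue lies within `tolerance` seconds;
    otherwise it is kept unchanged. The tolerance is an illustrative value.
    The sketch assumes phones are longer than the tolerance, so that two
    boundaries do not collapse onto the same cue.
    """
    cues = sorted(cues)
    snapped = []
    for b in boundaries:
        i = bisect_left(cues, b)
        # Only the cues immediately before and after b can be the nearest.
        candidates = cues[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda c: abs(c - b), default=None)
        if nearest is not None and abs(nearest - b) <= tolerance:
            snapped.append(nearest)
        else:
            snapped.append(b)
    return snapped


def frame_durations(boundaries, hop_seconds=0.0125):
    """Convert boundary times (utterance start, phone ends) to frame counts.

    Returns one integer per phone: the kind of duration target that a
    FastSpeech2-style duration predictor is trained on. The hop size is
    an assumption, not a value from the paper.
    """
    frames = [round(b / hop_seconds) for b in boundaries]
    return [end - start for start, end in zip(frames, frames[1:])]


# Hypothetical three-phone utterance: start time plus three phone end times.
aligned = [0.0, 0.10, 0.25, 0.40]
cue_times = [0.11, 0.30]
refined = snap_boundaries(aligned, cue_times)
durations = frame_durations(refined, hop_seconds=0.01)
```

In this toy case only the second boundary has a cue within the tolerance, so only the first two phone durations change. A shift of a few frames per phone is the kind of difference that alignment quality makes to the duration targets.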