Paper Title
PPSpeech: Phrase based Parallel End-to-End TTS System
Authors
Abstract
Current end-to-end autoregressive TTS systems (e.g. Tacotron 2) have outperformed traditional parallel approaches in the quality of synthesized speech. However, they also introduce new problems. Because of their autoregressive nature, inference time is proportional to the length of the text, which poses a great challenge for online serving. Moreover, the style of the synthesized speech becomes unstable and may change noticeably between sentences. In this paper, we propose a Phrase based Parallel End-to-End TTS System (PPSpeech) to address these issues. PPSpeech uses an autoregressive approach within each phrase and processes different phrases in parallel. In this way, we achieve both high quality and high efficiency. In addition, we propose an acoustic embedding and a text context embedding as conditions of the encoder to maintain continuity and prevent abrupt changes in style or timbre. Experiments show that PPSpeech synthesizes speech much faster than sentence-level autoregressive Tacotron 2 when a sentence has more than 5 phrases, and the speed advantage grows with sentence length. Subjective experiments show that the proposed system, with acoustic embedding and context embedding as conditions, makes the style transition across sentences gradual and natural, clearly outperforming Global Style Token (GST) in MOS.
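The efficiency argument in the abstract can be illustrated with a toy cost model. The sketch below is not the PPSpeech implementation: `decode_phrase`, `synthesize_parallel`, and the step-counting helpers are hypothetical names, and the "decoder" is a stand-in that only mimics the dependency of each step on the previous one. It assumes that one decoder step costs the same everywhere and ignores any fixed overhead of phrase segmentation or embedding computation.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_phrase(phrase_tokens):
    """Toy autoregressive decoder (hypothetical): one step per token,
    where each step depends on the output of the previous step."""
    frames = []
    prev = 0
    for tok in phrase_tokens:
        prev = prev + len(tok)  # stand-in for a decoder step conditioned on prev
        frames.append(prev)
    return frames


def synthesize_parallel(phrases):
    """Decode every phrase independently and concatenate results in order."""
    with ThreadPoolExecutor() as pool:
        per_phrase = list(pool.map(decode_phrase, phrases))  # map keeps input order
    return [frame for frames in per_phrase for frame in frames]


def sequential_steps(phrases):
    """Sentence-level autoregression: steps grow with total sentence length."""
    return sum(len(p) for p in phrases)


def parallel_steps(phrases):
    """Phrase-level parallelism: the critical path is the longest phrase."""
    return max(len(p) for p in phrases)


# Illustrative input only; the phrase split is an assumption, not real data.
phrases = [
    ["the", "speed"],
    ["advantage", "increases"],
    ["with", "sentence", "length"],
]

if __name__ == "__main__":
    print("sequential steps:", sequential_steps(phrases))
    print("parallel steps:  ", parallel_steps(phrases))
    print("frames produced: ", len(synthesize_parallel(phrases)))
```

Under these assumptions, adding more phrases increases the sequential step count but leaves the parallel critical path bounded by the longest phrase, which matches the abstract's claim that the speed advantage grows with sentence length.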