Paper Title
Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model
Paper Authors
Paper Abstract
This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of the output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality: it is challenging to produce high-quality speech in a low-latency setup that makes little use of the unobserved future portion of the sentence (hereafter, "lookahead"). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model, taking future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading, and it uses a pretrained GPT2, which captures large-scale linguistic knowledge, for lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than a method that takes only observed information into account, and 2) achieves speech quality equivalent to that of a method that waits for the future context to be observed.
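As a rough illustration of the pseudo-lookahead idea described above, the sketch below uses a pretrained GPT-2 (via the Hugging Face `transformers` library) to generate a plausible continuation of the text observed so far, which could then stand in for the unobserved future context during synthesis. This is a minimal sketch under assumed settings: the `gpt2` checkpoint, greedy decoding, the `max_new_tokens` budget, and the `pseudo_lookahead` helper are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of pseudo-lookahead generation with a pretrained GPT-2.
# Assumption: the "gpt2" checkpoint and greedy decoding are stand-ins for
# whatever model and decoding strategy the paper actually uses.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def pseudo_lookahead(observed_text: str, max_new_tokens: int = 10) -> str:
    """Generate a pseudo lookahead for the text observed so far."""
    inputs = tokenizer(observed_text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding keeps the sketch deterministic
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the generated continuation, dropping the observed prefix.
    continuation = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(continuation, skip_special_tokens=True)

# The incremental synthesizer would condition on the observed text plus this
# pseudo lookahead, keeping only the audio for the current linguistic unit.
print(pseudo_lookahead("Incremental TTS is generally subject to"))
```

In this framing, the pseudo lookahead supplies future contextual information for prosody without waiting for the real continuation to arrive, which is how the method avoids adding latency.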