Paper Title
Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model
Paper Authors
Paper Abstract
This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of the output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality: it is challenging to produce high-quality speech in a low-latency setup that makes little use of the unobserved future portion of the sentence (hereafter, "lookahead"). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model, taking future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading, and it uses a pretrained GPT2, which captures large-scale linguistic knowledge, for lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than a method that takes only observed information into account, and 2) achieves speech quality equivalent to that of a method that waits for the future context to be observed.
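As a rough illustration of the pseudo-lookahead idea described above, the sketch below uses a pretrained GPT-2 (via the Hugging Face `transformers` library) to generate a plausible continuation of the text observed so far, which could then stand in for the unobserved future context during synthesis. This is a minimal sketch under assumed settings: the `gpt2` checkpoint, greedy decoding, the `max_new_tokens` budget, and the `pseudo_lookahead` helper are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of pseudo-lookahead generation with a pretrained GPT-2.
# Assumption: the "gpt2" checkpoint and greedy decoding are stand-ins for
# whatever model and decoding strategy the paper actually uses.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def pseudo_lookahead(observed_text: str, max_new_tokens: int = 10) -> str:
    """Generate a pseudo lookahead for the text observed so far."""
    inputs = tokenizer(observed_text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding keeps the sketch deterministic
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the generated continuation, dropping the observed prefix.
    continuation = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(continuation, skip_special_tokens=True)

# The incremental synthesizer would condition on the observed text plus this
# pseudo lookahead, keeping only the audio for the current linguistic unit.
print(pseudo_lookahead("Incremental TTS is generally subject to"))
```

In this framing, the pseudo lookahead supplies future contextual information for prosody without waiting for the real continuation to arrive, which is how the method avoids adding latency.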