ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS

Paper title

ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS

Authors

Xue, Liumeng; Soong, Frank K.; Zhang, Shaofei; Xie, Lei

Abstract

Recent advancements in neural end-to-end TTS models have shown high-quality, natural synthesized speech in a conventional sentence-based TTS. However, it is still challenging to reproduce similar high quality when a whole paragraph is considered in TTS, where a large amount of contextual information needs to be considered in building a paragraph-based TTS model. To alleviate the difficulty in training, we propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training. Three sub-modules, including linguistics-aware, prosody-aware and sentence-position networks, are trained together with a modified Tacotron2. Specifically, to learn the information embedded in a paragraph and the relations among the corresponding component sentences, we utilize linguistics-aware and prosody-aware networks. The information in a paragraph is captured by encoders and the inter-sentence information in a paragraph is learned with multi-head attention mechanisms. The relative sentence position in a paragraph is explicitly exploited by a sentence-position network. Trained on a storytelling audio-book corpus (4.08 hours), recorded by a female Mandarin Chinese speaker, the proposed TTS model demonstrates that it can produce rather natural and good-quality speech paragraph-wise. The cross-sentence contextual information, such as break and prosodic variations between consecutive sentences, can be better predicted and rendered than the sentence-based model. Tested on paragraph texts, of which the lengths are similar to, longer than, or much longer than the typical paragraph length of the training data, the TTS speech produced by the new model is consistently preferred over the sentence-based model in subjective tests and confirmed in objective measures.
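The abstract names two cross-sentence mechanisms: multi-head attention over sentence-level information and an explicit relative sentence position. The sketch below illustrates both ideas in plain Python and is not the authors' implementation. It makes several assumptions that do not come from the abstract: each sentence is already summarized as a fixed-size vector, the learned query/key/value projections are replaced by the identity, and the relative position is encoded as a scalar in [0, 1]. All function names are hypothetical.

```python
import math


def relative_positions(num_sentences):
    """Map each sentence index to a position in [0, 1] within its paragraph.

    Assumption: a linear index/(N-1) encoding; the abstract does not say
    how the sentence-position network represents position.
    """
    if num_sentences < 1:
        raise ValueError("a paragraph needs at least one sentence")
    if num_sentences == 1:
        return [0.0]
    return [i / (num_sentences - 1) for i in range(num_sentences)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over per-sentence vectors."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return outputs


def multi_head_self_attention(sentence_vectors, num_heads):
    """Split each vector into heads, attend per head, then concatenate.

    Learned projections are omitted (identity) to keep the sketch short.
    """
    dim = len(sentence_vectors[0])
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = dim // num_heads
    merged = [[] for _ in sentence_vectors]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        part = [v[lo:hi] for v in sentence_vectors]
        for row, out in zip(merged, attend(part, part, part)):
            row.extend(out)
    return merged


# Three made-up sentence vectors standing in for one paragraph.
paragraph = [
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [1.0, 1.0, 0.0, 1.0],
]
context = multi_head_self_attention(paragraph, num_heads=2)
positions = relative_positions(len(paragraph))  # [0.0, 0.5, 1.0]
```

Each row of `context` is a weighted mix of all sentence vectors in the paragraph, which is the sense in which attention lets one sentence's representation depend on its neighbors. In a trained model the mixing weights would come from learned projections rather than raw dot products.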
