Paper Title
Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Authors
Abstract
Prosodic boundaries play an important role in the naturalness and readability of text-to-speech (TTS) synthesis. However, acquiring prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. The model is pre-trained on text and speech data separately and then jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. Experimental results from both automatic and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of the automatic prosodic boundary annotations is comparable to that of human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems trained with manually annotated ones.
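To make the {speech, text, prosody} triplet format concrete, the following is a minimal sketch of how one fine-tuning sample might be represented. It is an illustration only, not the authors' data schema: the class name `ProsodyTriplet`, its field names, the file name, and the three-level label inventory (`NONE`, `MINOR`, `MAJOR`) are all assumptions, since the abstract does not specify the boundary levels or the storage format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label inventory; the abstract does not name the boundary levels.
BOUNDARY_LABELS = ("NONE", "MINOR", "MAJOR")


@dataclass
class ProsodyTriplet:
    """One {speech, text, prosody} sample (field names are assumptions)."""
    speech_path: str       # path to the audio recording
    tokens: List[str]      # the text, split into tokens
    boundaries: List[str]  # boundary label that follows each token

    def __post_init__(self) -> None:
        # Each token must carry exactly one boundary label.
        if len(self.tokens) != len(self.boundaries):
            raise ValueError("need exactly one boundary label per token")
        unknown = set(self.boundaries) - set(BOUNDARY_LABELS)
        if unknown:
            raise ValueError(f"unknown boundary labels: {sorted(unknown)}")


def render(sample: ProsodyTriplet) -> str:
    """Interleave tokens with visible markers for the non-empty boundaries."""
    marks = {"NONE": "", "MINOR": " |", "MAJOR": " ||"}
    return " ".join(
        tok + marks[label] for tok, label in zip(sample.tokens, sample.boundaries)
    )


sample = ProsodyTriplet(
    speech_path="utt_0001.wav",  # hypothetical file name
    tokens=["the", "weather", "is", "nice", "today"],
    boundaries=["NONE", "MINOR", "NONE", "NONE", "MAJOR"],
)
print(render(sample))  # the weather | is nice today ||
```

In this sketch, an annotation model would take `speech_path` and `tokens` as input and predict `boundaries`; a text-only baseline would predict the same labels from `tokens` alone.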