Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

Paper Title

Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features

Authors

Vioni, Alexandra; Maniati, Georgia; Ellinas, Nikolaos; Sung, June Sig; Hwang, Inchul; Chalamandaris, Aimilios; Tsiakoulis, Pirros

Abstract

Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme level F0 and duration features as prosodic inputs, as well as Tacotron encoder outputs, POS tags and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features are beneficial in the MOS prediction task, by improving the predicted MOS scores' correlation with the ground truths, both at utterance-level and system-level predictions.
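The abstract evaluates predictors by how well predicted MOS correlates with ground truth at both the utterance level and the system level. The following is a minimal sketch of that distinction, not the authors' evaluation code. It assumes Pearson correlation (the abstract does not name a coefficient) and assumes system-level scores are obtained by averaging utterance scores per TTS system. The function names and the toy scores are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def system_means(system_ids, scores):
    """Average utterance scores per system (assumed aggregation rule)."""
    buckets = defaultdict(list)
    for sid, score in zip(system_ids, scores):
        buckets[sid].append(score)
    return {sid: sum(v) / len(v) for sid, v in buckets.items()}


def mos_correlations(system_ids, true_mos, pred_mos):
    """Return (utterance-level, system-level) Pearson correlations."""
    utt_r = pearson(true_mos, pred_mos)
    true_sys = system_means(system_ids, true_mos)
    pred_sys = system_means(system_ids, pred_mos)
    order = sorted(true_sys)  # align the two dicts on the same system order
    sys_r = pearson([true_sys[s] for s in order],
                    [pred_sys[s] for s in order])
    return utt_r, sys_r


# Hypothetical toy data: three systems, two utterances each.
system_ids = ["A", "A", "B", "B", "C", "C"]
true_mos = [3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
pred_mos = [3.5, 3.5, 2.5, 2.5, 4.5, 4.5]

utt_r, sys_r = mos_correlations(system_ids, true_mos, pred_mos)
print(f"utterance-level r = {utt_r:.3f}, system-level r = {sys_r:.3f}")
```

In this toy data the predictor assigns each system its correct average but ignores differences between utterances of the same system, so the system-level correlation is higher than the utterance-level one. Averaging per system removes within-system variation, which is one reason the two levels are usually reported separately.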
