Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling

Paper Title

Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling

Authors

Yeunju Choi, Youngmoon Jung, Hoirin Kim

Abstract

While deep learning has made impressive progress in speech synthesis and voice conversion, the assessment of the synthesized speech is still carried out by human participants. Several recent papers have proposed deep-learning-based assessment models and shown the potential to automate the speech quality assessment. To improve the previously proposed assessment model, MOSNet, we propose three models using cluster-based modeling methods: using a global quality token (GQT) layer, using an Encoding Layer, and using both of them. We perform experiments using the evaluation results of the Voice Conversion Challenge 2018 to predict the mean opinion score of synthesized speech and similarity score between synthesized speech and reference speech. The results show that the GQT layer helps to predict human assessment better by automatically learning the useful quality tokens for the task and that the Encoding Layer helps to utilize frame-level scores more precisely.
