A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality

Paper Title

A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality

Authors

Alessandro Ragano, Emmanouil Benetos, Michael Chinen, Helard B. Martinez, Chandan K. A. Reddy, Jan Skoglund, Andrew Hines

Abstract

Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors, but some aspects related to the data remain unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS Challenge dataset. Results show that SSL-based models achieve the highest correlation and the lowest mean squared error compared to supervised models. The key point of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models, since potential issues hidden in the data could bias the evaluated performance.
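
The abstract compares predictors by correlation and mean squared error. As a minimal sketch of how those two statistics can be computed from listener MOS and predicted MOS, the Python below uses only the standard library. The function names, the choice of Pearson correlation, and the toy scores are illustrative assumptions, not taken from the paper.

```python
import math


def mean_squared_error(true_mos, predicted_mos):
    """Average squared difference between listener MOS and predicted MOS."""
    if not true_mos or len(true_mos) != len(predicted_mos):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_mos, predicted_mos)]
    return sum(squared) / len(squared)


def pearson_correlation(true_mos, predicted_mos):
    """Linear (Pearson) correlation between listener MOS and predicted MOS."""
    if not true_mos or len(true_mos) != len(predicted_mos):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_mos)
    mean_t = sum(true_mos) / n
    mean_p = sum(predicted_mos) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true_mos, predicted_mos))
    var_t = sum((t - mean_t) ** 2 for t in true_mos)
    var_p = sum((p - mean_p) ** 2 for p in predicted_mos)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_t * var_p)


# Hypothetical toy scores (not from the paper): every prediction is 0.5 too high.
true_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_mos = [1.5, 2.5, 3.5, 4.5, 5.5]

print("MSE:", mean_squared_error(true_mos, predicted_mos))
print("Pearson:", pearson_correlation(true_mos, predicted_mos))
```

The toy scores are chosen so that the two statistics disagree: a constant offset leaves the correlation perfect while the mean squared error is nonzero, which shows that the two numbers measure different things.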
