Paper Title
A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis
Paper Authors
Abstract
Speech synthesis has come a long way, and current text-to-speech (TTS) models can now generate natural, human-sounding speech. However, most TTS research focuses on adult speech data, and very little work has been done on child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets. The approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to provide a smaller subset of approximately 19 hours, which formed the basis of our fine-tuning experiments. Both subjective and objective evaluations were performed: a pretrained MOSNet provided the objective evaluation, and a novel subjective framework was used for mean opinion score (MOS) evaluations. Subjective evaluations achieved MOS values of 3.95 for speech intelligibility, 3.89 for voice naturalness, and 3.96 for voice consistency. Objective evaluation using the pretrained MOSNet showed a strong correlation between real and synthetic child voices. Speaker similarity was also verified by calculating the cosine similarity between the embeddings of utterances. An automatic speech recognition (ASR) model was also used to compare the word error rate (WER) of the real and synthetic child voices. The final trained TTS model was able to synthesize child-like speech from reference audio samples as short as 5 seconds.
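As an illustration of two of the objective measures named in the abstract, the following minimal sketch computes cosine similarity between two embedding vectors and a word-level WER between a reference transcript and an ASR hypothesis. It is not the authors' implementation: the abstract does not say which speaker encoder or ASR model was used, so the function names `cosine_similarity` and `word_error_rate`, the plain-list embeddings, and the whitespace tokenization are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumes simple whitespace tokenization; a real evaluation would
    normally also normalize case and punctuation first.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

In such a setup, one would presumably compare embeddings of a real utterance and a synthesized utterance from the same target speaker, and compare the WER obtained on real recordings with the WER obtained on synthetic speech for the same text; values close to each other would suggest that the synthetic speech is about as intelligible to the ASR model as the real speech.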