Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition

Paper title

Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition

Authors

Cai, Xiong; Dai, Dongyang; Wu, Zhiyong; Li, Xiang; Li, Jingbei; Meng, Helen

Abstract

Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, which makes it difficult to obtain such a dataset with additional emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, the proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. First, we train the cross-domain SER model on both SER and TTS datasets. Then, we use the emotion labels that the trained SER model predicts on the TTS dataset to build an auxiliary SER task, and train it jointly with the TTS model. Experimental results show that the proposed method can generate speech with the specified emotional expressiveness while causing almost no degradation in speech quality.
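The two-stage recipe in the abstract (pseudo-label the TTS data with a trained SER model, then train the TTS model jointly with an auxiliary SER task) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the hard argmax pseudo-label, the cross-entropy auxiliary loss, and the `ser_weight` factor that balances the two terms.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(ser_logits):
    """Hypothetical step 1: take the most probable emotion class predicted
    by an already trained cross-domain SER model as the label."""
    probs = softmax(ser_logits)
    return max(range(len(probs)), key=probs.__getitem__)

def cross_entropy(logits, target):
    """Cross-entropy of one prediction against an integer class label."""
    return -math.log(softmax(logits)[target])

def joint_loss(tts_loss, aux_ser_logits, label, ser_weight=1.0):
    """Hypothetical step 2: add the auxiliary SER loss (computed against
    the pseudo-label) to the TTS loss. ser_weight is an assumed knob."""
    return tts_loss + ser_weight * cross_entropy(aux_ser_logits, label)

# Toy usage with made-up numbers for a single utterance.
label = pseudo_label([0.1, 2.0, -1.0])   # SER scores on a TTS utterance
loss = joint_loss(tts_loss=1.0, aux_ser_logits=[0.0, 0.0, 0.0],
                  label=label, ser_weight=0.5)
```

In a real system the two losses would be computed on batches by neural networks and minimized together by gradient descent; the sketch only shows how the predicted labels feed the auxiliary objective.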
