Paper title
Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History
Paper authors
Abstract
We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology to implement this act in spoken dialogue systems. Our model is conditioned on the history of linguistic and prosodic features to predict an appropriate dialogue context. As such, it can be regarded as an extension of conventional dialogue history modeling based on linguistic features alone. To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained on large speech corpora, 2) style-guided training, in which the dialogue context embedding is trained to predict a prosody embedding of the current utterance, 3) cross-modal attention to combine the text and speech modalities, and 4) sentence-wise embedding to achieve finer-grained prosody modeling than utterance-wise modeling. The evaluation results demonstrate that 1) simply considering the prosodic contexts of the dialogue history does not improve speech quality in empathetic DSS and 2) introducing style-guided training and sentence-wise embedding modeling achieves higher speech quality than the conventional method.
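The following is a minimal, illustrative sketch of two ideas named in the abstract: cross-modal attention over the dialogue history and a style-guided training loss. It is not the authors' implementation. The function names, the choice of text embeddings as queries and prosody embeddings as keys and values, the residual sum, the mean pooling, and the mean-squared-error loss are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_history, prosody_history):
    """Fuse text and prosody embeddings of past utterances.

    Assumption: each text embedding acts as a query over the prosody
    embeddings (keys and values), using scaled dot-product attention.
    Both inputs are lists of equal-dimension vectors.
    """
    d = len(text_history[0])
    fused = []
    for query in text_history:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in prosody_history
        ]
        weights = softmax(scores)
        attended = [
            sum(w * value[i] for w, value in zip(weights, prosody_history))
            for i in range(d)
        ]
        # Residual sum of the text query and the attended prosody (assumption).
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


def mean_pool(vectors):
    """Average a list of vectors into one context embedding (assumption)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def style_guided_loss(context_embedding, current_prosody_embedding):
    """Mean squared error between the predicted context embedding and the
    prosody embedding of the current utterance (hypothetical loss choice)."""
    diffs = [(c - p) ** 2 for c, p in zip(context_embedding, current_prosody_embedding)]
    return sum(diffs) / len(diffs)


# Toy dialogue history: two past utterances with 2-dimensional embeddings.
text_history = [[1.0, 0.0], [0.0, 1.0]]
prosody_history = [[0.5, 0.5], [0.2, 0.8]]
context = mean_pool(cross_modal_attention(text_history, prosody_history))
loss = style_guided_loss(context, [0.3, 0.7])
```

In this sketch, the style-guided loss pulls the dialogue context embedding toward the prosody embedding of the utterance being synthesized, so that at inference time the context embedding alone can stand in for the unavailable target prosody. A real system would use learned projections and a neural framework rather than plain lists.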