Self-Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition

Paper title

Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition

Authors

Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Björn Schuller

Abstract

Despite recent advances in speech emotion recognition (SER) within a single-corpus setting, the performance of these SER systems degrades significantly in cross-corpus and cross-language scenarios. The key reason is that SER systems generalise poorly to unseen conditions. To address this issue, recent studies use adversarial methods to learn domain-generalised representations for cross-corpus and cross-language SER. However, many of these methods focus only on cross-corpus SER and do not address the performance degradation in cross-language SER, where the domain gap between source and target language data is larger. This contribution proposes an adversarial dual discriminator (ADDi) network that uses a three-player adversarial game to learn generalised representations without requiring any target data labels. We also introduce a self-supervised ADDi (sADDi) network that utilises self-supervised pre-training with unlabelled data. We propose synthetic data generation as a pretext task in sADDi, which enables the network to produce emotionally discriminative and domain-invariant representations and provides complementary synthetic data to augment the system. The proposed model is rigorously evaluated on five publicly available datasets in three languages and compared with multiple studies on cross-corpus and cross-language SER. Experimental results demonstrate that the proposed model achieves improved performance compared to state-of-the-art methods.

Paper title