Paper Title
Is Style All You Need? Dependencies Between Emotion and GST-based Speaker Recognition
Paper Authors
Paper Abstract
In this work, we study the hypothesis that speaker identity embeddings extracted from speech samples may be used for the detection and classification of emotion. In particular, we show that emotions can be effectively identified by learning speaker identities with a 1-D Triplet Convolutional Neural Network (CNN) and Global Style Token (GST) scheme (e.g., the DeepTalk network) and reusing the trained speaker recognition model weights to generate features in the emotion classification domain. The automatic speaker recognition (ASR) network is trained on the VoxCeleb1, VoxCeleb2, and LibriSpeech datasets with a triplet loss function using speaker identity labels. Using a Support Vector Machine (SVM) classifier, we map speaker identity embeddings into discrete emotion categories from the CREMA-D, IEMOCAP, and MSP-Podcast datasets. On the task of speech emotion detection, we obtain 80.8% accuracy (ACC) with acted emotion samples from CREMA-D, 81.2% ACC with semi-natural emotion samples from IEMOCAP, and 66.9% ACC with natural emotion samples from MSP-Podcast. We also propose a novel two-stage hierarchical classifier (HC) approach that demonstrates a +2% ACC improvement on CREMA-D emotion samples. Through this work, we seek to convey the importance of holistically modeling intra-user variation within audio samples.
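As a rough illustration of the downstream classification step described in the abstract, the following minimal Python/scikit-learn sketch fits an SVM on fixed-dimensional speaker identity embeddings. The synthetic data, the 256-dimensional embedding size, and the six-class label set are placeholder assumptions standing in for features produced by a frozen, pretrained DeepTalk-style (1-D Triplet CNN + GST) speaker encoder; this is not the authors' released code.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Placeholder speaker identity embeddings and emotion labels. In the paper's
    # pipeline these would come from a pretrained speaker recognition encoder
    # applied to CREMA-D / IEMOCAP / MSP-Podcast utterances.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 256))    # (n_utterances, embedding_dim) -- assumed dimensions
    y = rng.integers(0, 6, size=600)   # 6 discrete emotion classes, as in CREMA-D

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )

    # The SVM maps fixed-dimensional speaker embeddings to discrete emotion categories.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_train, y_train)
    print(f"emotion classification ACC: {clf.score(X_test, y_test):.3f}")

With real embeddings in place of the random placeholders, the same pipeline reproduces the structure of the reported experiments: a frozen speaker-identity encoder followed by a lightweight SVM emotion classifier.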