Paper Title
Interpretable Multimodal Emotion Recognition using Hybrid Fusion of Speech and Image Data
Paper Authors
Paper Abstract
This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech and image features leading to the prediction of particular emotion classes. The proposed system's architecture has been determined through extensive ablation studies. It fuses the speech and image features and then combines the speech, image, and intermediate fusion outputs. The proposed interpretability technique incorporates a divide-and-conquer approach to compute Shapley values denoting each speech and image feature's importance. We have also constructed a large-scale dataset (the IIT-R SIER dataset) consisting of speech utterances, corresponding images, and class labels, i.e., 'anger,' 'happy,' 'hate,' and 'sad.' The proposed system has achieved 83.29% accuracy for emotion recognition. The enhanced performance of the proposed system underscores the importance of utilizing complementary information from multiple modalities for emotion recognition.
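As a rough illustration of the hybrid-fusion idea described above (unimodal speech and image branches, an intermediate fusion of their features, and a final combination of the speech, image, and intermediate-fusion outputs), here is a minimal PyTorch-style sketch. The module name, layer choices, and feature dimensions are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class HybridFusionClassifier(nn.Module):
    """Hypothetical sketch of hybrid fusion: speech and image features are
    fused at an intermediate stage, and that intermediate output is then
    combined with both unimodal branches before classification."""

    def __init__(self, speech_dim=128, image_dim=512, hidden_dim=256, num_classes=4):
        super().__init__()
        self.speech_branch = nn.Sequential(nn.Linear(speech_dim, hidden_dim), nn.ReLU())
        self.image_branch = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        # intermediate fusion of the two unimodal embeddings
        self.intermediate_fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU()
        )
        # late combination of speech, image, and intermediate-fusion outputs
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def forward(self, speech_feat, image_feat):
        s = self.speech_branch(speech_feat)
        v = self.image_branch(image_feat)
        f = self.intermediate_fusion(torch.cat([s, v], dim=-1))
        # final prediction over the four classes: anger, happy, hate, sad
        return self.classifier(torch.cat([s, v, f], dim=-1))
```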
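The interpretability component assigns each speech and image feature a Shapley value. For reference, the sketch below shows a standard Monte Carlo permutation estimate of Shapley values for a single input; it is not the paper's divide-and-conquer procedure, and `model_fn`, `baseline`, and the sampling budget are assumptions introduced here for illustration.

```python
import numpy as np

def shapley_importance(model_fn, x, baseline, num_samples=200, seed=None):
    """Estimate per-feature Shapley values for one input `x`.
    `model_fn` maps a feature vector to a scalar score (e.g. the predicted
    probability of the target emotion class); `baseline` supplies the
    values used for "absent" features."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    values = np.zeros(n)
    for _ in range(num_samples):
        perm = rng.permutation(n)
        current = baseline.copy()
        prev_score = model_fn(current)
        for i in perm:
            current[i] = x[i]                 # add feature i to the coalition
            score = model_fn(current)
            values[i] += score - prev_score   # marginal contribution of feature i
            prev_score = score
    return values / num_samples
```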