Paper Title

Temporal aggregation of audio-visual modalities for emotion recognition

Paper Authors

Andreea Birhala, Catalin Nicolae Ristea, Anamaria Radoi, Liviu Cristian Dutu

Paper Abstract

Emotion recognition plays a pivotal role in affective computing and in human-computer interaction. Current technological developments have increased the possibilities of collecting data about a person's emotional state. In general, human perception of the emotion conveyed by a subject is based on vocal and visual information collected during the first seconds of interaction with that subject. As a consequence, the integration of verbal (i.e., speech) and non-verbal (i.e., image) information is the preferred choice in most current approaches to emotion recognition. In this paper, we propose a multimodal fusion technique for emotion recognition that combines audio-visual modalities from a temporal window, with a different temporal offset for each modality. We show that our proposed method outperforms other methods from the literature as well as the human accuracy rating. The experiments are conducted on the open-access multimodal dataset CREMA-D.
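To make the central idea concrete, below is a minimal Python/NumPy sketch of what fusing audio-visual features from a temporal window with a modality-specific temporal offset could look like. This is not the authors' implementation: the feature arrays, window length, offset values, mean pooling, and concatenation are all illustrative assumptions.

```python
import numpy as np

def fuse_with_offsets(audio_feats: np.ndarray,
                      video_feats: np.ndarray,
                      window: int = 16,
                      audio_offset: int = 0,
                      video_offset: int = 4) -> np.ndarray:
    """Fuse per-frame audio and video features from one temporal window.

    audio_feats: (T, Da) per-frame audio embeddings (hypothetical)
    video_feats: (T, Dv) per-frame video embeddings (hypothetical)
    Each modality's window starts at its own offset, which is the
    "different temporal offsets for each modality" idea from the abstract.
    """
    a = audio_feats[audio_offset:audio_offset + window]   # (window, Da)
    v = video_feats[video_offset:video_offset + window]   # (window, Dv)
    # Aggregate each modality over time; mean pooling is a placeholder
    # for whatever temporal aggregation the paper actually learns.
    a_vec = a.mean(axis=0)                                # (Da,)
    v_vec = v.mean(axis=0)                                # (Dv,)
    # Simple late fusion by concatenation, feeding a downstream classifier.
    return np.concatenate([a_vec, v_vec])                 # (Da + Dv,)

# Toy usage: 64 frames of 128-d audio and 512-d video features.
rng = np.random.default_rng(0)
fused = fuse_with_offsets(rng.normal(size=(64, 128)),
                          rng.normal(size=(64, 512)))
print(fused.shape)  # (640,)
```

In the paper itself, the temporal aggregation and fusion are learned by the network; the sketch only illustrates how modality-specific offsets shift each stream's window before the features are combined.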
