Paper Title
Dynamic Emotion Modeling with Learnable Graphs and Graph Inception Network
Paper Authors
Paper Abstract
Human emotion is expressed, perceived and captured using a variety of dynamic data modalities, such as speech (verbal), videos (facial expressions) and motion sensors (body gestures). We propose a generalized approach to emotion recognition that can adapt across modalities by modeling dynamic data as structured graphs. The motivation behind the graph approach is to build compact models without compromising performance. To alleviate the problem of optimal graph construction, we cast this as a joint graph learning and classification task. To this end, we present the Learnable Graph Inception Network (L-GrIN), which jointly learns to recognize emotion and to identify the underlying graph structure in the dynamic data. Our architecture comprises multiple novel components: a new graph convolution operation, a graph inception layer, learnable adjacency, and a learnable pooling function that yields a graph-level embedding. We evaluate the proposed architecture on five benchmark emotion recognition databases spanning three different modalities (video, audio, motion capture), where each database captures one of the following emotional cues: facial expressions, speech and body gestures. We achieve state-of-the-art performance on all five databases, outperforming several competitive baselines and relevant existing methods. Our graph architecture shows superior performance with significantly fewer parameters (compared to convolutional or recurrent neural networks), promising its applicability to resource-constrained devices.
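To make two of the abstract's components concrete, the sketch below illustrates a graph convolution with a learnable adjacency matrix and a learnable pooling that produces a graph-level embedding. This is a minimal, hypothetical PyTorch illustration under our own assumptions (layer sizes, softmax row-normalization, attention-style pooling); it is not the paper's L-GrIN implementation, and the graph inception layer is omitted.

```python
# Illustrative sketch only: a graph layer with a learnable adjacency matrix and a
# learnable pooling that yields a graph-level embedding. NOT the paper's L-GrIN;
# all sizes, normalizations, and the missing inception branches are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableGraphConv(nn.Module):
    """Graph convolution where the adjacency over the N nodes is a trainable parameter."""

    def __init__(self, num_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        # Learnable adjacency, initialized near identity for stable early training.
        self.adj = nn.Parameter(torch.eye(num_nodes) + 0.01 * torch.randn(num_nodes, num_nodes))
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_nodes, in_dim)
        a = torch.softmax(self.adj, dim=-1)      # row-normalize the learned adjacency
        return F.relu(a @ self.weight(x))        # propagate features over the learned graph


class GraphLevelPool(nn.Module):
    """Learnable pooling: attention weights over nodes produce one graph-level embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_nodes, dim) -> (batch, dim)
        alpha = torch.softmax(self.score(x), dim=1)  # per-node importance weights
        return (alpha * x).sum(dim=1)


# Usage example: 25 nodes (e.g., video frames or skeleton joints), 64-dim node
# features, 4 emotion classes. All of these numbers are hypothetical.
if __name__ == "__main__":
    x = torch.randn(8, 25, 64)                   # a batch of dynamic-data graphs
    conv = LearnableGraphConv(num_nodes=25, in_dim=64, out_dim=32)
    pool = GraphLevelPool(32)
    classifier = nn.Linear(32, 4)
    logits = classifier(pool(conv(x)))
    print(logits.shape)                          # torch.Size([8, 4])
```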