Paper Title
Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition
Paper Authors
Paper Abstract
In this paper, we present our submission to the 3rd Affective Behavior Analysis in-the-wild (ABAW) challenge. Learning complex interactions among multimodal sequences is critical to recognising dimensional affect from in-the-wild audiovisual data. Recurrence and attention are the two most widely used sequence modelling mechanisms in the literature. To clearly understand the performance differences between recurrent and attention models in audiovisual affect recognition, we present a comprehensive evaluation of fusion models based on LSTM-RNNs, self-attention and cross-modal attention, trained for valence and arousal estimation. In particular, we study the impact of two key design choices: the modelling complexity of the CNN backbones that provide features to the temporal models, and the use of end-to-end learning. We trained the audiovisual affect recognition models on the in-the-wild ABAW corpus by systematically tuning the hyper-parameters involved in network architecture design and training optimisation. Our extensive evaluation of the audiovisual fusion models shows that LSTM-RNNs can outperform the attention models when coupled with low-complexity CNN backbones and trained in an end-to-end fashion, implying that attention models may not necessarily be the optimal choice for continuous-time multimodal emotion recognition.
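To make the comparison concrete, below is a minimal, illustrative PyTorch sketch of the two fusion mechanisms the abstract contrasts: an LSTM-RNN fusion model and a cross-modal attention fusion model, each mapping per-frame audio and visual features to per-frame valence and arousal. This is not the authors' implementation; all module names, feature dimensions, and hyper-parameters here are assumptions chosen for clarity.

# Illustrative sketch only; dimensions and layer sizes are assumed, not from the paper.
import torch
import torch.nn as nn


class LSTMFusion(nn.Module):
    """Concatenate per-frame audio/visual features and model time with an LSTM."""

    def __init__(self, audio_dim=128, video_dim=512, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim + video_dim, hidden_dim,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)  # per-frame valence and arousal

    def forward(self, audio, video):
        # audio: (B, T, audio_dim), video: (B, T, video_dim)
        fused, _ = self.lstm(torch.cat([audio, video], dim=-1))
        return torch.tanh(self.head(fused))  # (B, T, 2), values in [-1, 1]


class CrossModalAttentionFusion(nn.Module):
    """Each modality attends to the other via multi-head attention."""

    def __init__(self, audio_dim=128, video_dim=512, d_model=256, n_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, 2)

    def forward(self, audio, video):
        a = self.audio_proj(audio)  # (B, T, d_model)
        v = self.video_proj(video)  # (B, T, d_model)
        # Audio queries attend to visual keys/values, and vice versa.
        a_att, _ = self.a2v(a, v, v)
        v_att, _ = self.v2a(v, a, a)
        return torch.tanh(self.head(torch.cat([a_att, v_att], dim=-1)))


if __name__ == "__main__":
    audio = torch.randn(2, 100, 128)   # batch of 2 clips, 100 frames each
    video = torch.randn(2, 100, 512)
    print(LSTMFusion()(audio, video).shape)                 # (2, 100, 2)
    print(CrossModalAttentionFusion()(audio, video).shape)  # (2, 100, 2)

In the end-to-end setting described in the abstract, the audio and video features would come from trainable CNN backbones placed in front of these fusion heads, rather than from pre-extracted features.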