Paper Title
Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild
Paper Authors
Paper Abstract
Previous methods for dynamic facial expression recognition in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore the long-range dependencies in videos. To solve this problem, we propose the Spatio-Temporal Transformer (STT) to capture discriminative features within each frame and to model contextual relationships among frames. Spatio-temporal dependencies are captured and integrated by our unified Transformer. Specifically, given an image sequence consisting of multiple frames as input, we utilize a CNN backbone to translate each frame into a sequence of visual features. Subsequently, spatial attention and temporal attention within each block are jointly applied to learn spatio-temporal representations at the sequence level. In addition, we propose a compact softmax cross-entropy loss to further encourage the learned features to have minimal intra-class distance and maximal inter-class distance. Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and AFEW) indicate that our method provides an effective way to exploit spatial and temporal dependencies for dynamic facial expression recognition. The source code and training logs will be made publicly available.
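The abstract describes the pipeline (CNN backbone, per-block spatial and temporal attention, a compactness-encouraging loss) but not its implementation. Below is a minimal PyTorch sketch of that pipeline under stated assumptions: a ResNet-18 backbone (512-d features), a spatial-then-temporal attention factorization within each block, 7 expression classes (as annotated in DFEW and AFEW), and a center-based compactness term standing in for the unspecified compact softmax cross-entropy loss. All module names, dimensions, and the loss form are illustrative, not the authors' implementation.

```python
# Minimal sketch of an STT-style model; every design detail below is an
# assumption drawn from the abstract, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class STTBlock(nn.Module):
    """One Transformer block: spatial attention over the tokens of each
    frame, then temporal attention over frames at each spatial location."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, d = x.shape                      # (batch, frames, tokens, dim)
        s = x.reshape(b * t, n, d)                # attend within each frame
        sn = self.norm1(s)
        s = s + self.spatial_attn(sn, sn, sn)[0]
        v = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        vn = self.norm2(v)                        # attend across frames
        v = v + self.temporal_attn(vn, vn, vn)[0]
        v = v + self.mlp(self.norm3(v))
        return v.reshape(b, n, t, d).permute(0, 2, 1, 3)


class STT(nn.Module):
    """CNN backbone feeding stacked spatio-temporal Transformer blocks."""

    def __init__(self, num_classes: int = 7, dim: int = 512, depth: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep the feature map
        self.blocks = nn.ModuleList(STTBlock(dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clips: torch.Tensor):
        b, t, c, h, w = clips.shape                   # (batch, frames, 3, H, W)
        f = self.cnn(clips.reshape(b * t, c, h, w))   # (b*t, dim, h', w')
        f = f.flatten(2).transpose(1, 2)              # (b*t, tokens, dim)
        x = f.reshape(b, t, -1, f.shape[-1])
        for blk in self.blocks:
            x = blk(x)
        pooled = x.mean(dim=(1, 2))                   # pool over frames and tokens
        return self.head(pooled), pooled


def compact_ce_loss(logits, features, labels, centers, lam: float = 0.01):
    """Cross entropy plus a hypothetical center-based compactness penalty:
    features are pulled toward their class center (small intra-class
    distance) while the softmax term keeps classes apart."""
    ce = F.cross_entropy(logits, labels)
    compact = ((features - centers[labels]) ** 2).sum(dim=1).mean()
    return ce + lam * compact


# Example: a batch of two 16-frame 112x112 clips.
model = STT()
clips = torch.randn(2, 16, 3, 112, 112)
logits, feats = model(clips)                          # logits: (2, 7)
centers = torch.randn(7, 512)                         # a learnable nn.Parameter in practice
loss = compact_ce_loss(logits, feats, torch.tensor([0, 3]), centers)
```

One note on the factorization: applying spatial and temporal attention as separate passes keeps the cost roughly linear in tokens × frames rather than quadratic over all tokens of all frames at once; whether STT factorizes this way or attends jointly within a block is one of the assumptions above.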