Paper Title
VPN: Learning Video-Pose Embedding for Activities of Daily Living
Paper Authors
Paper Abstract
In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying over time. As a result, ADL may look very similar, and distinguishing them often requires looking at their fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. In order to discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone exploiting the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms the state-of-the-art results for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset, NTU-RGB+D 60; a challenging real-world human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern UCLA.
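The two components summarized in the abstract can be illustrated with a minimal NumPy sketch: both modalities are projected into a common embedding space, and attention weights derived from the pose stream modulate the RGB features. All shapes, variable names, and the random linear projections below are illustrative assumptions, not the paper's actual architecture (which uses learned 3D ConvNet and pose backbones).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: frames, joints, RGB / pose / embedding sizes.
T, J, D_rgb, D_pose, D_emb = 8, 17, 64, 32, 48

rgb_feat  = rng.standard_normal((T, D_rgb))      # per-frame RGB features
pose_feat = rng.standard_normal((T, J, D_pose))  # per-joint 3D pose features

# Spatial embedding: project both modalities into a common semantic space
# (random matrices stand in for learned projections).
W_rgb  = 0.1 * rng.standard_normal((D_rgb, D_emb))
W_pose = 0.1 * rng.standard_normal((D_pose, D_emb))
rgb_emb  = rgb_feat @ W_rgb                      # (T, D_emb)
pose_emb = pose_feat.mean(axis=1) @ W_pose       # pool joints -> (T, D_emb)

# Coupler: pose-driven spatio-temporal attention weights over the video,
# used to re-weight the RGB stream before pooling a video descriptor.
w_attn = rng.standard_normal(D_emb)
attn = softmax(pose_emb @ w_attn)                # (T,), sums to 1
video_desc = (attn[:, None] * rgb_emb).sum(axis=0)  # (D_emb,)
```

In this toy version the attention is purely temporal (one weight per frame); the paper's coupler produces joint spatio-temporal weights, which would correspond to attending over joints as well before pooling.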