Paper Title
VPN: Learning Video-Pose Embedding for Activities of Daily Living
Paper Authors
Paper Abstract
In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying over time. As a result, ADL may look very similar, and distinguishing them often requires looking at their fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. In order to discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone exploiting the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms the state-of-the-art results for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset, NTU-RGB+D 60; a challenging real-world human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern UCLA.
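The two components summarized in the abstract can be illustrated with a minimal NumPy sketch: both modalities are projected into a common embedding space, and attention weights derived from the pose stream modulate the RGB features. All shapes, variable names, and the random linear projections below are illustrative assumptions, not the paper's actual architecture (which uses learned 3D ConvNet and pose backbones).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: frames, joints, RGB / pose / embedding sizes.
T, J, D_rgb, D_pose, D_emb = 8, 17, 64, 32, 48

rgb_feat  = rng.standard_normal((T, D_rgb))      # per-frame RGB features
pose_feat = rng.standard_normal((T, J, D_pose))  # per-joint 3D pose features

# Spatial embedding: project both modalities into a common semantic space
# (random matrices stand in for learned projections).
W_rgb  = 0.1 * rng.standard_normal((D_rgb, D_emb))
W_pose = 0.1 * rng.standard_normal((D_pose, D_emb))
rgb_emb  = rgb_feat @ W_rgb                      # (T, D_emb)
pose_emb = pose_feat.mean(axis=1) @ W_pose       # pool joints -> (T, D_emb)

# Coupler: pose-driven spatio-temporal attention weights over the video,
# used to re-weight the RGB stream before pooling a video descriptor.
w_attn = rng.standard_normal(D_emb)
attn = softmax(pose_emb @ w_attn)                # (T,), sums to 1
video_desc = (attn[:, None] * rgb_emb).sum(axis=0)  # (D_emb,)
```

In this toy version the attention is purely temporal (one weight per frame); the paper's coupler produces joint spatio-temporal weights, which would correspond to attending over joints as well before pooling.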