Paper Title
3DV: 3D Dynamic Voxel for Action Recognition in Depth Video
Paper Authors
Paper Abstract
To facilitate depth-based 3D action recognition, 3D dynamic voxel (3DV) is proposed as a novel 3D motion representation. With 3D space voxelization, the key idea of 3DV is to compactly encode the 3D motion information within a depth video into a regular voxel set (i.e., 3DV) via temporal rank pooling. Each available 3DV voxel intrinsically involves 3D spatial and motion features jointly. 3DV is then abstracted as a point set and fed into PointNet++ for 3D action recognition in an end-to-end learning manner. The intuition for transferring 3DV into point set form is that PointNet++ is lightweight and effective for deep feature learning on point sets. Since 3DV may lose appearance cues, a multi-stream 3D action recognition approach is also proposed to learn motion and appearance features jointly. To extract richer temporal order information from actions, we also divide the depth video into temporal splits and encode this procedure integrally in 3DV. Extensive experiments on 4 well-established benchmark datasets demonstrate the superiority of our proposition. Impressively, we achieve accuracies of 82.4% and 93.5% on NTU RGB+D 120 [13] under the cross-subject and cross-setup test settings, respectively. The code for 3DV is available at https://github.com/3huo/3DV-Action.
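To make the abstract's pipeline concrete, below is a minimal sketch of the 3DV construction idea: voxelize each depth frame into an occupancy grid, apply temporal rank pooling per voxel, and keep the non-zero voxels as a point set (x, y, z, motion value). This is an illustration, not the authors' implementation; it uses the well-known closed-form linear approximation of rank pooling (the "dynamic image" coefficients 2t − T − 1), and the function names, grid size, and normalization scheme are assumptions for the sketch.

```python
import numpy as np

def rank_pool(voxel_seq):
    # Closed-form linear approximation of temporal rank pooling:
    # frame t receives weight 2t - T - 1, so static content cancels
    # out and temporally ordered changes are emphasized.
    T = voxel_seq.shape[0]
    t = np.arange(1, T + 1)
    weights = (2 * t - T - 1).astype(voxel_seq.dtype)
    return np.tensordot(weights, voxel_seq, axes=1)

def depth_frames_to_3dv_points(frames, grid=32):
    # frames: list of (N, 3) point clouds, one per depth frame
    # (assumed already back-projected to camera-space coordinates).
    occ = np.zeros((len(frames), grid, grid, grid), np.float32)
    pts_all = np.concatenate(frames, axis=0)
    lo, hi = pts_all.min(0), pts_all.max(0)
    for i, pts in enumerate(frames):
        # Normalize into the voxel grid and mark occupancy.
        idx = ((pts - lo) / (hi - lo + 1e-8) * (grid - 1)).astype(int)
        occ[i][idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    motion = rank_pool(occ)                # (grid, grid, grid) motion values
    xyz = np.argwhere(motion != 0)         # coordinates of "available" voxels
    feat = motion[xyz[:, 0], xyz[:, 1], xyz[:, 2]]
    # Each row is (x, y, z, motion): the point-set form fed to PointNet++.
    return np.concatenate([xyz.astype(np.float32), feat[:, None]], axis=1)
```

Note the useful property of the rank-pooling weights: for a voxel occupied in every frame the weights sum to zero, so purely static background drops out and only voxels touched by motion survive into the point set.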