Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

Paper Title

Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition

Authors

Zhan Chen, Sicheng Li, Bing Yang, Qinghan Li, Hong Liu

Abstract

Graph convolutional networks have been widely used for skeleton-based action recognition because of their excellent ability to model non-Euclidean data. Because graph convolution is a local operation, it can only exploit short-range joint dependencies and short-term trajectories; it cannot directly model the distant joint relations and long-range temporal information that are vital for distinguishing various actions. To solve this problem, we present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module that enrich the receptive field of the model in the spatial and temporal dimensions. Concretely, the MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features are processed by a series of sub-graph convolutions, and each node can complete multiple spatial and temporal aggregations with its neighbors. The final equivalent receptive field is enlarged accordingly and can capture both short- and long-range dependencies in the spatial and temporal domains. By coupling these two modules as a basic block, we further propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition. The proposed MST-GCN achieves remarkable performance on three challenging benchmark datasets for skeleton-based action recognition: NTU RGB+D, NTU-120 RGB+D, and Kinetics-Skeleton.
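The hierarchical residual idea in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a Res2Net-style channel split, omits all learned weights, and uses a hypothetical five-joint chain as the skeleton graph. The function names (`normalize_adjacency`, `aggregate`, `multi_scale_aggregate`, `nonzero_nodes`) are made up for illustration. It only shows how chaining local aggregations across channel groups enlarges the equivalent receptive field.

```python
def normalize_adjacency(adj):
    """Add self-loops and row-normalize, so each node averages itself and its neighbors."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in a_hat]


def aggregate(a_norm, x):
    """One local (1-hop) aggregation. x is a list of per-node channel lists."""
    n, c = len(x), len(x[0])
    return [[sum(a_norm[i][k] * x[k][ch] for k in range(n)) for ch in range(c)]
            for i in range(n)]


def multi_scale_aggregate(a_norm, x, num_scales):
    """Split channels into groups and chain the aggregations hierarchically.

    Assumption: group g receives the output of group g-1 as a residual input
    before its own aggregation, so its output has seen g+1 hops.
    """
    channels = len(x[0])
    assert channels % num_scales == 0, "channels must divide evenly into groups"
    width = channels // num_scales
    groups = [[row[g * width:(g + 1) * width] for row in x] for g in range(num_scales)]

    outputs, prev = [], None
    for xg in groups:
        if prev is not None:
            # Residual link from the previous group's aggregated output.
            xg = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(xg, prev)]
        prev = aggregate(a_norm, xg)
        outputs.append(prev)

    # Concatenate the groups back along the channel axis.
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


def nonzero_nodes(features, channel):
    """Indices of nodes whose value in the given channel is non-zero."""
    return [i for i, row in enumerate(features) if abs(row[channel]) > 1e-12]


# Hypothetical 5-joint chain: 0 - 1 - 2 - 3 - 4.
chain = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(5)] for i in range(5)]
a_norm = normalize_adjacency(chain)

# Three channels (one per scale); an impulse on joint 0 in the first channel only.
x = [[1.0, 0.0, 0.0]] + [[0.0, 0.0, 0.0] for _ in range(4)]
y = multi_scale_aggregate(a_norm, x, num_scales=3)

reach = [nonzero_nodes(y, ch) for ch in range(3)]
print(reach)  # [[0, 1], [0, 1, 2], [0, 1, 2, 3]]
```

In this toy setup the impulse on joint 0 reaches one hop in the first group, two hops in the second, and three in the third, even though every step is only a 1-hop aggregation. Applying the same chaining to a chain of frames instead of joints gives the analogous effect along the time axis.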
