Paper Title
GTA: Global Temporal Attention for Video Action Understanding
Paper Authors
Paper Abstract
Self-attention learns pairwise interactions to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. We first demonstrate that the entangled modeling of spatio-temporal information by flattening all pixels is sub-optimal, failing to capture temporal relationships among frames explicitly. To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. We apply GTA on both pixels and semantically similar regions to capture temporal relationships at different levels of spatial granularity. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA directly learns a global attention matrix that is intended to encode temporal structures that generalize across different samples. We further extend GTA in a cross-channel multi-head fashion to exploit channel interactions for better temporal modeling. Extensive experiments on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
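To make the decoupled design concrete, below is a minimal PyTorch-style sketch of a global temporal attention layer built from the abstract's description. The class name, tensor shapes, and the way channels are split into heads are illustrative assumptions, not the authors' released implementation; it only demonstrates the core idea that the T x T temporal attention matrix is a directly learned, sample-independent parameter rather than a function of instance-specific queries and keys.

```python
import torch
import torch.nn as nn


class GlobalTemporalAttention(nn.Module):
    """Minimal sketch of global temporal attention (GTA).

    Unlike standard self-attention, the T x T temporal attention matrix
    is a learned parameter shared across all samples, so it can encode
    dataset-level temporal structure instead of instance-specific
    similarity. Shapes and names here are assumptions for illustration.
    """

    def __init__(self, num_frames: int, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        # One global T x T attention matrix per head, learned directly.
        self.attn = nn.Parameter(
            torch.randn(num_heads, num_frames, num_frames) * 0.02
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) video features -- B clips, T frames,
        # N spatial positions (pixels or region tokens), C channels.
        B, T, N, C = x.shape
        h = self.num_heads
        # Split channels into heads (a loose stand-in for the paper's
        # cross-channel multi-head design): (B, h, T, N, C // h).
        v = x.view(B, T, N, h, C // h).permute(0, 3, 1, 2, 4)
        # Row-normalize the shared attention matrix and mix frames.
        a = self.attn.softmax(dim=-1)                   # (h, T, T)
        out = torch.einsum("hts,bhsnc->bhtnc", a, v)    # (B, h, T, N, C // h)
        out = out.permute(0, 2, 3, 1, 4).reshape(B, T, N, C)
        return self.proj(out)


# Usage: mix information across 8 frames of a (2, 8, 49, 256) feature map.
gta = GlobalTemporalAttention(num_frames=8, dim=256)
y = gta(torch.randn(2, 8, 49, 256))  # -> (2, 8, 49, 256)
```

Because the attention matrix is a parameter rather than computed from queries and keys, spatial attention can be handled by a separate module, matching the abstract's decoupled treatment of space and time.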