Paper Title
GTA: Global Temporal Attention for Video Action Understanding
Paper Authors
Paper Abstract
Self-attention learns pairwise interactions to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. We first demonstrate that the entangled modeling of spatio-temporal information by flattening all pixels is sub-optimal, failing to capture temporal relationships among frames explicitly. To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. We apply GTA on both pixels and semantically similar regions to capture temporal relationships at different levels of spatial granularity. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA directly learns a global attention matrix that is intended to encode temporal structures that generalize across different samples. We further extend GTA in a cross-channel multi-head fashion to exploit channel interactions for better temporal modeling. Extensive experiments on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
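To make the decoupled design concrete, below is a minimal PyTorch-style sketch of a global temporal attention layer built from the abstract's description. The class name, tensor shapes, and the way channels are split into heads are illustrative assumptions, not the authors' released implementation; it only demonstrates the core idea that the T x T temporal attention matrix is a directly learned, sample-independent parameter rather than a function of instance-specific queries and keys.

```python
import torch
import torch.nn as nn


class GlobalTemporalAttention(nn.Module):
    """Minimal sketch of global temporal attention (GTA).

    Unlike standard self-attention, the T x T temporal attention matrix
    is a learned parameter shared across all samples, so it can encode
    dataset-level temporal structure instead of instance-specific
    similarity. Shapes and names here are assumptions for illustration.
    """

    def __init__(self, num_frames: int, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        # One global T x T attention matrix per head, learned directly.
        self.attn = nn.Parameter(
            torch.randn(num_heads, num_frames, num_frames) * 0.02
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) video features -- B clips, T frames,
        # N spatial positions (pixels or region tokens), C channels.
        B, T, N, C = x.shape
        h = self.num_heads
        # Split channels into heads (a loose stand-in for the paper's
        # cross-channel multi-head design): (B, h, T, N, C // h).
        v = x.view(B, T, N, h, C // h).permute(0, 3, 1, 2, 4)
        # Row-normalize the shared attention matrix and mix frames.
        a = self.attn.softmax(dim=-1)                   # (h, T, T)
        out = torch.einsum("hts,bhsnc->bhtnc", a, v)    # (B, h, T, N, C // h)
        out = out.permute(0, 2, 3, 1, 4).reshape(B, T, N, C)
        return self.proj(out)


# Usage: mix information across 8 frames of a (2, 8, 49, 256) feature map.
gta = GlobalTemporalAttention(num_frames=8, dim=256)
y = gta(torch.randn(2, 8, 49, 256))  # -> (2, 8, 49, 256)
```

Because the attention matrix is a parameter rather than computed from queries and keys, spatial attention can be handled by a separate module, matching the abstract's decoupled treatment of space and time.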