Paper Title
TAM: Temporal Adaptive Module for Video Recognition
Paper Authors
Paper Abstract
Video data exhibits complex temporal dynamics due to various factors such as camera motion, speed variation, and different activities. To effectively capture these diverse motion patterns, this paper presents a new temporal adaptive module ({\bf TAM}) that generates video-specific temporal kernels based on a video's own feature maps. TAM proposes a unique two-level adaptive modeling scheme by decoupling the dynamic kernel into a location-sensitive importance map and a location-invariant aggregation weight. The importance map is learned in a local temporal window to capture short-term information, while the aggregation weight is generated from a global view with a focus on long-term structure. TAM is a modular block and can be integrated into 2D CNNs to yield a powerful video architecture (TANet) at very small extra computational cost. Extensive experiments on the Kinetics-400 and Something-Something datasets demonstrate that our TAM consistently outperforms other temporal modeling methods and achieves state-of-the-art performance under similar complexity. The code is available at \url{https://github.com/liu-zhy/temporal-adaptive-module}.
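
To make the two-level scheme concrete, below is a minimal PyTorch sketch of a TAM-style block, assuming a TSN-style input layout of (N*T, C, H, W) stacked frame features. The class name TAMSketch, the branch widths, the reduction ratio, and the kernel size are illustrative assumptions rather than the authors' released implementation; see the repository above for the reference code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TAMSketch(nn.Module):
    """Illustrative two-level temporal adaptive block (not the released TANet code).

    Expects stacked frame features of shape (N*T, C, H, W) from a 2D backbone.
    The local branch yields a location-sensitive importance map (short-term);
    the global branch yields a location-invariant aggregation kernel (long-term).
    """

    def __init__(self, channels, n_segment, kernel_size=3, reduction=4):
        super().__init__()
        self.n_segment = n_segment
        self.kernel_size = kernel_size
        # Local branch: temporal 1D convs over the spatially pooled signal.
        self.local = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm1d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 1, bias=False),
            nn.Sigmoid(),
        )
        # Global branch: a per-channel adaptive kernel from the full temporal view.
        self.global_fc = nn.Sequential(
            nn.Linear(n_segment, 2 * n_segment, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(2 * n_segment, kernel_size, bias=False),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        nt, c, h, w = x.size()
        t = self.n_segment
        n = nt // t

        # Spatially pooled per-channel temporal signal: (N, C, T).
        signal = x.view(n, t, c, h, w).mean(dim=[3, 4]).permute(0, 2, 1)

        # Location-sensitive importance map, broadcast over space.
        importance = self.local(signal)                              # (N, C, T)
        x = x * importance.permute(0, 2, 1).reshape(nt, c, 1, 1)

        # Location-invariant aggregation kernel per (sample, channel): (N*C, K).
        kernel = self.global_fc(signal.reshape(n * c, t))
        kernel = kernel.view(n * c, 1, self.kernel_size, 1)

        # Apply the generated kernels as a depthwise temporal convolution.
        x = x.view(n, t, c, h * w).permute(0, 2, 1, 3)               # (N, C, T, H*W)
        x = F.conv2d(x.reshape(1, n * c, t, h * w), kernel,
                     padding=(self.kernel_size // 2, 0), groups=n * c)
        return x.view(n, c, t, h, w).permute(0, 2, 1, 3, 4).reshape(nt, c, h, w)


if __name__ == "__main__":
    # 2 clips x 8 frames, 64 channels, 14x14 spatial grid.
    feats = torch.randn(2 * 8, 64, 14, 14)
    print(TAMSketch(channels=64, n_segment=8)(feats).shape)  # torch.Size([16, 64, 14, 14])

Because the block preserves the (N*T, C, H, W) shape, it can be dropped after a convolution inside a standard 2D ResNet block, which is how a TANet-style architecture would be assembled from an ImageNet-pretrained backbone.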