Paper Title

Look More but Care Less in Video Recognition

Paper Authors

Yitian Zhang, Yue Bai, Huan Wang, Yi Xu, Yun Fu

Abstract

Existing action recognition methods typically sample a few frames to represent each video in order to avoid the enormous computation, which often limits recognition performance. To tackle this problem, we propose the Ample and Focal Network (AFNet), which is composed of two branches so as to utilize more frames with less computation. Specifically, the Ample Branch takes all input frames to obtain abundant information with condensed computation and provides guidance for the Focal Branch through the proposed Navigation Module; the Focal Branch squeezes the temporal size to focus only on the salient frames at each convolution block; in the end, the results of the two branches are adaptively fused to prevent the loss of information. With this design, we can introduce more frames to the network at a smaller computational cost. Besides, we demonstrate that AFNet can utilize fewer frames while achieving higher accuracy, as the dynamic selection in intermediate features enforces implicit temporal modeling. Further, we show that our method can be extended to reduce spatial redundancy at even lower cost. Extensive experiments on five datasets demonstrate the effectiveness and efficiency of our method.
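
To make the two-branch design concrete, below is a minimal PyTorch-style sketch of one AFNet-like block. The module layout, the 2x spatial downsampling in the ample branch, the soft sigmoid frame gates (standing in for the paper's learned frame-selection mechanism), and all names such as `AFBlock` are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of one AFNet-style two-branch block, assuming 5D video
# tensors of shape (N, C, T, H, W). Module names and design details here
# (downsample ratio, soft gates instead of hard frame selection) are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AFBlock(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        # Ample branch: sees all T frames, but at reduced spatial
        # resolution and channel width, so its computation stays condensed.
        self.ample = nn.Conv3d(c_in, c_out // 2, kernel_size=3, padding=1)
        # Navigation module: scores each frame's salience from the ample
        # features (global spatial pooling -> pointwise temporal conv).
        self.navigate = nn.Conv1d(c_out // 2, 1, kernel_size=1)
        # Focal branch: full-width convolution whose output is gated so it
        # effectively concentrates on the salient frames.
        self.focal = nn.Conv3d(c_in, c_out, kernel_size=3, padding=1)
        # Lightweight projection so the two branches can be fused.
        self.up = nn.Conv3d(c_out // 2, c_out, kernel_size=1)
        # Adaptive fusion weight (a single scalar here for simplicity).
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # --- Ample branch on 2x spatially downsampled frames ---
        x_small = F.avg_pool3d(x, kernel_size=(1, 2, 2))
        a = F.relu(self.ample(x_small))              # (N, C_out/2, T, H/2, W/2)
        # --- Navigation: one salience score per frame ---
        pooled = a.mean(dim=(3, 4))                  # (N, C_out/2, T)
        gate = torch.sigmoid(self.navigate(pooled))  # (N, 1, T), soft mask
        gate = gate.view(n, 1, t, 1, 1)
        # --- Focal branch, gated toward the salient frames ---
        f = F.relu(self.focal(x)) * gate             # (N, C_out, T, H, W)
        # --- Fuse: upsample ample features and blend adaptively ---
        a_up = F.interpolate(self.up(a), size=(t, h, w))
        return self.alpha * f + (1 - self.alpha) * a_up


if __name__ == "__main__":
    block = AFBlock(c_in=3, c_out=16)
    video = torch.randn(2, 3, 8, 32, 32)  # batch of 2 clips, 8 frames each
    print(block(video).shape)             # torch.Size([2, 16, 8, 32, 32])
```

The key idea the sketch captures: the ample branch looks at every frame cheaply, its features drive a per-frame salience gate, and the focal branch's full-width computation is concentrated on the frames that gate marks as salient before the two outputs are fused.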
