MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Paper Title

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Authors

Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding

Abstract

Conventionally, the spatiotemporal modeling network and its complexity are the two most studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First, besides traditionally treating H x W x T video frames as a space-time signal (viewed from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Second, our model is built on 2D CNN backbones, with model complexity kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework that specializes to existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet achieves state-of-the-art performance with the complexity of a 2D CNN.
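The abstract does not spell out the internals of the MVF module, so the following is only a rough NumPy sketch of one plausible reading: channel-wise (separable) 3-tap convolutions applied along the T, H, and W axes of a feature tensor, fused by summation and added back to the input. The function names, the `alpha` channel ratio, the residual sum, and the kernel shapes are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def depthwise_conv1d(x, kernels, axis):
    """Channel-wise 3-tap cross-correlation along one axis, zero padded.

    x:       array of shape (C, T, H, W)
    kernels: array of shape (C, 3), one 3-tap filter per channel
    axis:    1 (T), 2 (H) or 3 (W)
    """
    c, n = x.shape[0], x.shape[axis]
    pad = [(0, 0)] * x.ndim
    pad[axis] = (1, 1)
    xp = np.pad(x, pad)  # zero padding keeps the output size equal to the input
    out = np.zeros(x.shape, dtype=float)
    for k in range(3):
        window = [slice(None)] * x.ndim
        window[axis] = slice(k, k + n)
        # Each channel is scaled by its own tap, so no channel mixing occurs.
        out += kernels[:, k].reshape(c, 1, 1, 1) * xp[tuple(window)]
    return out


def mvf_module(x, w_t, w_h, w_w, alpha=0.5):
    """Hypothetical multi-view fusion step on a (C, T, H, W) feature tensor.

    The first int(alpha * C) channels receive the three 1D convolutions;
    the remaining channels pass through unchanged (assumed design).
    """
    c_mv = int(alpha * x.shape[0])
    x_mv, x_rest = x[:c_mv], x[c_mv:]
    y_t = depthwise_conv1d(x_mv, w_t, axis=1)  # along time
    y_h = depthwise_conv1d(x_mv, w_h, axis=2)  # along height
    y_w = depthwise_conv1d(x_mv, w_w, axis=3)  # along width
    fused = x_mv + y_t + y_h + y_w             # assumed residual fusion
    return np.concatenate([fused, x_rest], axis=0)


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 6, 6))   # C=8, T=4, H=W=6
w_t, w_h, w_w = (rng.standard_normal((4, 3)) for _ in range(3))
print(mvf_module(features, w_t, w_h, w_w).shape)  # (8, 4, 6, 6)
```

In this sketch the output shape always matches the input, which is what would let such a block sit between existing 2D layers. With all kernels set to zero it reduces to the identity, leaving the backbone purely 2D, and a one-hot temporal kernel turns the temporal branch into a one-frame shift; these are properties of the sketch and only loosely echo the specializations named in the abstract.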
