Paper Title

V4D: 4D Convolutional Neural Networks for Video-level Representation Learning

Authors

Shiwen Zhang, Sheng Guo, Weilin Huang, Matthew R. Scott, Limin Wang

Abstract

Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider the video-level temporal evolution of spatio-temporal features. In this paper, we propose Video-level 4D Convolutional Neural Networks, referred to as V4D, to model the evolution of long-range spatio-temporal representations with 4D convolutions, while preserving strong 3D spatio-temporal representations with residual connections. Specifically, we design a new 4D residual block able to capture inter-clip interactions, which can enhance the representation power of the original clip-level 3D CNNs. The 4D residual blocks can be easily integrated into existing 3D CNNs to perform long-range modeling hierarchically. We further introduce training and inference methods for the proposed V4D. Extensive experiments are conducted on three video recognition benchmarks, where V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
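To make the core idea concrete, here is a minimal, didactic sketch of a 4D convolution over stacked clips with a residual connection. This is naive NumPy with explicit loops, not the paper's efficient implementation; the tensor layout `(U clips, C channels, T, H, W)`, the function names, and the "same"-padding choice are illustrative assumptions.

```python
import numpy as np

def conv4d(x, w):
    """Naive 4D convolution with zero 'same' padding over the
    (U, T, H, W) dimensions of an input shaped (U, C_in, T, H, W).
    The filter w is shaped (C_out, C_in, kU, kT, kH, kW), so it
    mixes information across clips (the U axis) as well as across
    space and time -- the inter-clip interaction the paper targets.
    """
    U, Cin, T, H, W = x.shape
    Cout, _, kU, kT, kH, kW = w.shape
    pU, pT, pH, pW = kU // 2, kT // 2, kH // 2, kW // 2
    # Pad every convolved axis; channels are contracted, not padded.
    xp = np.pad(x, ((pU, pU), (0, 0), (pT, pT), (pH, pH), (pW, pW)))
    # Reorder filters to (C_out, kU, C_in, kT, kH, kW) to match patches.
    wt = w.transpose(0, 2, 1, 3, 4, 5)
    y = np.zeros((U, Cout, T, H, W))
    for u in range(U):
        for t in range(T):
            for h in range(H):
                for c in range(W):
                    # Patch shape: (kU, C_in, kT, kH, kW).
                    patch = xp[u:u+kU, :, t:t+kT, h:h+kH, c:c+kW]
                    y[u, :, t, h, c] = np.einsum('ucthw,oucthw->o',
                                                 patch, wt)
    return y

def residual_4d_block(x, w):
    """Residual connection around the 4D convolution (requires
    C_out == C_in), followed by a ReLU, so the clip-level 3D
    features are preserved and the 4D path only adds to them."""
    return np.maximum(x + conv4d(x, w), 0.0)
```

With all-zero filters the block reduces to ReLU of the identity path, illustrating why the residual design preserves the original clip-level representation even before the 4D path has learned anything.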
