Paper Title

Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

Authors

Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, Yun-Hui Liu

Abstract

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. A neural network is then built and trained to yield these statistical summaries given the video frames as inputs. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and needs only impressions of rough spatial locations to understand visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D, and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks: action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
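To make the pretext task concrete, the following is a minimal NumPy sketch of one of the statistics described above: locating the coarse spatial block with the largest motion. The function name, grid size, and use of frame differencing as a motion proxy are all illustrative assumptions (the paper derives its motion statistics from optical flow and uses several partitioning patterns); this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def largest_motion_block(clip, grid=(3, 3)):
    """Return the (row, col) index of the coarse spatial block with the
    largest accumulated motion in a grayscale clip of shape (T, H, W).

    NOTE: frame differencing is a crude stand-in for the optical-flow
    statistics used in the paper; the coarse grid index stands in for
    the paper's spatial partitioning patterns.
    """
    # Per-pixel motion proxy: |frame_{t+1} - frame_t|, summed over time.
    motion = np.abs(np.diff(clip, axis=0)).sum(axis=0)  # shape (H, W)

    rows, cols = grid
    h, w = motion.shape
    block_sums = np.zeros(grid)
    for r in range(rows):
        for c in range(cols):
            block = motion[r * h // rows:(r + 1) * h // rows,
                           c * w // cols:(c + 1) * w // cols]
            block_sums[r, c] = block.sum()

    # The pretext label is this coarse block index, not exact coordinates,
    # which is what makes the prediction target easier to learn.
    return np.unravel_index(np.argmax(block_sums), grid)
```

Predicting such a coarse block index (a small classification target) rather than exact Cartesian coordinates is exactly the simplification the abstract motivates: the network only needs an impression of where the dominant motion is.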
