Paper Title

RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

Paper Authors

Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, Chuang Gan

Abstract

We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition. This task, however, is extremely challenging due to 1) the highly complex spatial-temporal information in videos; and 2) the lack of labeled data for training. Unlike representation learning for static images, it is difficult to construct a suitable self-supervised task that models both motion and appearance features well. More recently, several attempts have been made to learn video representations through video playback speed prediction. However, it is non-trivial to obtain precise speed labels for videos. More critically, the learnt models may tend to focus on motion patterns and thus may not learn appearance features well. In this paper, we observe that the relative playback speed is more consistent with motion patterns, and thus provides more effective and stable supervision for representation learning. Therefore, we propose a new way to perceive playback speed and exploit the relative speed between two video clips as the label. In this way, we are able to perceive speed well and learn better motion features. Moreover, to ensure the learning of appearance features, we further propose an appearance-focused task, where we force the model to perceive the appearance difference between two video clips. We show that optimizing the two tasks jointly consistently improves the performance on two downstream tasks, namely action recognition and video retrieval. Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy without the use of labeled data for pre-training, which outperforms the ImageNet supervised pre-trained model. Code and pre-trained models can be found at https://github.com/PeihaoChen/RSPNet.
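
The abstract describes building supervision from the relative playback speed between two clips drawn from the same video. The snippet below is a minimal sketch of how such relative-speed labels could be constructed, assuming playback speed is simulated by frame stride; the function names (sample_clip, make_relative_speed_pair), the speed set, and the three-way label encoding are illustrative assumptions, not the authors' implementation (see the linked repository for that).

import numpy as np

def sample_clip(video, start, clip_len, speed):
    """Sample clip_len frames from video starting at index `start`,
    taking every `speed`-th frame to simulate a faster playback speed.
    `video` is an array of shape (num_frames, H, W, C)."""
    indices = start + np.arange(clip_len) * speed
    return video[indices]

def make_relative_speed_pair(video, clip_len=16, speeds=(1, 2, 4), rng=None):
    """Sample two clips at randomly chosen playback speeds and return them
    together with a relative-speed label:
        0 -> clip_a is slower than clip_b
        1 -> both clips play at the same speed
        2 -> clip_a is faster than clip_b
    Only the relative ordering is used as supervision, so the label does not
    depend on the absolute (and often unknown) frame rate of the source video."""
    rng = rng if rng is not None else np.random.default_rng()
    num_frames = video.shape[0]
    speed_a, speed_b = rng.choice(speeds, size=2)
    # Latest valid start indices so that all sampled frame indices stay in range.
    max_start_a = num_frames - clip_len * speed_a
    max_start_b = num_frames - clip_len * speed_b
    clip_a = sample_clip(video, rng.integers(0, max_start_a), clip_len, speed_a)
    clip_b = sample_clip(video, rng.integers(0, max_start_b), clip_len, speed_b)
    label = int(np.sign(speed_a - speed_b)) + 1  # maps {-1, 0, +1} to {0, 1, 2}
    return clip_a, clip_b, label

# Toy usage with a random "video" of 300 frames of size 8x8x3.
video = np.random.rand(300, 8, 8, 3)
clip_a, clip_b, label = make_relative_speed_pair(video)
print(clip_a.shape, clip_b.shape, label)  # (16, 8, 8, 3) (16, 8, 8, 3) 0/1/2

Because only the ordering of the two sampled speeds matters, this kind of label avoids the difficulty the abstract points out with absolute speed prediction, where the same nominal playback speed can correspond to very different motion magnitudes across videos.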
