Paper Title
Unsupervised Learning of Video Representations via Dense Trajectory Clustering
Paper Authors

Paper Abstract
This paper addresses the task of unsupervised learning of representations for action recognition in videos. Previous works proposed to utilize future prediction or other domain-specific objectives to train a network, but achieved only limited success. In contrast, in the related field of image representation learning, simpler, discrimination-based methods have recently bridged the gap to fully supervised performance. We first propose to adapt two of the top-performing objectives in this class, instance recognition and local aggregation, to the video domain. In particular, the latter approach iterates between clustering the videos in the feature space of a network and updating the network to respect the clusters with a non-parametric classification loss. We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns, grouping the videos based on appearance. To mitigate this issue, we turn to the heuristic-based IDT descriptors, which were manually designed to encode motion patterns in videos. We form the clusters in the IDT space, using these descriptors as an unsupervised prior in the iterative local aggregation algorithm. Our experiments demonstrate that this approach outperforms prior work on the UCF101 and HMDB51 action recognition benchmarks. We also qualitatively analyze the learned representations and show that they successfully capture video dynamics.
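To make the alternating procedure in the abstract concrete, here is a minimal sketch, not the authors' implementation: IDT descriptors are clustered once with k-means to provide the unsupervised prior, and the network then alternates between re-clustering videos in its own feature space and gradient updates under a non-parametric classification loss over cluster centroids. `encoder`, `videos`, and `idt_descriptors` are hypothetical stand-ins, and details of the paper's actual local aggregation objective (memory bank, neighbor sets, schedules) are omitted.

```python
# Sketch of iterative clustering + non-parametric classification,
# bootstrapped by an IDT-space prior. Assumes small, in-memory data.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def cluster_loss(feats, centroids, assignments, temperature=0.07):
    """Non-parametric classification loss: softmax over cosine
    similarities to cluster centroids, with each video's current
    cluster as its target class."""
    z = F.normalize(feats, dim=1)
    c = F.normalize(centroids, dim=1)
    return F.cross_entropy(z @ c.t() / temperature, assignments)


def train(encoder, videos, idt_descriptors, k=128, rounds=10, steps=50):
    # Unsupervised prior: cluster the hand-crafted IDT motion
    # descriptors once; round 0 trains against these assignments.
    assignments = KMeans(n_clusters=k, n_init=10).fit_predict(idt_descriptors)
    opt = torch.optim.SGD(encoder.parameters(), lr=1e-2, momentum=0.9)
    for r in range(rounds):
        if r > 0:  # later rounds re-cluster in the learned feature space
            with torch.no_grad():
                feats = encoder(videos).cpu().numpy()
            assignments = KMeans(n_clusters=k, n_init=10).fit_predict(feats)
        targets = torch.as_tensor(assignments, dtype=torch.long)
        for _ in range(steps):  # update the network to respect the clusters
            feats = encoder(videos)
            # centroid of each cluster in the current feature space
            centroids = torch.stack(
                [feats[targets == j].mean(0) for j in range(k)])
            loss = cluster_loss(feats, centroids, targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

The key design point the sketch illustrates is that the first clustering happens in IDT space rather than in the network's (appearance-dominated) feature space, so the motion-based grouping steers what the network learns before its own features take over.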