Paper Title
Self-Supervised Learning for Videos: A Survey
Paper Authors
Paper Abstract
The remarkable success of deep learning across various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and labor-intensive, which is especially challenging for videos. Moreover, relying on human-generated annotations leads to models with biased learning, poor domain generalization, and limited robustness. As an alternative, self-supervised learning provides a way to learn representations without annotations and has shown promise in both the image and video domains. Unlike the image domain, learning video representations is more challenging due to the temporal dimension, which brings in motion and other environmental dynamics. This also creates opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domains. In this survey, we review existing approaches to self-supervised learning with a focus on the video domain. We group these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and potential future directions in this area.
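To make the third category (contrastive learning) concrete, below is a minimal, illustrative sketch of an InfoNCE-style objective over video clip embeddings, where two augmented clips from the same video form a positive pair and clips from other videos serve as negatives. The function name infonce_loss and the PyTorch setup are assumptions for illustration, not code from the surveyed works.

# Minimal sketch of an InfoNCE-style contrastive objective for video clips
# (illustrative only; hypothetical helper, not from the survey).
import torch
import torch.nn.functional as F

def infonce_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: (batch, dim) embeddings of two clips sampled from the same videos."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                     # pairwise similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage with random embeddings standing in for a video encoder's output.
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
loss = infonce_loss(z_a, z_b)

Pretext-task, generative, and cross-modal methods differ mainly in how the training signal is constructed (e.g., predicting clip order, reconstructing masked frames, or aligning video with audio or text), but they can be framed with a similar encoder-plus-objective structure.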