Paper Title

Self-Supervised MultiModal Versatile Networks

Paper Authors

Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

Paper Abstract

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
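
The abstract's central design point is that video and audio share a fine-grained embedding space, while text joins them only in a coarser common space reached by a further projection. Below is a minimal PyTorch sketch of that idea; the module names, feature dimensions, linear projection heads, and the simple symmetric InfoNCE loss are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of the "fine and coarse" multimodal embedding idea from the
# abstract. All names, dimensions and the loss choice below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VersatileHeads(nn.Module):
    """Project per-modality backbone features so that video and audio meet in a
    fine-grained space, while text enters only a coarser, shared space."""
    def __init__(self, dv=512, da=512, dt=300, d_va=512, d_vat=256):
        super().__init__()
        self.v_to_va = nn.Linear(dv, d_va)        # video -> fine-grained va space
        self.a_to_va = nn.Linear(da, d_va)        # audio -> fine-grained va space
        self.va_to_vat = nn.Linear(d_va, d_vat)   # va space -> coarse vat space
        self.t_to_vat = nn.Linear(dt, d_vat)      # text enters only the vat space

    def forward(self, v_feat, a_feat, t_feat):
        v_va = F.normalize(self.v_to_va(v_feat), dim=-1)
        a_va = F.normalize(self.a_to_va(a_feat), dim=-1)
        # Video reaches the text space through the deeper composed projection.
        v_vat = F.normalize(self.va_to_vat(self.v_to_va(v_feat)), dim=-1)
        t_vat = F.normalize(self.t_to_vat(t_feat), dim=-1)
        return v_va, a_va, v_vat, t_vat

def nce_loss(x, y, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching (x_i, y_i) pairs are positives,
    all other pairings in the batch serve as negatives."""
    logits = x @ y.t() / temperature
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for backbone outputs.
heads = VersatileHeads()
v = torch.randn(8, 512)   # video backbone features
a = torch.randn(8, 512)   # audio backbone features
t = torch.randn(8, 300)   # text (word-embedding) features
v_va, a_va, v_vat, t_vat = heads(v, a, t)
loss = nce_loss(v_va, a_va) + nce_loss(v_vat, t_vat)
loss.backward()
```

The asymmetry is the point of the design: audio-video correspondence is precise enough to support a fine-grained shared space, whereas narration text is only loosely aligned with what is on screen, so it is matched to video in a coarser space.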
