Paper Title

VirTex: Learning Visual Representations from Textual Annotations

Authors

Karan Desai, Justin Johnson

Abstract

The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.
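
The recipe described in the abstract is: train a convolutional backbone from scratch, jointly with a textual head that generates captions from image features, then discard the head and transfer the backbone to downstream recognition tasks. Below is a minimal PyTorch sketch of this setup, not the authors' released implementation: the ResNet-50 backbone matches the paper, but the single-layer unidirectional decoder, vocabulary size, and hidden dimension are illustrative assumptions (the paper uses bidirectional Transformer decoders).

```python
import torch
import torch.nn as nn
import torchvision


class VirTexSketch(nn.Module):
    """Illustrative VirTex-style model: visual backbone + captioning head."""

    def __init__(self, vocab_size=10000, hidden_dim=512):
        super().__init__()
        # Visual backbone: ResNet-50 trained from scratch (no ImageNet weights).
        resnet = torchvision.models.resnet50(weights=None)
        # Drop the average pool and classifier to keep the spatial feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Linear(2048, hidden_dim)

        # Textual head: a Transformer decoder that attends to image features
        # and predicts caption tokens autoregressively.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, caption_tokens):
        # images: (B, 3, H, W); caption_tokens: (B, T) integer token ids.
        feats = self.backbone(images)             # (B, 2048, h, w)
        feats = feats.flatten(2).transpose(1, 2)  # (B, h*w, 2048)
        memory = self.proj(feats)                 # (B, h*w, hidden_dim)
        tgt = self.embed(caption_tokens)          # (B, T, hidden_dim)
        # Causal mask so each position only sees earlier caption tokens.
        T = tgt.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                   # (B, T, vocab_size) logits


model = VirTexSketch()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 15))
logits = model(images, tokens[:, :-1])
# Next-token cross-entropy trains the backbone and the head jointly.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
# After pretraining, model.backbone alone is transferred to classification,
# detection, or instance segmentation.
```

In this scheme the captioning head exists only to supply a semantically dense training signal; at transfer time everything except the backbone is thrown away, which is what lets the comparison against ImageNet-pretrained backbones be made on equal footing.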
