Paper Title

Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Paper Authors

Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan

Paper Abstract

Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision, which helps them generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are a pretrained CLIP model and image-level supervision. We note that neither mode of supervision is optimally aligned with the detection task: CLIP is trained on image-text pairs and lacks precise object localization, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground objects with only image-level supervision, using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object-centric and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 36.6 AP50 on novel classes, an absolute gain of 8.2 over the previous best performance. On LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP on rare categories and 3.4 overall. Code: https://github.com/hanoonaR/object-centric-ovd.
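The abstract names three ingredients: object-centric distillation from CLIP region embeddings, image-level supervision turned into pseudo-labeled proposals, and a weight transfer function that couples the two. Below is a minimal PyTorch sketch of how such signals could fit together. All names (`rkd_loss`, `pis_loss`, `WeightTransfer`) and shapes are hypothetical illustrations inferred from the abstract, not the authors' implementation; see the linked repository for the actual code.

```python
# Illustrative sketch only: hypothetical losses and a weight transfer module
# inferred from the abstract, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rkd_loss(region_embeds, clip_region_embeds):
    """Region-based distillation: pull the detector's region embeddings
    toward CLIP embeddings of the same cropped regions, so the language
    alignment becomes object-centric rather than image-centric."""
    return F.l1_loss(region_embeds, clip_region_embeds)

def pis_loss(region_embeds, text_embeds, pseudo_labels):
    """Pseudo-labeled image-level supervision: classify high-quality
    class-agnostic proposals against text embeddings of the image's weak
    labels, expanding the vocabulary seen during training."""
    logits = region_embeds @ text_embeds.t()   # (num_regions, num_classes)
    return F.cross_entropy(logits, pseudo_labels)

class WeightTransfer(nn.Module):
    """Hypothetical weight transfer function: rather than learning the two
    projection heads independently, the pseudo-label head's weights are
    predicted from the distillation head's weights, letting the two
    alignment strategies share complementary strengths."""
    def __init__(self, dim):
        super().__init__()
        self.transfer = nn.Sequential(
            nn.Linear(dim, dim), nn.LeakyReLU(), nn.Linear(dim, dim))

    def forward(self, rkd_head_weight):        # (dim, dim) weight matrix
        return self.transfer(rkd_head_weight)  # weights for the PIS head

# Toy usage with random tensors (dim=512, 4 regions, 8 classes):
dim, n, c = 512, 4, 8
regions, clip_regions = torch.randn(n, dim), torch.randn(n, dim)
texts, labels = torch.randn(c, dim), torch.randint(0, c, (n,))
wt = WeightTransfer(dim)
loss = rkd_loss(regions, clip_regions) + pis_loss(regions, texts, labels)
```

The design point the sketch tries to convey is the coupling: by deriving one head's weights from the other's, the pseudo-label branch cannot drift away from the CLIP-aligned, object-centric embedding space, which is how the paper describes bridging the two forms of weak supervision.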
