Paper Title
Open-Vocabulary DETR with Conditional Matching
Paper Authors
Paper Abstract
Open-vocabulary object detection, which is concerned with detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and a better user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR -- hence the name OV-DETR -- which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge in turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix for novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as binary matching between input queries (class names or exemplar images) and their corresponding objects, which learns useful correspondences that generalize to unseen queries during testing. For training, we condition the Transformer decoder on input embeddings obtained from a pre-trained vision-language model such as CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that OV-DETR -- the first end-to-end Transformer-based open-vocabulary detector -- achieves non-trivial improvements over the current state of the art.
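To make the abstract's two key ideas concrete, the following is a minimal NumPy sketch (not the authors' code) of (1) conditioning DETR-style object queries on a frozen CLIP embedding of the user input, and (2) replacing the per-class classification head with a single binary matching score. All dimensions, the additive fusion, and the stand-in linear head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8              # embedding dimension (illustrative, not the paper's)
num_queries = 4    # number of object queries (illustrative)

# A frozen CLIP embedding of the user's input query; a class name
# ("cat") or an exemplar image would each be encoded to such a vector.
clip_embed = rng.normal(size=d)

# Learnable object queries, as in standard DETR.
object_queries = rng.normal(size=(num_queries, d))

# Conditioning: fuse the CLIP embedding into every object query before
# the Transformer decoder. Simple additive fusion is one possible choice.
conditioned = object_queries + clip_embed          # shape (4, 8)

# Binary matching head: instead of C-way class logits, each conditioned
# query yields one probability -- "does this box contain the queried
# object?" A random linear map stands in for the decoder + head here.
logits = conditioned @ rng.normal(size=d)          # shape (4,)
match_prob = 1.0 / (1.0 + np.exp(-logits))         # sigmoid in [0, 1]

print(conditioned.shape)
print(match_prob.shape)
```

Because the matching score depends only on the conditioned query and not on a fixed class vocabulary, the same head can be reused at test time for any unseen class name or exemplar image, which is what sidesteps the missing classification cost matrix for novel classes.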