Paper title
Open-Vocabulary Object Detection Using Captions
Paper authors
Paper abstract
Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. Particularly, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at a significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotation can be detected almost as accurately as supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection.