Paper Title
InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
Paper Authors
Paper Abstract
Common Deep Metric Learning (DML) datasets specify only one notion of similarity, e.g., two images in the Cars196 dataset are deemed similar if they show the same car model. We argue that, depending on the application, users of image retrieval systems have different and changing similarity notions that should be incorporated as easily as possible. Therefore, we present Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting in which users control the properties that should be important for image representations without training data, using only natural language. To this end, we propose InDiReCT (Image representations using Dimensionality Reduction on CLIP embedded Texts), a model for LanZ-DML on images that exclusively uses a few text prompts for training. InDiReCT utilizes CLIP as a fixed feature extractor for images and texts and transfers the variation in text prompt embeddings to the image embedding space. Extensive experiments on five datasets and thirteen similarity notions in total show that, despite not seeing any images during training, InDiReCT performs better than strong baselines and approaches the performance of fully-supervised models. An analysis reveals that InDiReCT learns to focus on regions of the image that correlate with the desired similarity notion, making it a fast-to-train and easy-to-use method for creating custom embedding spaces using only natural language.
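The pipeline described in the abstract can be sketched in a few lines: embed prompts that vary only in the property of interest with a frozen CLIP text encoder, fit a dimensionality reduction on those text embeddings, and apply the learned projection to CLIP image embeddings. The sketch below is a minimal illustration, assuming OpenAI's reference `clip` package and plain PCA as the dimensionality reduction; the paper's actual optimization objective may differ, and the prompt template and class names are hypothetical.

```python
# Minimal sketch of the InDiReCT idea, assuming PCA as the dimensionality
# reduction (an assumption; the paper's objective may differ) and the
# reference "clip" package from OpenAI.
import clip
import torch
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text prompts that vary only in the desired similarity notion
# (here: car model; template and names are illustrative).
car_models = ["BMW X5", "Audi A4", "Tesla Model 3", "Ford Mustang"]
prompts = [f"a photo of a {m}" for m in car_models]

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_emb = model.encode_text(tokens).float()
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# "Training": fit the dimensionality reduction on text embeddings only,
# capturing the directions along which the prompts vary. No images are seen.
pca = PCA(n_components=2)
pca.fit(text_emb.cpu().numpy())

def embed_image(pil_image):
    """Embed an image with frozen CLIP, then apply the text-derived
    projection, transferring the prompt variation to the image space."""
    with torch.no_grad():
        x = preprocess(pil_image).unsqueeze(0).to(device)
        img_emb = model.encode_image(x).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return pca.transform(img_emb.cpu().numpy())
```

Because only a handful of text prompts pass through the frozen encoder, fitting the projection takes seconds; switching to a different similarity notion (e.g., car color instead of car model) only requires swapping the prompt list.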