Title

Progressive Learning for Image Retrieval with Hybrid-Modality Queries

Authors

Yida Zhao, Yuqing Song, Qin Jin

Abstract

Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task where the search intention is expressed in a more complex query format, involving both vision and text modalities. For example, a target product image is searched using a reference product image along with text about changing certain attributes of the reference image as the query. It is a more challenging image retrieval task that requires both semantic space learning and cross-modal fusion. Previous approaches that attempt to deal with both aspects achieve unsatisfactory performance. In this paper, we decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries. We first leverage the semantic embedding space for open-domain image-text retrieval, and then transfer the learned knowledge to the fashion-domain with fashion-related pre-training tasks. Finally, we enhance the pre-trained model from single-query to hybrid-modality query for the CTI-IR task. Furthermore, as the contribution of individual modality in the hybrid-modality query varies for different retrieval scenarios, we propose a self-supervised adaptive weighting strategy to dynamically determine the importance of image and text in the hybrid-modality query for better retrieval. Extensive experiments show that our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
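
The abstract mentions a self-supervised adaptive weighting strategy that dynamically balances the image and text parts of the hybrid-modality query. The paper's exact architecture is not given on this page, so the snippet below is only a minimal PyTorch sketch of the general idea: a small gating network predicts per-query weights for the two modality embeddings before fusing them into a single retrieval query. All names, dimensions, and the gating design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveWeightedFusion(nn.Module):
    """Hypothetical sketch of adaptive weighting for a hybrid-modality query.

    A gating network predicts one weight per modality, and the reference-image
    and modification-text embeddings are combined with those weights.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Maps the concatenated (image, text) embeddings to two scalar weights.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 2),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: (batch, dim) embeddings of the reference image
        # and the modification text.
        weights = F.softmax(self.gate(torch.cat([img_emb, txt_emb], dim=-1)), dim=-1)
        w_img, w_txt = weights[:, 0:1], weights[:, 1:2]
        fused = w_img * img_emb + w_txt * txt_emb
        # Normalize so the fused query can be matched against unit-norm
        # candidate image embeddings with a dot product.
        return F.normalize(fused, dim=-1)


if __name__ == "__main__":
    fusion = AdaptiveWeightedFusion(dim=512)
    img = torch.randn(4, 512)
    txt = torch.randn(4, 512)
    print(fusion(img, txt).shape)  # torch.Size([4, 512])
```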
