Paper Title

Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

Authors

Abhra Chaudhuri, Massimiliano Mancini, Yanbei Chen, Zeynep Akata, Anjan Dutta

Abstract

Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information. As instances from different modalities can often provide complementary information describing the underlying concept, we propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it. Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities. We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation. Such encoders can then be applied to downstream tasks like cross-modal retrieval. We demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art results on three fine-grained sketch-based image retrieval benchmarks: Shoe-V2, Chair-V2 and Sketchy. Implementation is available at https://github.com/abhrac/xmodal-vit.
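To make the two stages described in the abstract more concrete, below is a minimal PyTorch sketch of (1) cross-attention fusion of paired photo and sketch token embeddings into a single "teacher" representation, and (2) contrastive and relational knowledge-distillation losses that train unimodal "student" embeddings to mimic it. Module names, dimensions, loss forms, and weights are illustrative assumptions for exposition only, not the authors' XModalViT implementation (see the linked repository for that).

```python
# Minimal sketch, assuming ViT patch embeddings are already available for each modality.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Teacher: fuses photo and sketch token sequences via cross-attention."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.photo_to_sketch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sketch_to_photo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, photo_tokens, sketch_tokens):
        # photo_tokens, sketch_tokens: (B, N, dim) patch embeddings from a ViT backbone.
        p, _ = self.photo_to_sketch(photo_tokens, sketch_tokens, sketch_tokens)
        s, _ = self.sketch_to_photo(sketch_tokens, photo_tokens, photo_tokens)
        fused = torch.cat([p.mean(dim=1), s.mean(dim=1)], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)  # (B, dim) fused representation


def contrastive_kd(student, teacher, temperature=0.07):
    """InfoNCE-style loss: pull each student embedding toward its own fused
    teacher embedding and away from the other items in the batch."""
    logits = student @ teacher.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(student.size(0), device=student.device)
    return F.cross_entropy(logits, targets)


def relational_kd(student, teacher):
    """Relational loss: match the pairwise similarity structure of the batches."""
    return F.mse_loss(student @ student.t(), teacher @ teacher.t())


if __name__ == "__main__":
    B, N, D = 8, 16, 256
    fusion = CrossAttentionFusion(dim=D)
    # Stand-ins for ViT patch embeddings of paired photos and sketches.
    photo_tokens = torch.randn(B, N, D)
    sketch_tokens = torch.randn(B, N, D)
    with torch.no_grad():
        fused = fusion(photo_tokens, sketch_tokens)       # teacher targets

    # Unimodal student embeddings (random stand-ins for the student encoders' outputs).
    photo_student = F.normalize(torch.randn(B, D), dim=-1)
    sketch_student = F.normalize(torch.randn(B, D), dim=-1)

    loss = (contrastive_kd(photo_student, fused) + contrastive_kd(sketch_student, fused)
            + relational_kd(photo_student, fused) + relational_kd(sketch_student, fused))
    print(float(loss))
```

After distillation, only the unimodal student encoders are needed at retrieval time: a query sketch is embedded by the sketch encoder and matched against photo embeddings by cosine similarity, so no paired input is required for downstream cross-modal retrieval.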
