Paper Title

CLIP-Driven Fine-grained Text-Image Person Re-identification

Paper Authors

Shuanglin Yan, Neng Dong, Liyan Zhang, Jinhui Tang

Paper Abstract

Text-to-image person re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondences. Moreover, due to the substantial gap between modalities, existing methods embed the original modal features into the same latent space for cross-modal alignment; however, such feature embedding may distort intra-modal information. Recently, CLIP has attracted extensive attention from researchers for its powerful semantic concept learning capacity and rich multi-modal knowledge, which can help us solve the above problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we perform fine-grained information excavation to mine intra-modal discriminative clues and inter-modal correspondences. Specifically, we first design a multi-grained global feature learning module to fully mine intra-modal discriminative local information, which emphasizes identity-related discriminative clues by enhancing the interactions between the global image (text) and informative local patches (words). Secondly, cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules are proposed to establish cross-grained and fine-grained interactions between modalities, which filter out non-modality-shared image patches/words and mine cross-modal correspondences from coarse to fine. CFR and FCD are removed during inference to save computational cost. Note that the above process is performed in the original modality space without further feature embedding. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method on TIReID.
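Since the abstract notes that CFR and FCD are removed at inference, retrieval at test time reduces to ranking gallery images by the similarity between global CLIP image and text features. The sketch below illustrates that generic global-feature matching step only; it assumes OpenAI's open-source `clip` package, the image paths and text query are placeholders, and it is not the authors' CFine implementation.

```python
# Minimal sketch of CLIP-based text-to-image person retrieval at inference
# time. Assumes OpenAI's `clip` package (pip install
# git+https://github.com/openai/CLIP.git); not the authors' CFine code.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Candidate gallery images and a text query (paths/caption are placeholders).
image_paths = ["person_001.jpg", "person_002.jpg"]
query = "a man wearing a black jacket and blue jeans"

with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    image_feats = model.encode_image(images)                           # (N, D) global image features
    text_feats = model.encode_text(clip.tokenize([query]).to(device))  # (1, D) global text feature

    # L2-normalize so the dot product equals cosine similarity.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    # Rank gallery images by similarity to the text query (highest first).
    scores = (text_feats @ image_feats.T).squeeze(0)  # (N,)
    ranking = scores.argsort(descending=True)

print([image_paths[i] for i in ranking])
```

CFine's contribution lies in how the backbone features are trained (multi-grained global feature learning, CFR, and FCD), while this inference-time ranking stays cheap because those modules are dropped after training.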
