Paper Title
Look Before You Leap: Improving Text-based Person Retrieval by Learning A Consistent Cross-modal Common Manifold
Paper Authors
Paper Abstract
The core problem of text-based person retrieval is how to bridge the heterogeneous gap between multi-modal data. Many previous approaches contrive to learn a latent common manifold mapping paradigm in a \textbf{cross-modal distribution consensus prediction (CDCP)} manner. When mapping features from the distribution of one modality into the common manifold, the feature distribution of the opposite modality is completely invisible. That is to say, how to achieve a cross-modal distribution consensus, so as to embed and align the multi-modal features in a constructed cross-modal common manifold, depends entirely on the experience of the model itself rather than on the actual situation. With such methods, it is inevitable that the multi-modal data cannot be well aligned in the common manifold, which ultimately leads to sub-optimal retrieval performance. To overcome this \textbf{CDCP dilemma}, we propose a novel algorithm termed LBUL to learn a Consistent Cross-modal Common Manifold (C$^{3}$M) for text-based person retrieval. The core idea of our method, as a Chinese saying goes, is `\textit{san si er hou xing}', namely, to \textbf{Look Before yoU Leap (LBUL)}. The common manifold mapping mechanism of LBUL contains a looking step and a leaping step. Compared to CDCP-based methods, LBUL considers the distribution characteristics of both the visual and textual modalities before embedding data from one modality into C$^{3}$M, thereby achieving a more solid cross-modal distribution consensus and hence superior retrieval accuracy. We evaluate our proposed method on two text-based person retrieval datasets, CUHK-PEDES and RSTPReid. Experimental results demonstrate that the proposed LBUL outperforms previous methods and achieves state-of-the-art performance.
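
To make the "look then leap" idea described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of a mapper that first "looks" at summary statistics of both modalities' feature distributions and then "leaps" by embedding features into a shared space. The module names, feature dimensions, and the way the looking statistics condition the leaping projection are assumptions made for illustration only; they are not the authors' actual implementation.

    # Hypothetical sketch of a look-then-leap mapping module (not the paper's code).
    import torch
    import torch.nn as nn


    class LookBeforeYouLeapMapper(nn.Module):
        """Maps one modality's features into a shared space (C^3M) after first
        'looking' at distribution statistics of BOTH modalities (assumed design)."""

        def __init__(self, feat_dim: int = 512, common_dim: int = 256):
            super().__init__()
            # Looking step: summarize joint (visual + textual) batch statistics.
            self.look = nn.Sequential(
                nn.Linear(4 * feat_dim, feat_dim),  # input = [mean_v, std_v, mean_t, std_t]
                nn.ReLU(inplace=True),
                nn.Linear(feat_dim, feat_dim),
            )
            # Leaping step: embed features into the common space, conditioned
            # on the cross-modal context produced by the looking step.
            self.leap = nn.Sequential(
                nn.Linear(2 * feat_dim, common_dim),
                nn.ReLU(inplace=True),
                nn.Linear(common_dim, common_dim),
            )

        def forward(self, x: torch.Tensor, visual_feats: torch.Tensor,
                    textual_feats: torch.Tensor) -> torch.Tensor:
            # Looking: distribution characteristics of both modalities.
            stats = torch.cat([
                visual_feats.mean(0), visual_feats.std(0),
                textual_feats.mean(0), textual_feats.std(0),
            ], dim=-1)
            context = self.look(stats)                        # (feat_dim,)
            context = context.unsqueeze(0).expand(x.size(0), -1)
            # Leaping: embed x into the common space given the cross-modal context.
            return self.leap(torch.cat([x, context], dim=-1))


    if __name__ == "__main__":
        mapper = LookBeforeYouLeapMapper()
        v = torch.randn(8, 512)      # batch of visual features (assumed dimensions)
        t = torch.randn(8, 512)      # batch of textual features
        z_v = mapper(v, v, t)        # visual features mapped into the common space
        z_t = mapper(t, v, t)        # textual features mapped into the common space
        print(z_v.shape, z_t.shape)  # torch.Size([8, 256]) torch.Size([8, 256])

The point of the sketch is only that, unlike a CDCP-style mapper that sees a single modality, the embedding of either modality here is conditioned on statistics drawn from both feature distributions before the mapping is applied.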