Revising Image-Text Retrieval via Multi-Modal Entailment

Paper Title

Revising Image-Text Retrieval via Multi-Modal Entailment

Authors

Xu Yan, Chunhui Ai, Ziqiang Cao, Min Cao, Sujian Li, Wenjie Li, Guohong Fu

Abstract

A strong image-text retrieval model depends on high-quality labeled data. Although the builders of existing image-text retrieval datasets strive to ensure that each caption matches its linked image, they cannot prevent a caption from also fitting other images. We observe that this many-to-many matching phenomenon is quite common in widely used retrieval datasets, where one caption can describe up to 178 images. The large number of missing matches not only confuses the model during training but also weakens evaluation accuracy. Inspired by visual and textual entailment tasks, we propose a multi-modal entailment classifier to determine whether a sentence is entailed by an image plus its linked captions. We then revise the image-text retrieval datasets by adding these entailed captions as additional weak labels of an image, and develop a universal variable learning rate strategy to teach a retrieval model to distinguish the entailed captions from other negative samples. In experiments, we manually annotate an entailment-corrected image-text retrieval dataset for evaluation. The results demonstrate that the proposed entailment classifier achieves about 78% accuracy and consistently improves the performance of image-text retrieval baselines.
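
To make the evaluation concern concrete, the following is a minimal illustrative sketch, not the authors' code. It shows how a Recall@K score can understate retrieval quality when a caption fits an image that the dataset does not label as a match. The function `recall_at_k`, the toy identifiers, and the assumption that an entailment check has already flagged `img_2` as a valid match for `cap_a` are all hypothetical.

```python
def recall_at_k(rankings, positives, k):
    """Fraction of queries with at least one labeled positive in the top-k results."""
    hits = 0
    for query_id, ranked in rankings.items():
        if any(item in positives[query_id] for item in ranked[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical toy data: each caption query maps to a ranked list of image ids.
rankings = {
    "cap_a": ["img_2", "img_1", "img_3"],
    "cap_b": ["img_3", "img_2", "img_1"],
}

# Original dataset labels: one linked image per caption.
original_labels = {"cap_a": {"img_1"}, "cap_b": {"img_3"}}

# Assumption: an entailment check decided that cap_a also describes img_2,
# so img_2 is added as an extra (weak) positive for cap_a.
revised_labels = {"cap_a": {"img_1", "img_2"}, "cap_b": {"img_3"}}

r1_original = recall_at_k(rankings, original_labels, k=1)
r1_revised = recall_at_k(rankings, revised_labels, k=1)
print(r1_original, r1_revised)
```

With the original labels, the top result for `cap_a` is counted as an error even though it is a plausible match; with the revised labels it is counted as correct. This covers only the evaluation side and does not model the classifier or the variable learning rate strategy described in the abstract.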
