Paper Title
Reducing Predictive Feature Suppression in Resource-Constrained Contrastive Image-Caption Retrieval
Paper Authors
Paper Abstract
To train image-caption retrieval (ICR) methods, contrastive loss functions are a common choice of optimization objective. Unfortunately, contrastive ICR methods are vulnerable to predictive feature suppression. Predictive features are features that correctly indicate the similarity between a query and a candidate item. However, in the presence of multiple predictive features during training, encoder models tend to suppress redundant predictive features, since these features are not needed to learn to discriminate between positive and negative pairs. While some predictive features are redundant during training, these features might be relevant during evaluation. We introduce an approach to reduce predictive feature suppression for resource-constrained ICR methods: latent target decoding (LTD). We add an additional decoder to the contrastive ICR framework to reconstruct the input caption in the latent space of a general-purpose sentence encoder, which prevents the image and caption encoders from suppressing predictive features. We implement the LTD objective as an optimization constraint, ensuring that the reconstruction loss stays below a bound value while the contrastive loss remains the primary optimization target. Importantly, LTD does not depend on additional training data or expensive (hard) negative mining strategies. Our experiments show that, unlike reconstructing the input caption in the input space, LTD reduces predictive feature suppression, as measured by higher recall@k, r-precision, and nDCG scores than a contrastive ICR baseline. Moreover, we show that LTD should be implemented as an optimization constraint rather than as a dual optimization objective. Finally, we show that LTD can be combined with different contrastive learning losses and a wide variety of resource-constrained ICR methods.
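To make the constrained objective described in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. It pairs a symmetric InfoNCE contrastive loss with a latent reconstruction term that is kept below a bound via a Lagrange-multiplier update. The names (`image_enc`, `caption_enc`, `decoder`, `target_enc`, `bound`, `lam_lr`), the cosine-distance reconstruction loss, and the specific multiplier update rule are illustrative assumptions; the paper's exact scheme may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(img_emb, cap_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-caption pairs."""
    logits = img_emb @ cap_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def training_step(images, captions, image_enc, caption_enc, decoder,
                  target_enc, lam, bound=0.2, lam_lr=1e-2):
    """One LTD-style constrained step; all module names are hypothetical."""
    img_emb = F.normalize(image_enc(images), dim=-1)
    cap_emb = F.normalize(caption_enc(captions), dim=-1)
    l_con = info_nce(img_emb, cap_emb)

    # Reconstruct the caption in the latent space of a frozen
    # general-purpose sentence encoder, not in the token (input) space.
    with torch.no_grad():
        target = target_enc(captions)      # latent reconstruction target
    recon = decoder(cap_emb)
    l_ltd = 1.0 - F.cosine_similarity(recon, target, dim=-1).mean()

    # Constrained objective: primarily minimize the contrastive loss
    # while keeping l_ltd below `bound`. `lam` is a (non-learned)
    # Lagrange multiplier: it grows while the constraint is violated
    # and decays toward zero once it is satisfied.
    loss = l_con + lam * (l_ltd - bound)
    with torch.no_grad():
        lam = torch.clamp(lam + lam_lr * (l_ltd - bound), min=0.0)
    return loss, lam
```

This sketch illustrates the design point the abstract stresses: unlike a dual objective with a fixed weight, the reconstruction term effectively drops out of the gradient once the constraint is satisfied (the multiplier decays toward zero), so training primarily targets the contrastive loss.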