Paper Title

UniCLIP: Unified Framework for Contrastive Language-Image Pre-training

Paper Authors

Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim, Honglak Lee, Junmo Kim

Paper Abstract

Pre-training vision-language models with contrastive objectives has shown promising results that are both scalable to large uncurated datasets and transferable to many downstream applications. Some follow-up works have aimed to improve data efficiency by adding self-supervision terms, but the inter-domain (image-text) contrastive loss and intra-domain (image-image) contrastive loss are defined on separate spaces in those works, so many feasible combinations of supervision are overlooked. To overcome this issue, we propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training. UniCLIP integrates the contrastive losses of both inter-domain pairs and intra-domain pairs into a single universal space. The discrepancies that arise when integrating contrastive losses across different domains are resolved by the three key components of UniCLIP: (1) augmentation-aware feature embedding, (2) the MP-NCE loss, and (3) a domain-dependent similarity measure. UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks. In our experiments, we show that each component of UniCLIP contributes substantially to the final performance.
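To make the core idea concrete, below is a minimal, hypothetical sketch of a unified contrastive objective in the spirit of the abstract: image and text embeddings share one universal space, and both image-text (inter-domain) and image-image (intra-domain) pairs supervise the same encoders. This is not the paper's actual MP-NCE loss or augmentation-aware embedding; a plain InfoNCE form with per-domain temperatures (`t_inter`, `t_intra`) stands in for the domain-dependent similarity measure, and the function name and defaults are assumptions for illustration.

```python
# Hypothetical sketch of a unified inter-/intra-domain contrastive loss.
# Assumes PyTorch and L2-normalized embeddings; not the paper's exact MP-NCE.
import torch
import torch.nn.functional as F

def unified_contrastive_loss(img_emb, img_emb_aug, txt_emb,
                             t_inter=0.07, t_intra=0.1):
    """img_emb, img_emb_aug: (N, D) embeddings of two augmented image views;
    txt_emb: (N, D) text embeddings, all living in the same shared space."""
    n = img_emb.size(0)
    labels = torch.arange(n, device=img_emb.device)  # i-th row pairs with i-th column

    # Inter-domain (image-text) similarities, scaled by one temperature ...
    logits_it = img_emb @ txt_emb.t() / t_inter
    # ... and intra-domain (image-image) similarities scaled by another,
    # standing in for UniCLIP's domain-dependent similarity measure.
    logits_ii = img_emb @ img_emb_aug.t() / t_intra

    # Each anchor image has a positive in both domains (its caption and its
    # other augmented view); both InfoNCE terms act on the same embeddings,
    # so the supervision is combined in a single universal space.
    loss_it = F.cross_entropy(logits_it, labels)
    loss_ii = F.cross_entropy(logits_ii, labels)
    return 0.5 * (loss_it + loss_ii)
```

Because both terms operate on the same embedding space, the combinations of supervision that separate-space methods overlook (e.g., an augmented image view against a caption) become expressible; the per-domain temperatures here are only a crude proxy for how UniCLIP reconciles the differing similarity statistics of the two domains.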
