Paper Title


DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning

Authors

Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Wen Zhang, Yin Fang, Jeff Z. Pan, Huajun Chen

Abstract


Zero-shot learning (ZSL) aims to predict unseen classes whose samples never appear during training. Among the most effective and widely used forms of semantic information for zero-shot image classification are attributes, which are annotations of class-level visual characteristics. However, current methods often fail to discriminate subtle visual distinctions between images, due not only to the shortage of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy for considering multi-modal objectives. We find that DUET achieves state-of-the-art performance on three standard ZSL benchmarks and a knowledge-graph-equipped ZSL benchmark. Its components are effective and its predictions are interpretable.
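The attribute-level contrastive objective mentioned in point (2) can be illustrated with a minimal InfoNCE-style sketch. This is not the paper's actual implementation (DUET's loss details are not given in the abstract); the function name, the numpy formulation, and the temperature value are all illustrative assumptions. The idea shown is the general one: pull each visual feature toward the embedding of its grounded attribute and push it away from the other attributes, which counteracts attribute co-occurrence and imbalance.

```python
import numpy as np

def attribute_contrastive_loss(visual_feats, attr_embeds, pos_idx, temperature=0.1):
    """Illustrative InfoNCE-style loss over attributes (not DUET's exact loss).

    visual_feats: (N, D) visual features, one per image region/sample.
    attr_embeds:  (K, D) attribute embeddings (e.g. from a PLM).
    pos_idx:      (N,) index of the ground-truth attribute for each feature.
    """
    # L2-normalize so dot products become cosine similarities
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    a = attr_embeds / np.linalg.norm(attr_embeds, axis=1, keepdims=True)
    logits = v @ a.T / temperature                   # (N, K) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    # log-softmax over all K attributes; the positive is the matched attribute
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(pos_idx)), pos_idx].mean()

# Toy usage: three features perfectly aligned with three attribute embeddings
v = np.eye(3)
a = np.eye(3)
print(attribute_contrastive_loss(v, a, np.array([0, 1, 2])))  # small loss
```

The loss is minimized when each visual feature is most similar to its own attribute embedding; mismatched pairings yield a larger loss, which is the discriminative pressure the abstract describes.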
