Paper Title

Learning to Compose Diversified Prompts for Image Emotion Classification

Authors

Sinuo Deng, Lifang Wu, Ge Shi, Lehao Xing, Meng Jian, Ye Xiang

Abstract


Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models. Although CLIP has recently shown its superior power on a wide range of downstream vision-language tasks like Visual Question Answering, it is still underexplored for Image Emotion Classification (IEC). Adapting CLIP to the IEC task faces significant challenges: a tremendous training-objective gap between pretraining and IEC, and shared, suboptimal, invariant prompts for all instances. In this paper, we propose a general framework that shows how CLIP can be effectively applied to IEC. We first introduce a prompt tuning method that mimics the pretraining objective of CLIP and thus can leverage the rich image and text semantics entailed in CLIP. Then we automatically compose instance-specific prompts by conditioning them on the categories and image contents of instances, diversifying prompts and avoiding the suboptimality problem. Evaluations on six widely used affective datasets demonstrate that our proposed method outperforms the state-of-the-art methods by a large margin (i.e., up to 9.29% accuracy gain on the EmotionROI dataset) on IEC tasks, with only a few parameters trained. Our code will be publicly available for research purposes.
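The abstract describes composing instance-specific prompts by conditioning them on class categories and image content, then matching those prompts against the image embedding CLIP-style (cosine similarity followed by a softmax over classes). A minimal NumPy sketch of that idea follows; the function names, the simple additive/linear conditioning, and the toy dimensions are all illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def compose_prompts(class_embs, shared_ctx, image_feat, proj):
    """Compose one prompt embedding per emotion class, conditioned on
    both the class embedding and the image content (hypothetical scheme)."""
    # Instance-specific context: a shared learnable context vector plus a
    # projection of the image feature, so prompts differ per image.
    inst_ctx = shared_ctx + image_feat @ proj
    return class_embs + inst_ctx  # broadcast: (C, D) + (D,) -> (C, D)

def classify(image_feat, prompt_embs, temperature=0.07):
    """CLIP-style classification: cosine similarity between the image
    embedding and each class-prompt embedding, softmax over classes."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# Toy setup: 6 emotion classes in an 8-dim joint embedding space.
C, D = 6, 8
class_embs = rng.normal(size=(C, D))   # stand-ins for frozen class-name embeddings
shared_ctx = rng.normal(size=D)        # learnable shared prompt context
proj = rng.normal(size=(D, D)) * 0.1   # learnable image-to-context projection
image_feat = rng.normal(size=D)        # stand-in for a frozen CLIP image embedding

prompts = compose_prompts(class_embs, shared_ctx, image_feat, proj)
probs = classify(image_feat, prompts)
print(probs.shape, float(probs.sum()))
```

In a real system the class embeddings and image features would come from CLIP's frozen text and image encoders, and only the context parameters (here `shared_ctx` and `proj`) would be trained, which matches the abstract's claim of "only a few parameters trained".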
