Paper Title

Adaptive Fine-Grained Predicates Learning for Scene Graph Generation

Paper Authors

Xinyu Lyu, Lianli Gao, Pengpeng Zeng, Heng Tao Shen, Jingkuan Song

Paper Abstract

The performance of current Scene Graph Generation (SGG) models is severely hampered by hard-to-distinguish predicates, e.g., woman-on/standing on/walking on-beach. Since general SGG models tend to predict head predicates while re-balancing strategies favor tail categories, neither can appropriately handle hard-to-distinguish predicates. To tackle this issue, and inspired by fine-grained image classification, which focuses on differentiating hard-to-distinguish objects, we propose Adaptive Fine-Grained Predicates Learning (FGPL-A), which aims to differentiate hard-to-distinguish predicates for SGG. First, we introduce an Adaptive Predicate Lattice (PL-A) to identify hard-to-distinguish predicates; it adaptively explores predicate correlations in step with the model's dynamic learning pace. In practice, PL-A is initialized from the SGG dataset and refined using the model's predictions on the current mini-batch. Utilizing PL-A, we propose an Adaptive Category Discriminating Loss (CDL-A) and an Adaptive Entity Discriminating Loss (EDL-A), which progressively regularize the model's discriminating process with fine-grained supervision attuned to its dynamic learning status, ensuring a balanced and efficient learning process. Extensive experimental results show that our proposed model-agnostic strategy significantly boosts the performance of benchmark models on the VG-SGG and GQA-SGG datasets, by up to 175% and 76% on Mean Recall@100, achieving new state-of-the-art performance. Moreover, experiments on Sentence-to-Graph Retrieval and Image Captioning tasks further demonstrate the practicability of our method.
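The abstract's description of PL-A, initialized from dataset statistics and then refined with the model's mini-batch predictions, suggests a simple maintenance loop. Below is a minimal, hypothetical PyTorch sketch of such an adaptive predicate-correlation structure. The class name, the exponential-moving-average update rule, and all parameters are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch

class AdaptivePredicateLattice:
    """Hypothetical sketch in the spirit of PL-A; not the authors' code."""

    def __init__(self, dataset_correlation: torch.Tensor, momentum: float = 0.9):
        # Initialize from dataset-level statistics, e.g. a [P, P] matrix of
        # predicate co-occurrence or confusion counts, row-normalized.
        self.momentum = momentum
        self.corr = dataset_correlation / dataset_correlation.sum(
            dim=1, keepdim=True).clamp(min=1e-8)

    @torch.no_grad()
    def refine(self, logits: torch.Tensor, labels: torch.Tensor) -> None:
        # For each ground-truth predicate seen in the mini-batch, average the
        # model's predicted distribution, then fold it into the lattice with
        # an exponential moving average (an assumed update rule) so that the
        # correlations track the model's current learning state.
        probs = logits.softmax(dim=-1)                       # [B, P]
        batch_corr = torch.zeros_like(self.corr)
        counts = torch.zeros(self.corr.size(0), device=self.corr.device)
        batch_corr.index_add_(0, labels, probs)
        counts.index_add_(0, labels,
                          torch.ones_like(labels, dtype=counts.dtype))
        seen = counts > 0
        batch_corr[seen] /= counts[seen].unsqueeze(1)
        self.corr[seen] = (self.momentum * self.corr[seen]
                           + (1 - self.momentum) * batch_corr[seen])

    def hard_to_distinguish(self, predicate: int, k: int = 5) -> torch.Tensor:
        # Predicates most confusable with `predicate` under the current lattice.
        row = self.corr[predicate].clone()
        row[predicate] = -1.0                                # exclude self
        return row.topk(k).indices

# Toy usage: 5 predicate classes, uniform prior, one refinement step.
lattice = AdaptivePredicateLattice(torch.ones(5, 5))
lattice.refine(torch.randn(8, 5), torch.randint(0, 5, (8,)))
print(lattice.hard_to_distinguish(predicate=2, k=2))
```

Under this reading, the rows of the correlation matrix could serve as per-category weights for losses in the spirit of CDL-A/EDL-A, concentrating the training signal on the predicates the model currently confuses.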
