Paper Title

GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

Authors

Yue Liao, Aixi Zhang, Miao Lu, Yongliang Wang, Xiaobo Li, Si Liu

Abstract

The task of Human-Object Interaction~(HOI) detection can be divided into two core problems, i.e., human-object association and interaction understanding. In this paper, we reveal and address the disadvantages of conventional query-driven HOI detectors from these two aspects. For the association, previous two-branch methods suffer from complex and costly post-matching, while single-branch methods ignore the distinct features required by the different tasks. We propose the Guided-Embedding Network~(GEN) to attain a two-branch pipeline without post-matching. In GEN, we design an instance decoder to detect humans and objects with two independent query sets and a position Guided Embedding~(p-GE) to mark the human and object at the same position as a pair. Besides, we design an interaction decoder to classify interactions, where the interaction queries are made of instance Guided Embeddings (i-GE) generated from the outputs of each instance decoder layer. For interaction understanding, previous methods suffer from the long-tailed distribution and zero-shot discovery. This paper proposes a Visual-Linguistic Knowledge Transfer (VLKT) training strategy to enhance interaction understanding by transferring knowledge from the visual-linguistic pre-trained model CLIP. Specifically, we extract text embeddings for all labels with CLIP to initialize the classifier and adopt a mimic loss to minimize the visual feature distance between GEN and CLIP. As a result, GEN-VLKT outperforms the state of the art by large margins on multiple datasets, e.g., +5.05 mAP on HICO-Det. The source codes are available at https://github.com/YueLiao/gen-vlkt.
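As a rough illustration of the VLKT idea described in the abstract, the sketch below shows how a linear classifier can be initialized from CLIP text embeddings of the label names, and how a mimic loss can pull a detector's pooled visual feature toward CLIP's image feature. It assumes PyTorch and the openai/CLIP package; the prompt template, the 512-d feature size, and the choice of an L1 mimic loss are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the VLKT training strategy, assuming PyTorch and
# the openai/CLIP package (https://github.com/openai/CLIP).
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip


def build_text_initialized_classifier(label_names, device="cpu"):
    """Initialize a linear interaction classifier from CLIP text embeddings of the labels."""
    clip_model, _ = clip.load("ViT-B/32", device=device)
    # Hypothetical prompt template; the exact wording is an assumption.
    prompts = [f"a photo of a person {name}" for name in label_names]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        text_emb = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)  # [C, 512]
    classifier = nn.Linear(text_emb.shape[1], len(label_names), bias=False)
    classifier.weight.data.copy_(text_emb)  # classifier weights start as label text embeddings
    return clip_model, classifier


def mimic_loss(detector_feat, clip_images, clip_model):
    """L1 distance between the detector's pooled visual feature and CLIP's image feature."""
    with torch.no_grad():
        clip_feat = F.normalize(clip_model.encode_image(clip_images).float(), dim=-1)
    return F.l1_loss(F.normalize(detector_feat, dim=-1), clip_feat)
```

In training, the mimic loss would simply be added to the usual detection losses; `clip_images` here are the input images resized and normalized with CLIP's own preprocessing, and `detector_feat` is assumed to be a global visual feature of the detector projected to CLIP's embedding size.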
