Paper Title
End-to-End Human-Gaze-Target Detection with Transformers
Paper Authors
Paper Abstract
In this paper, we propose an effective and efficient method for Human-Gaze-Target (HGT) detection, i.e., gaze following. Current approaches decouple the HGT detection task into separate branches of salient object detection and human gaze prediction, employing a two-stage framework in which human head locations must first be detected and then fed into a subsequent gaze target prediction sub-network. In contrast, we redefine the HGT detection task as detecting human head locations and their gaze targets simultaneously. In this way, our method, named Human-Gaze-Target detection TRansformer or HGTTR, streamlines the HGT detection pipeline by eliminating all additional components. HGTTR reasons about the relations between salient objects and human gaze from the global image context. Moreover, unlike existing two-stage methods, which require human head locations as input and can predict only one person's gaze target at a time, HGTTR directly predicts the locations of all people and their gaze targets in a single pass, in an end-to-end manner. The effectiveness and robustness of our proposed method are verified with extensive experiments on two standard benchmark datasets, GazeFollowing and VideoAttentionTarget. Without bells and whistles, HGTTR outperforms existing state-of-the-art methods by large margins (a 6.4 mAP gain on GazeFollowing and a 10.3 mAP gain on VideoAttentionTarget) with a much simpler architecture.
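The key reformulation in the abstract, replacing a two-stage pipeline (head detector, then per-person gaze network) with a single model that emits a set of (head box, gaze target) pairs for all people at once, can be illustrated with a minimal sketch. The class and function names below are illustrative assumptions about a DETR-style output format, not the authors' actual interface:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HGTPrediction:
    """Hypothetical output of one transformer query in set prediction:
    a head bounding box, a gaze-target point, and a confidence score."""
    head_box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized
    gaze_point: Tuple[float, float]              # (x, y), normalized image coords
    score: float                                 # detection confidence

def filter_predictions(preds: List[HGTPrediction],
                       threshold: float = 0.5) -> List[HGTPrediction]:
    """Keep confident (head, gaze-target) pairs. Unlike two-stage methods,
    no externally supplied head locations are needed: each kept prediction
    already carries both the head box and its gaze target."""
    return [p for p in preds if p.score >= threshold]

# Mock raw outputs for an image containing two people; low-score queries
# correspond to "no object", as in DETR-style detectors.
raw = [
    HGTPrediction((0.10, 0.05, 0.25, 0.30), (0.60, 0.55), 0.92),
    HGTPrediction((0.55, 0.10, 0.70, 0.35), (0.20, 0.60), 0.88),
    HGTPrediction((0.00, 0.00, 0.01, 0.01), (0.50, 0.50), 0.03),
]
print(len(filter_predictions(raw)))  # → 2
```

The point of the sketch is only the interface: every person's head and gaze target come out of one forward pass, rather than one gaze query per detected head.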