Paper Title
GID-Net: Detecting Human-Object Interaction with Global and Instance Dependency
Paper Authors
Paper Abstract
Since detecting and recognizing an individual human or object is not adequate for understanding the visual world, learning how humans interact with surrounding objects has become a core technology. However, convolution operations are weak at depicting visual interactions between instances, since they only build blocks that process one local neighborhood at a time. To address this problem, we learn from how humans perceive human-object interactions (HOIs) and introduce a two-stage trainable reasoning mechanism, referred to as the GID block. The GID block breaks through local neighborhoods and captures long-range pixel dependencies at both the global level and the instance level of the scene to help detect interactions between instances. Furthermore, we construct a multi-stream network called GID-Net, a human-object interaction detection framework consisting of a human branch, an object branch and an interaction branch. Semantic information at the global level and the local level is efficiently reasoned about and aggregated in each branch. We have compared the proposed GID-Net with existing state-of-the-art methods on two public benchmarks, V-COCO and HICO-DET. The results show that GID-Net outperforms the existing best-performing methods on both benchmarks, validating its efficacy in detecting human-object interactions.
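The abstract frames the GID block as a trainable module that captures long-range pixel dependencies beyond a convolution's local neighborhood. Below is a minimal PyTorch sketch of such a long-range dependency block in a non-local-attention style; it is not the paper's exact two-stage GID design, and the class name, layer choices and hyperparameters are hypothetical illustrations.

```python
# A minimal sketch of a long-range dependency block in the spirit of the GID block,
# assuming a non-local-attention-style formulation; all names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongRangeDependencyBlock(nn.Module):
    """For each query position, aggregates features from all spatial positions,
    breaking the local-neighborhood limit of plain convolutions."""

    def __init__(self, in_channels, reduced_channels=None):
        super().__init__()
        reduced = reduced_channels or in_channels // 2
        self.query = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.out = nn.Conv2d(reduced, in_channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(x).flatten(2)                     # (B, C', HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # pairwise affinities over all positions
        context = torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(context)                   # residual connection keeps the block trainable end-to-end


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)                  # toy feature map
    block = LongRangeDependencyBlock(64)
    print(block(feat).shape)                           # torch.Size([2, 64, 32, 32])
```

In a multi-stream setup such as the described GID-Net, one such block could be applied per branch (human, object, interaction) to its cropped or pooled features, with the branch outputs fused for the final interaction score; the exact aggregation scheme is specified in the paper, not here.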