DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation

Paper Title

DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation

Authors

Shuaitao Zhao, Kun Liu, Yuhang Huang, Qian Bao, Dan Zeng, Wu Liu

Abstract

Human pose estimation aims to locate the keypoints of all people in different scenes. Despite promising results, current approaches still face some challenges. Existing top-down methods process each person individually, without modeling the interaction between different people and the scene they are situated in. Consequently, the performance of human detection degrades when serious occlusion happens. Existing bottom-up methods, on the other hand, consider all people at the same time and capture the global knowledge of the entire image. However, they are less accurate than top-down methods due to scale variation. To address these problems, we propose a novel Dual-Pipeline Integrated Transformer (DPIT) that integrates the top-down and bottom-up pipelines to explore the visual clues of different receptive fields and achieve their complementarity. Specifically, DPIT consists of two branches: the bottom-up branch processes the whole image to capture global visual information, while the top-down branch extracts a local feature representation from the single-human bounding box. The feature representations extracted from the bottom-up and top-down branches are then fed into the transformer encoder to fuse the global and local knowledge interactively. Moreover, we define keypoint queries to explore both full-scene and single-human posture visual clues and thereby realize the mutual complementarity of the two pipelines. To the best of our knowledge, this is one of the first works to integrate the bottom-up and top-down pipelines with transformers for human pose estimation. Extensive experiments on the COCO and MPII datasets demonstrate that DPIT achieves performance comparable to state-of-the-art methods.
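
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single self-attention layer with a residual connection (no layer normalization, feed-forward block, or positional embeddings), uses random arrays in place of real backbone features, and every name, token count, and dimension below is hypothetical. It only shows how keypoint queries can attend jointly to global (bottom-up) and local (top-down) tokens once all three are concatenated into one sequence.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def fuse(keypoint_queries, global_tokens, local_tokens, weights):
    """Concatenate queries with both branches' tokens and run one attention layer.

    Hypothetical helper: returns only the updated keypoint-query tokens, which
    have now attended to both the full-image and the per-person features.
    """
    tokens = np.concatenate([keypoint_queries, global_tokens, local_tokens], axis=0)
    fused = tokens + self_attention(tokens, *weights)  # residual connection
    return fused[: keypoint_queries.shape[0]]


rng = np.random.default_rng(0)
dim = 32                 # assumed embedding size
num_keypoints = 17       # COCO annotates 17 body keypoints per person

# Stand-ins for features from the two branches (shapes are assumptions).
global_tokens = rng.normal(size=(64, dim))   # bottom-up: whole image
local_tokens = rng.normal(size=(48, dim))    # top-down: one person crop
keypoint_queries = rng.normal(size=(num_keypoints, dim))

# Random projection matrices in place of learned weights.
weights = tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

fused_queries = fuse(keypoint_queries, global_tokens, local_tokens, weights)
print(fused_queries.shape)  # (17, 32)
```

In a full model, each fused query would typically be decoded into a keypoint location, for example through a heatmap or coordinate regression head; the abstract does not specify that step, so it is left out here.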
