Paper Title

3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection

Authors

Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, Si Liu

Abstract

3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description. Previous methods mostly follow a two-stage paradigm, i.e., language-irrelevant detection and cross-modal matching, which is limited by the isolated architecture. In such a paradigm, the detector needs to sample keypoints from raw point clouds due to the inherent properties of 3D point clouds (irregular and large-scale), to generate the corresponding object proposal for each keypoint. However, sparse proposals may leave out the target in detection, while dense proposals may confuse the matching model. Moreover, the language-irrelevant detection stage can only sample a small proportion of keypoints on the target, deteriorating the target prediction. In this paper, we propose a 3D Single-Stage Referred Point Progressive Selection (3D-SPS) method, which progressively selects keypoints with the guidance of language and directly locates the target. Specifically, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus on the points of language-relevant objects, which are significant clues for grounding. Besides, we devise a Target-oriented Progressive Mining (TPM) module to finely concentrate on the points of the target, which is enabled by progressive intra-modal relation modeling and inter-modal target mining. 3D-SPS bridges the gap between detection and matching in the 3D visual grounding task, localizing the target at a single stage. Experiments demonstrate that 3D-SPS achieves state-of-the-art performance on both ScanRefer and Nr3D/Sr3D datasets.
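The coarse-to-fine idea in the abstract (start from many keypoints, keep those most relevant to the description, then narrow toward the target) can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-similarity scoring, the fixed features, the `schedule` of keep counts, and all function names are assumptions made for illustration. In particular, the abstract describes TPM as refining features through intra-modal relation modeling and inter-modal target mining between selections, which this sketch omits.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_top_k(indices, point_feats, text_feat, k):
    """Keep the k candidate points whose features best match the text feature."""
    ranked = sorted(
        indices,
        key=lambda i: cosine(point_feats[i], text_feat),
        reverse=True,
    )
    return ranked[:k]


def progressive_selection(point_feats, text_feat, schedule):
    """Shrink the keypoint set stage by stage.

    `schedule` is a hypothetical list of keep counts, one per stage.
    Features stay fixed here (an assumption); a real model would update
    them between stages.
    """
    kept = list(range(len(point_feats)))
    history = []
    for k in schedule:
        kept = select_top_k(kept, point_feats, text_feat, min(k, len(kept)))
        history.append(list(kept))
    return kept, history


random.seed(0)
# Toy data: 64 points with 8-dimensional features and one sentence-level text feature.
point_feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(64)]
text_feat = [random.gauss(0, 1) for _ in range(8)]

final, history = progressive_selection(point_feats, text_feat, schedule=[32, 16, 4])
print(len(final), [len(h) for h in history])
```

Each stage selects only from the survivors of the previous one, so the candidate sets are nested. Because the features never change in this sketch, the later stages simply truncate the same ranking; the progressive design pays off only when features are re-estimated between stages, as the abstract describes.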
