3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection

Paper title

3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection

Authors

Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, Si Liu

Abstract

3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description. Previous methods mostly follow a two-stage paradigm, i.e., language-irrelevant detection and cross-modal matching, which is limited by the isolated architecture. In such a paradigm, the detector needs to sample keypoints from raw point clouds due to the inherent properties of 3D point clouds (irregular and large-scale), to generate the corresponding object proposal for each keypoint. However, sparse proposals may leave out the target in detection, while dense proposals may confuse the matching model. Moreover, the language-irrelevant detection stage can only sample a small proportion of keypoints on the target, deteriorating the target prediction. In this paper, we propose a 3D Single-Stage Referred Point Progressive Selection (3D-SPS) method, which progressively selects keypoints with the guidance of language and directly locates the target. Specifically, we propose a Description-aware Keypoint Sampling (DKS) module to coarsely focus on the points of language-relevant objects, which are significant clues for grounding. Besides, we devise a Target-oriented Progressive Mining (TPM) module to finely concentrate on the points of the target, which is enabled by progressive intra-modal relation modeling and inter-modal target mining. 3D-SPS bridges the gap between detection and matching in the 3D visual grounding task, localizing the target at a single stage. Experiments demonstrate that 3D-SPS achieves state-of-the-art performance on both ScanRefer and Nr3D/Sr3D datasets.
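
The coarse-to-fine idea in the abstract (keep fewer and fewer keypoints, ranked by how well they match the description) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the dictionary input, the cosine-similarity score, and the `refine` hook are all assumptions made for illustration. DKS and TPM as described in the abstract are learned modules, and this sketch replaces them with a fixed similarity ranking.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def progressive_select(point_feats, text_feat, schedule, refine=None):
    """Shrink the keypoint set stage by stage, guided by the text feature.

    point_feats: dict of point index -> feature vector (hypothetical input)
    text_feat:   feature vector of the description (hypothetical input)
    schedule:    how many keypoints to keep after each stage, e.g. [3, 1]
    refine:      optional hook called as refine(kept_feats, text_feat) between
                 stages; a placeholder for the relation modeling and target
                 mining the abstract mentions, not a model of it
    Returns one list of kept point indices per stage.
    """
    feats = dict(point_feats)
    stages = []
    for keep in schedule:
        # Rank the remaining points by similarity to the description.
        ranked = sorted(feats, key=lambda i: cosine(feats[i], text_feat), reverse=True)
        kept = ranked[:keep]
        feats = {i: feats[i] for i in kept}
        if refine is not None:
            feats = refine(feats, text_feat)
        stages.append(kept)
    return stages


# Toy data: five 2-D "point features" and one "text feature".
toy_feats = {
    0: [1.0, 0.0],
    1: [1.0, 1.0],
    2: [0.0, 1.0],
    3: [-1.0, 0.0],
    4: [2.0, 1.0],
}
print(progressive_select(toy_feats, [1.0, 0.0], [3, 1]))  # [[0, 4, 1], [0]]
```

With fixed features and no `refine` hook, the staged selection gives the same final result as a single top-k, so the sketch only shows the bookkeeping. The stages matter once features are updated between them, which is the role the abstract assigns to intra-modal relation modeling and inter-modal target mining.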
