Paper Title
Contextual Modeling for 3D Dense Captioning on Point Clouds
Paper Authors
Paper Abstract
3D dense captioning, as an emerging vision-language task, aims to identify and locate each object in a set of point clouds and to generate a distinctive natural language sentence describing each located object. However, existing methods mainly focus on mining inter-object relationships while ignoring contextual information, especially the non-object details and background environment within the point clouds, which leads to low-quality descriptions such as inaccurate relative position information. In this paper, we make the first attempt to utilize point cloud clustering features as contextual information that supplies the non-object details and background environment of the point clouds, and to incorporate them into the 3D dense captioning task. We propose two separate modules, namely Global Context Modeling (GCM) and Local Context Modeling (LCM), which perform contextual modeling of the point clouds in a coarse-to-fine manner. Specifically, the GCM module captures the inter-object relationships among all objects together with global contextual information to obtain more complete scene information for the whole point cloud. The LCM module exploits the influence of the target object's neighboring objects and local contextual information to enrich the object representations. With such global and local contextual modeling strategies, our proposed model can effectively characterize both object representations and contextual information, and thereby generate comprehensive and detailed descriptions of the located objects. Extensive experiments on the ScanRefer and Nr3D datasets demonstrate that our proposed method sets a new record on the 3D dense captioning task and verify the effectiveness of our proposed contextual modeling of point clouds.
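To make the coarse-to-fine design concrete, below is a minimal sketch of how the GCM and LCM modules could be realized. The attention-based layers, the k-nearest-neighbor grouping, and all module names and signatures are illustrative assumptions; the abstract does not specify the actual implementation.

```python
# Illustrative sketch only: the abstract describes GCM/LCM conceptually,
# so the attention-based design and kNN grouping below are assumptions.
import torch
import torch.nn as nn


class GlobalContextModeling(nn.Module):
    """Coarse stage: relate every detected object to all others, then
    inject global context tokens (e.g., point-cloud cluster features)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, objects: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # objects: (B, N_obj, dim), context: (B, N_ctx, dim)
        x, _ = self.self_attn(objects, objects, objects)  # inter-object relations
        x, _ = self.cross_attn(x, context, context)       # global contextual info
        return x


class LocalContextModeling(nn.Module):
    """Fine stage: refine each object using its k nearest neighboring
    objects, enriching the object representation with local context."""

    def __init__(self, dim: int, k: int = 5, heads: int = 4):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, objects: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # objects: (B, N, dim), centers: (B, N, 3) predicted box centers
        dist = torch.cdist(centers, centers)                # (B, N, N)
        idx = dist.topk(self.k + 1, largest=False).indices  # self + k neighbors
        refined = []
        for b in range(objects.size(0)):
            neigh = objects[b][idx[b]]                      # (N, k+1, dim)
            q = objects[b].unsqueeze(1)                     # (N, 1, dim)
            out, _ = self.attn(q, neigh, neigh)             # attend to neighborhood
            refined.append(out.squeeze(1))
        return torch.stack(refined)                         # (B, N, dim)
```

Under this reading, a scene's object proposals would pass through GCM for scene-level context and then LCM for neighborhood-level refinement, for example with `dim=128, heads=4` on `(B, N, 128)` proposal features, before being fed to the caption decoder.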