Paper Title
Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding
Paper Authors
Abstract
Abstract semantic 3D scene understanding is a problem of critical importance in robotics. As robots still lack the common-sense knowledge about household objects and locations of an average human, we investigate the use of pre-trained language models to impart common sense for scene understanding. We introduce and compare a wide range of scene classification paradigms that leverage language only (zero-shot, embedding-based, and structured-language) or vision and language (zero-shot and fine-tuned). We find that the best approaches in both categories yield $\sim 70\%$ room classification accuracy, exceeding the performance of pure-vision and graph classifiers. We also find such methods demonstrate notable generalization and transfer capabilities stemming from their use of language.