Paper Title
Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding
Paper Authors
Abstract
Abstract semantic 3D scene understanding is a problem of critical importance in robotics. As robots still lack the common-sense knowledge about household objects and locations of an average human, we investigate the use of pre-trained language models to impart common sense for scene understanding. We introduce and compare a wide range of scene classification paradigms that leverage language only (zero-shot, embedding-based, and structured-language) or vision and language (zero-shot and fine-tuned). We find that the best approaches in both categories yield $\sim 70\%$ room classification accuracy, exceeding the performance of pure-vision and graph classifiers. We also find such methods demonstrate notable generalization and transfer capabilities stemming from their use of language.