Paper Title

ObPose: Leveraging Pose for Object-Centric Scene Inference and Generation in 3D

Paper Authors

Yizhe Wu, Oiwi Parker Jones, Ingmar Posner

Paper Abstract

We present ObPose, an unsupervised object-centric inference and generation model which learns 3D-structured latent representations from RGB-D scenes. Inspired by prior art in 2D representation learning, ObPose considers a factorised latent space, separately encoding object location (where) and appearance (what). ObPose further leverages an object's pose (i.e. location and orientation), defined via a minimum volume principle, as a novel inductive bias for learning the where component. To achieve this, we propose an efficient, voxelised approximation approach to recover the object shape directly from a neural radiance field (NeRF). As a consequence, ObPose models each scene as a composition of NeRFs, richly representing individual objects. To evaluate the quality of the learned representations, ObPose is evaluated quantitatively on the YCB, MultiShapeNet, and CLEVR datasets for unsupervised scene segmentation, outperforming the current state-of-the-art in 3D scene inference (ObSuRF) by a significant margin. Generative results provide a qualitative demonstration that the same ObPose model can both generate novel scenes and flexibly edit the objects in them. These capacities again reflect the quality of the learned latents and the benefits of disentangling the where and what components of a scene. Key design choices made in the ObPose encoder are validated with ablations.
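To make two of the abstract's ideas more concrete (recovering a voxelised shape from a NeRF density field, and deriving a pose from a minimum volume principle), here is a minimal NumPy sketch. The function names (`voxel_occupancy`, `min_volume_pose`), the density threshold `tau`, and the use of PCA as a cheap proxy for the minimum-volume oriented box are all illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def voxel_occupancy(density_fn, lo, hi, res=32, tau=0.5):
    """Sample a NeRF-style density function on a regular grid and
    threshold it into occupied voxel centres -- a coarse stand-in
    for a voxelised shape approximation."""
    axes = [np.linspace(lo[d], hi[d], res) for d in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (res, res, res, 3)
    pts = grid.reshape(-1, 3)
    sigma = np.asarray(density_fn(pts)).reshape(-1)
    return pts[sigma > tau]  # (N, 3) centres of occupied voxels

def min_volume_pose(points):
    """Estimate a pose from occupied voxel centres: the translation is
    the centroid; the rotation aligns the principal axes, which roughly
    minimises the volume of the shape's oriented bounding box."""
    centre = points.mean(axis=0)
    centred = points - centre
    _, _, vt = np.linalg.svd(centred, full_matrices=False)  # rows of vt = box axes
    extents = np.ptp(centred @ vt.T, axis=0)                # side lengths along the axes
    return centre, vt, float(np.prod(extents))              # translation, rotation, box volume

# Toy check: a hypothetical density field, a ball of radius 0.5 at the origin.
density = lambda p: (np.linalg.norm(p, axis=-1) < 0.5).astype(float)
occupied = voxel_occupancy(density, lo=(-1.0, -1.0, -1.0), hi=(1.0, 1.0, 1.0))
t, R, volume = min_volume_pose(occupied)
```

Note that the PCA step only approximates the minimum-volume box; a faithful realisation of the paper's minimum volume principle would optimise the orientation directly rather than settle for the principal axes.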
