Paper Title

Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding

Authors

Guocheng Qian, Abdullah Hamdi, Xingdi Zhang, Bernard Ghanem

Abstract

While Transformers have achieved impressive success in natural language processing and computer vision, their performance on 3D point clouds is relatively poor. This is mainly due to a limitation of Transformers: their demanding need for extensive training data. Unfortunately, in the realm of 3D point clouds, large datasets are scarce, which exacerbates the difficulty of training Transformers for 3D tasks. In this work, we address the data issue of point cloud Transformers from two perspectives: (i) introducing more inductive bias to reduce the dependency of Transformers on data, and (ii) relying on cross-modality pretraining. More specifically, we first present Progressive Point Patch Embedding and propose a new point cloud Transformer model, PViT. PViT shares the same backbone as the standard Transformer but is shown to be less data-hungry, enabling Transformers to achieve performance comparable to the state of the art. Second, we formulate a simple yet effective pipeline dubbed "Pix4Point" that harnesses Transformers pretrained in the image domain to enhance downstream point cloud understanding. This is achieved through a modality-agnostic Transformer backbone with the help of a tokenizer and a decoder specialized for each domain. Pretrained on a large number of widely available images, PViT achieves significant gains in 3D point cloud classification, part segmentation, and semantic segmentation on ScanObjectNN, ShapeNetPart, and S3DIS, respectively. Our code and models are available at https://github.com/guochengqian/Pix4Point.
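
As a rough illustration of the pipeline described in the abstract, the sketch below (plain PyTorch) wires a point-cloud tokenizer into a standard Transformer encoder, the component into which image-pretrained ViT weights would be loaded, followed by a classification head. All names and hyperparameters here (PointTokenizer, Pix4PointClassifier, num_groups, group_size, embed_dim) are illustrative assumptions rather than the authors' released implementation, and the fixed-size point grouping stands in for the paper's more elaborate Progressive Point Patch Embedding.

# Minimal sketch of the Pix4Point idea: point tokenizer -> standard Transformer
# backbone (where image-pretrained ViT weights would be loaded) -> task head.
# Names and hyperparameters are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class PointTokenizer(nn.Module):
    """Groups a point cloud into patches and embeds each patch into one token.

    Patches are formed by a simple fixed-size split of the points; the paper's
    Progressive Point Patch Embedding is more elaborate.
    """

    def __init__(self, num_groups: int = 64, group_size: int = 32, embed_dim: int = 384):
        super().__init__()
        self.num_groups = num_groups
        self.group_size = group_size
        # Per-point MLP followed by max-pooling over each patch (PointNet-style).
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.GELU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3); this toy split assumes N == num_groups * group_size.
        B, N, _ = xyz.shape
        patches = xyz.view(B, self.num_groups, self.group_size, 3)
        feats = self.mlp(patches)             # (B, G, S, C)
        tokens = feats.max(dim=2).values      # (B, G, C): one token per patch
        return tokens


class Pix4PointClassifier(nn.Module):
    """Point tokenizer + modality-agnostic Transformer backbone + classification head."""

    def __init__(self, num_classes: int = 15, embed_dim: int = 384, depth: int = 12, heads: int = 6):
        super().__init__()
        self.tokenizer = PointTokenizer(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True,
        )
        # This standard encoder is what an image-pretrained ViT checkpoint
        # would initialize in the Pix4Point pipeline.
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        tokens = self.tokenizer(xyz)                          # (B, G, C)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)   # (B, 1, C)
        x = self.backbone(torch.cat([cls, tokens], dim=1))    # (B, 1+G, C)
        return self.head(x[:, 0])                             # classify from the [CLS] token


if __name__ == "__main__":
    model = Pix4PointClassifier()
    points = torch.randn(2, 64 * 32, 3)   # two toy point clouds with 2048 points each
    logits = model(points)
    print(logits.shape)                    # torch.Size([2, 15])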
