Paper Title

CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training

Paper Authors

Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson W. H. Lau, Wanli Ouyang, Wangmeng Zuo

Paper Abstract

Pre-training across 3D vision and language remains under development because of limited training data. Recent works attempt to transfer vision-language pre-training models to 3D vision. PointCLIP converts point cloud data to multi-view depth maps and adopts CLIP for shape classification. However, its performance is restricted by the domain gap between rendered depth maps and natural images, as well as by the diversity of depth distributions. To address this issue, we propose CLIP2Point, an image-depth pre-training method that transfers CLIP to the 3D domain via contrastive learning and adapts it to point cloud classification. We introduce a new depth rendering setting that produces better visual quality, and render 52,460 pairs of images and depth maps from ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines cross-modality learning, which enforces depth features to capture expressive visual and textual features, with intra-modality learning, which enhances the invariance of depth aggregation. In addition, we propose a novel Dual-Path Adapter (DPA) module, i.e., a dual-path structure with simplified adapters for few-shot learning. The dual-path structure allows the joint use of CLIP and CLIP2Point, and the simplified adapter fits few-shot tasks well without post-search. Experimental results show that CLIP2Point is effective in transferring CLIP knowledge to 3D vision. Our CLIP2Point outperforms PointCLIP and other self-supervised 3D networks, achieving state-of-the-art results on zero-shot and few-shot classification.
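The abstract describes the pre-training objective only at a high level: a cross-modality contrastive term that aligns rendered depth features with frozen CLIP image features, plus an intra-modality term over two depth renderings of the same shape. The sketch below illustrates one way such a combined loss could look in PyTorch. The module names (depth_encoder, clip_image_encoder), the symmetric InfoNCE formulation, the mean aggregation of the two depth views, and the weight lam are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a CLIP2Point-style pre-training loss, assuming an image-depth
# pair per shape and two rendered depth views per shape. Hypothetical names.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over paired embeddings a[i] <-> b[i], shapes (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device) # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def pretraining_loss(depth_encoder, clip_image_encoder,
                     depth_v1, depth_v2, images, lam: float = 1.0) -> torch.Tensor:
    """depth_v1/depth_v2: two rendered depth views per shape; images: paired renderings."""
    d1 = depth_encoder(depth_v1)                       # trainable depth branch (CLIP-initialized)
    d2 = depth_encoder(depth_v2)
    with torch.no_grad():                              # CLIP image encoder stays frozen
        img = clip_image_encoder(images)
    cross_modal = info_nce((d1 + d2) / 2, img)         # align aggregated depth with image features
    intra_modal = info_nce(d1, d2)                     # invariance across depth views/aggregation
    return cross_modal + lam * intra_modal
```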
