Paper Title
simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers
Paper Authors
Paper Abstract
Transfer learning is widely used in computer vision (CV) and natural language processing (NLP), and has achieved great success. Most transfer learning systems operate within a single modality (e.g., RGB images in CV and text in NLP); cross-modality transfer learning (CMTL) systems remain scarce. In this work, we study CMTL from 2D to 3D sensors to explore the upper-bound performance of 3D-sensor-only systems, which play a critical role in robotic navigation and perform well in low-light scenarios. While most CMTL pipelines from 2D to 3D vision are complicated and based on Convolutional Neural Networks (ConvNets), ours is easy to implement and extend, and is based on both ConvNets and Vision Transformers (ViTs): 1) By converting point clouds to pseudo-images, we can use an almost identical network as pre-trained models based on 2D images. This makes our system easy to implement and extend. 2) ViTs have recently shown good performance and robustness to occlusions, one of the key reasons for the poor performance of 3D vision systems. We explore both a ViT and a ConvNet with similar model sizes to investigate the performance difference. We name our approach simCrossTrans: simple cross-modality transfer learning with ConvNets or ViTs. Experiments on the SUN RGB-D dataset show that with simCrossTrans we achieve absolute performance gains of $13.2\%$ and $16.1\%$ based on ConvNets and ViTs, respectively. We also observe that the ViT-based model performs $9.7\%$ better than the ConvNet-based one, showing the power of simCrossTrans with ViTs. simCrossTrans with ViTs surpasses the previous state-of-the-art (SOTA) by a large margin of $+15.4\%$ mAP50. Compared with the previous 2D detection SOTA based on RGB images, our depth-image-only system has only a $1\%$ gap. The code, training/inference logs, and models are publicly available at https://github.com/liketheflower/simCrossTrans
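The core idea enabling the transfer — rendering 3D sensor data as a pseudo-image so a 2D pre-trained backbone can consume it unchanged — can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's actual encoding (which may use a different colorization or channel scheme): it normalizes a single-channel depth map and replicates it to three channels so it matches the RGB input shape expected by pre-trained 2D detectors.

```python
import numpy as np

def depth_to_pseudo_image(depth: np.ndarray) -> np.ndarray:
    """Turn an HxW depth map into an HxWx3 uint8 pseudo-image.

    Hypothetical minimal encoding: min-max normalize valid depths
    (zeros are treated as missing sensor returns) and replicate the
    result across three channels. The paper's pipeline may use a
    richer encoding (e.g. colorization or surface normals).
    """
    d = depth.astype(np.float32)
    valid = d > 0                          # zero depth = no return
    if valid.any():
        lo, hi = d[valid].min(), d[valid].max()
        d = np.where(valid, (d - lo) / max(hi - lo, 1e-6), 0.0)
    img = (d * 255.0).astype(np.uint8)
    # Stack to 3 channels so a 2D RGB-pre-trained backbone accepts it.
    return np.stack([img, img, img], axis=-1)
```

Because the output has the same shape and dtype as an RGB frame, the pre-trained 2D detector's weights and input pipeline need no architectural changes — which is what makes this style of cross-modality transfer simple to implement and extend.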