Paper Title
Depth Is All You Need for Monocular 3D Detection
Paper Authors
Paper Abstract
A key contributor to recent progress in 3D detection from single images is monocular depth estimation. Existing methods focus on how to leverage depth explicitly, by generating pseudo-pointclouds or providing attention cues for image features. More recent works leverage depth prediction as a pretraining task and fine-tune the depth representation while training it for 3D detection. However, the adaptation is insufficient and is limited in scale by manual labels. In this work, we propose to further align the depth representation with the target domain in an unsupervised fashion. Our method leverages commonly available LiDAR or RGB videos at training time to fine-tune the depth representation, which leads to improved 3D detectors. Especially when using RGB videos, we show that our two-stage training, which first generates pseudo-depth labels, is critical because of the inconsistency in loss distribution between the two tasks. With either type of reference data, our multi-task learning approach improves over the state of the art on both KITTI and NuScenes, while matching the test-time complexity of its single-task sub-network.
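To make the two-stage schedule described above concrete, below is a minimal sketch, not the authors' implementation: it assumes a PyTorch-style shared backbone with a depth head and a 3D detection head, where stage one fine-tunes the depth representation on pseudo-depth labels (e.g. projected LiDAR or depth distilled from RGB video) and stage two trains detection jointly with an auxiliary depth loss. All module names, loss choices, and the detection target format are illustrative placeholders.

```python
# Hypothetical sketch of the two-stage multi-task schedule; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoDetector(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Shared image encoder whose depth representation is being aligned.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Conv2d(channels, 1, 1)  # dense depth prediction
        self.det_head = nn.Conv2d(channels, 8, 1)    # stand-in for 3D box outputs

    def forward(self, images):
        feats = self.backbone(images)
        return self.depth_head(feats), self.det_head(feats)

def stage1_depth_step(model, optimizer, images, pseudo_depth):
    """Stage 1: align the depth representation with pseudo-depth labels only."""
    depth, _ = model(images)
    loss = F.l1_loss(depth, pseudo_depth)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def stage2_multitask_step(model, optimizer, images, pseudo_depth, det_targets,
                          depth_weight: float = 0.5):
    """Stage 2: joint fine-tuning with detection plus an auxiliary depth loss."""
    depth, det_out = model(images)
    det_loss = F.mse_loss(det_out, det_targets)      # placeholder for a real 3D detection loss
    depth_loss = F.l1_loss(depth, pseudo_depth)
    loss = det_loss + depth_weight * depth_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = MonoDetector()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    images = torch.rand(2, 3, 64, 64)
    pseudo_depth = torch.rand(2, 1, 64, 64)          # e.g. from LiDAR projection
    det_targets = torch.rand(2, 8, 64, 64)           # dummy detection targets
    stage1_depth_step(model, opt, images, pseudo_depth)
    stage2_multitask_step(model, opt, images, pseudo_depth, det_targets)
```

At test time only the detection branch is needed, which is consistent with the abstract's claim that the approach matches the test-time complexity of its single-task sub-network.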