Paper Title
Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training
Paper Authors
Paper Abstract
In this work, we propose a novel single-shot, keypoints-based framework for monocular 3D object detection using only RGB images, called KM3D-Net. We design a fully convolutional model to predict object keypoints, dimensions, and orientation, and then combine these estimates with perspective geometry constraints to compute the 3D position. Further, we reformulate the geometric constraints as a differentiable version and embed them into the network to reduce running time while maintaining the consistency of model outputs in an end-to-end fashion. Benefiting from this simple structure, we then propose an effective semi-supervised training strategy for settings where labeled training data is scarce. In this strategy, we enforce consensus between the predictions of two weight-sharing KM3D-Nets for the same unlabeled image under different input augmentation conditions and network regularization. In particular, we unify the coordinate-dependent augmentations as an affine transformation for differentiable recovery of object positions, and propose a keypoints-dropout module for network regularization. Our model requires only RGB images, without synthetic data, instance segmentation, CAD models, or a depth generator. Nevertheless, extensive experiments on the popular KITTI 3D detection dataset indicate that KM3D-Net surpasses all previous state-of-the-art methods in both efficiency and accuracy by a large margin. Moreover, to the best of our knowledge, this is the first time that semi-supervised learning has been applied to monocular 3D object detection. We even surpass most previous fully supervised methods with only 13\% of the labeled data on KITTI.
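
For illustration, the sketch below (plain NumPy, not the authors' code) shows how the perspective-geometry step mentioned in the abstract can recover an object's 3D position from predicted 2D keypoints, dimensions, and orientation: each projected box corner gives two equations that are linear in the translation, which are then solved by least squares. The function names (box_corners, solve_position), the corner ordering, and the camera convention are illustrative assumptions; in KM3D-Net this step is embedded in the network as a differentiable operation rather than run as a post-hoc solver.

    import numpy as np

    def box_corners(dim, yaw):
        # 8 corners of a 3D box with dimensions (h, w, l), centered at the
        # object origin and rotated by yaw around the vertical (y) axis.
        h, w, l = dim
        x = l / 2 * np.array([ 1,  1, -1, -1,  1,  1, -1, -1])
        y = h / 2 * np.array([ 1,  1,  1,  1, -1, -1, -1, -1])
        z = w / 2 * np.array([ 1, -1, -1,  1,  1, -1, -1,  1])
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[ c, 0, s],
                      [ 0, 1, 0],
                      [-s, 0, c]])
        return R @ np.stack([x, y, z])          # shape (3, 8)

    def solve_position(keypoints_2d, dim, yaw, K):
        # Least-squares estimate of the object center T = (tx, ty, tz) from
        # the projection constraints K @ (X_corner + T) ~ keypoint.
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        A, b = [], []
        for (u, v), X in zip(keypoints_2d, box_corners(dim, yaw).T):
            # u = fx*(Xx + tx)/(Xz + tz) + cx  ->  linear in (tx, ty, tz)
            A.append([fx, 0.0, -(u - cx)]); b.append((u - cx) * X[2] - fx * X[0])
            A.append([0.0, fy, -(v - cy)]); b.append((v - cy) * X[2] - fy * X[1])
        T, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
        return T

In the paper, this geometric solve is reformulated as a differentiable in-network operation so that position gradients can flow back to the keypoint, dimension, and orientation heads; the standalone solver above is only meant to make the underlying constraint explicit.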