BEVStereo: Enhancing Depth Estimation in Multi-view 3D Object Detection with Dynamic Temporal Stereo

Paper title

BEVStereo: Enhancing Depth Estimation in Multi-view 3D Object Detection with Dynamic Temporal Stereo

Paper authors

Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, Zeming Li

Abstract

Bounded by the inherent ambiguity of depth perception, contemporary camera-based 3D object detection methods have hit a performance bottleneck. Intuitively, temporal multi-view stereo (MVS) is a natural remedy for this ambiguity. However, traditional MVS approaches have two flaws when applied to 3D object detection scenes: 1) measuring affinity across all views is computationally expensive; 2) they struggle in outdoor scenarios where objects are often moving. To this end, we introduce an effective temporal stereo method that dynamically selects the scale of matching candidates, significantly reducing computation overhead. Going one step further, we design an iterative algorithm to update more valuable candidates, making it adaptive to moving candidates. We instantiate our proposed method in a multi-view 3D detector, namely BEVStereo. BEVStereo achieves new state-of-the-art performance (i.e., 52.5% mAP and 61.0% NDS) on the camera-only track of the nuScenes dataset. Meanwhile, extensive experiments show that our method handles complex outdoor scenarios better than contemporary MVS approaches. Code has been released at https://github.com/Megvii-BaseDetection/BEVStereo.
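
To put the two reported numbers in context, the sketch below shows how the nuScenes Detection Score (NDS) combines mAP with the five mean true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE), following the benchmark's published definition NDS = (5 * mAP + sum(1 - min(1, mTP))) / 10. The function names are illustrative, not taken from the BEVStereo code, and the abstract does not report the individual error values, so the example only derives the average true-positive score implied by 52.5% mAP and 61.0% NDS.

```python
def nds(m_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean TP errors.

    tp_errors: [mATE, mASE, mAOE, mAVE, mAAE]; each error is clipped to 1
    and converted to a score (1 - error) before averaging with mAP.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five true-positive error metrics")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


def implied_mean_tp_score(m_ap, nds_value):
    """Average (1 - error) over the five TP metrics implied by mAP and NDS."""
    return (10.0 * nds_value - 5.0 * m_ap) / 5.0


# Values from the abstract: 52.5% mAP and 61.0% NDS.
mean_tp = implied_mean_tp_score(0.525, 0.610)
print(f"implied mean TP score: {mean_tp:.3f}")
```

Because mAP carries half of the total weight, a detector with a given mAP can still gain or lose NDS through localization, scale, orientation, velocity, and attribute errors, which is why both figures are reported.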


Paper title