Paper Title
BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection
Paper Authors
Paper Abstract
Single-frame data contains limited information, which constrains the performance of existing vision-based multi-camera 3D object detection paradigms. To fundamentally push the performance boundary in this area, we propose a novel paradigm dubbed BEVDet4D, which lifts the scalable BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space. We upgrade the naive BEVDet framework with only a few modifications, fusing the features from the previous frame with the corresponding ones in the current frame. In this way, with a negligible additional computing budget, BEVDet4D can access temporal cues by querying and comparing the two candidate features. Beyond this, we simplify the velocity prediction task by removing the factors of ego-motion and time from the learning target. As a result, BEVDet4D achieves robust generalization performance and reduces the velocity error by up to 62.9%. This makes vision-based methods, for the first time, comparable in this respect with those relying on LiDAR or radar. On the challenging nuScenes benchmark, we report a new record of 54.5% NDS with the high-performance configuration dubbed BEVDet4D-Base, which surpasses the previous leading method, BEVDet-Base, by +7.3% NDS. The source code is publicly available for further research at https://github.com/HuangJunJie2017/BEVDet.
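To make the temporal-fusion step concrete, below is a minimal PyTorch sketch of the idea the abstract describes: warp the previous frame's BEV feature map into the current ego frame to compensate for ego-motion, then concatenate it with the current BEV feature. This is an assumed illustration, not the official BEVDet4D implementation; all function names, the affine-transform convention, and the tensor shapes are hypothetical.

```python
# Hypothetical sketch of BEV temporal fusion with ego-motion alignment.
# Not the actual BEVDet4D code; names and conventions are illustrative.
import torch
import torch.nn.functional as F


def warp_prev_bev(prev_bev: torch.Tensor, ego_motion: torch.Tensor) -> torch.Tensor:
    """Align the previous BEV feature map (N, C, H, W) to the current frame.

    ego_motion is a (N, 2, 3) affine transform (rotation + translation on the
    BEV plane, in normalized grid coordinates) mapping current-frame
    coordinates to previous-frame coordinates.
    """
    grid = F.affine_grid(ego_motion, list(prev_bev.shape), align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False)


def fuse_temporal(curr_bev: torch.Tensor, prev_bev: torch.Tensor,
                  ego_motion: torch.Tensor) -> torch.Tensor:
    """Concatenate the ego-motion-aligned previous feature with the current one."""
    aligned_prev = warp_prev_bev(prev_bev, ego_motion)
    return torch.cat([curr_bev, aligned_prev], dim=1)  # (N, 2C, H, W)


# Toy usage: identity transform stands in for the real ego pose change.
curr = torch.randn(1, 64, 128, 128)
prev = torch.randn(1, 64, 128, 128)
theta = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
fused = fuse_temporal(curr, prev, theta)
print(fused.shape)  # torch.Size([1, 128, 128, 128])
```

Because the previous feature is warped into the current ego frame before fusion, a head operating on the fused map can regress the positional offset of an object between the two frames directly; this offset is free of ego-motion and of the frame interval, which is the simplified velocity learning target the abstract refers to.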