Paper Title

JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes

Paper Authors

Haimei Zhao, Jing Zhang, Sen Zhang, Dacheng Tao

Paper Abstract

Depth estimation, visual odometry (VO), and bird's-eye-view (BEV) scene layout estimation present three critical tasks for driving scene perception, which is fundamental for motion planning and navigation in autonomous driving. Though they are complementary to each other, prior works usually focus on each individual task and rarely deal with all three tasks together. A naive way is to accomplish them independently in a sequential or parallel manner, but there are many drawbacks, i.e., 1) the depth and VO results suffer from the inherent scale ambiguity issue; 2) the BEV layout is directly predicted from the front-view image without using any depth-related information, although the depth map contains useful geometry clues for inferring scene layouts. In this paper, we address these issues by proposing a novel joint perception framework named JPerceiver, which can simultaneously estimate scale-aware depth and VO as well as BEV layout from a monocular video sequence. It exploits the cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO based on a carefully-designed scale loss. Meanwhile, a cross-view and cross-modal transfer (CCT) module is devised to leverage the depth clues for reasoning road and vehicle layout through an attention mechanism. JPerceiver can be trained in an end-to-end multi-task learning way, where the CGT scale loss and CCT module promote inter-task knowledge transfer to benefit feature learning of each task. Experiments on Argoverse, Nuscenes and KITTI show the superiority of JPerceiver over existing methods on all the above three tasks in terms of accuracy, model size, and inference speed. The code and models are available at~\href{https://github.com/sunnyHelen/JPerceiver}{https://github.com/sunnyHelen/JPerceiver}.
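To make the scale-propagation idea concrete, below is a minimal, hypothetical PyTorch sketch of a CGT-style scale loss. It assumes a flat ground plane, a known camera height, and a front-view road mask (e.g. obtained by re-projecting the BEV road layout into the image); the metric ground depth derived from these serves as absolute-scale anchors for the otherwise scale-ambiguous predicted depth. The function name `cgt_scale_loss`, the tensor shapes, and the 80 m depth cap are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of a CGT-style scale loss: road pixels are assigned
# metric depth from flat-ground geometry, supervising the predicted depth
# with absolute scale. Illustrative only; not the paper's exact loss.
import torch


def cgt_scale_loss(pred_depth, road_mask, intrinsics, cam_height):
    """
    pred_depth : (B, 1, H, W) depth predicted by the network (scale-ambiguous).
    road_mask  : (B, 1, H, W) front-view road mask, e.g. re-projected from the
                 BEV road layout (assumption).
    intrinsics : (B, 3, 3) camera intrinsic matrices.
    cam_height : camera height above the ground plane in metres (scalar).
    """
    B, _, H, W = pred_depth.shape
    device, dtype = pred_depth.device, pred_depth.dtype

    # Pixel grid (u, v) over the image.
    v, _ = torch.meshgrid(
        torch.arange(H, device=device, dtype=dtype),
        torch.arange(W, device=device, dtype=dtype),
        indexing="ij",
    )

    fy = intrinsics[:, 1, 1].view(B, 1, 1)
    cy = intrinsics[:, 1, 2].view(B, 1, 1)

    # Flat-ground geometry: a ground pixel below the horizon satisfies
    # depth = f_y * h / (v - c_y); the camera height h carries absolute scale.
    denom = (v.unsqueeze(0) - cy).clamp(min=1e-3)
    geo_depth = (fy * cam_height / denom).unsqueeze(1)  # (B, 1, H, W)

    # Supervise only road pixels with a plausible geometric depth.
    valid = (road_mask > 0.5) & (geo_depth > 0) & (geo_depth < 80.0)
    if valid.sum() == 0:
        return pred_depth.new_zeros(())
    return torch.abs(pred_depth[valid] - geo_depth[valid]).mean()
```

The point of the sketch is that the metre-level camera height (equivalently, the known metric resolution of the BEV grid) is the only place absolute scale enters; a loss of this kind is how that scale can be propagated to depth and, through the shared photometric objective, to VO, as the abstract describes.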
