Paper Title

X-Align: Cross-Modal Cross-View Alignment for Bird's-Eye-View Segmentation

Paper Authors

Shubhankar Borse, Marvin Klingner, Varun Ravi Kumar, Hong Cai, Abdulaziz Almuzairee, Senthil Yogamani, Fatih Porikli

Paper Abstract

Bird's-eye-view (BEV) grid is a common representation for the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. Latest works leverage both camera and LiDAR modalities, but sub-optimally fuse their features using simple, concatenation-based mechanisms. In this paper, we address these problems by enhancing the alignment of the unimodal features in order to aid feature fusion, as well as enhancing the alignment between the cameras' perspective view (PV) and BEV representations. We propose X-Align, a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation consisting of the following components: (i) a novel Cross-Modal Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation. We evaluate our proposed method across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes. We also provide extensive ablation studies to demonstrate the effectiveness of the individual components.
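
As a rough, hypothetical sketch (not the authors' released implementation), the attention-based cross-modal fusion (X-FF) and feature-alignment (X-FA) ideas described in the abstract could look like the following in PyTorch. The module name CrossModalFusion, the per-cell softmax weighting over modalities, and the cosine-similarity alignment loss are illustrative assumptions; the abstract does not specify the exact architecture or loss form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Hypothetical attention-based fusion of camera and LiDAR BEV features.

    Each BEV cell gets a softmax weight per modality, so the fused feature
    can favor whichever modality is more informative at that location.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Predict one attention logit per modality for every BEV cell.
        self.attn = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, lidar_bev: (B, C, H, W) BEV feature maps.
        logits = self.attn(torch.cat([cam_bev, lidar_bev], dim=1))  # (B, 2, H, W)
        weights = torch.softmax(logits, dim=1)
        # Weighted sum of the two unimodal BEV features.
        return weights[:, 0:1] * cam_bev + weights[:, 1:2] * lidar_bev


def cross_modal_alignment_loss(cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
    """Hypothetical X-FA-style loss: pull the unimodal BEV features toward
    agreement by maximizing their per-cell cosine similarity."""
    cos = F.cosine_similarity(cam_bev, lidar_bev, dim=1)  # (B, H, W)
    return (1.0 - cos).mean()


if __name__ == "__main__":
    # Toy usage with assumed shapes: batch 2, 64 channels, 100x100 BEV grid.
    cam = torch.randn(2, 64, 100, 100)
    lidar = torch.randn(2, 64, 100, 100)
    fusion = CrossModalFusion(channels=64)
    fused = fusion(cam, lidar)
    loss = cross_modal_alignment_loss(cam, lidar)
    print(fused.shape, loss.item())
```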
