Paper Title
RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving
Authors
Abstract
Although recent image-based 3D object detection methods using the Pseudo-LiDAR representation have shown great capabilities, a notable gap in efficiency and accuracy still exists compared with LiDAR-based methods. Besides, over-reliance on a stand-alone depth estimator, which requires a large number of pixel-wise annotations in the training stage and more computation in the inference stage, limits scaling to real-world applications. In this paper, we propose an efficient and accurate 3D object detection method from stereo images, named RTS3D. Instead of the 3D occupancy space used in Pseudo-LiDAR-like methods, we design a novel 4D feature-consistent embedding (FCE) space as the intermediate representation of the 3D scene without depth supervision. The FCE space encodes the object's structural and semantic information by exploring the multi-scale feature consistency warped from the stereo pair. Furthermore, a semantic-guided RBF (Radial Basis Function) and a structure-aware attention module are devised to reduce the influence of FCE space noise without instance mask supervision. Experiments on the KITTI benchmark show that RTS3D is the first true real-time system (FPS $>$ 24) for stereo image 3D detection, while achieving a $10\%$ improvement in average precision compared with the previous state-of-the-art method. The code will be available at https://github.com/Banconxuan/RTS3D
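To make the idea of stereo feature consistency concrete, the following is a minimal sketch and not the authors' implementation. It projects a candidate 3D point into a rectified left and right view, looks up a feature vector in each, and scores their agreement with a plain Gaussian RBF. All names (`project`, `sample`, `consistency`), the single feature scale, the nearest-neighbour lookup, and the toy 1x8 feature maps are assumptions made for illustration; the semantic guidance and attention module described in the abstract are not modelled.

```python
import math

def project(point, f, cx, cy, x_offset=0.0):
    """Pinhole projection of (X, Y, Z) given in the left-camera frame.
    x_offset moves the camera centre along X (the baseline for the right view)."""
    X, Y, Z = point
    return f * (X - x_offset) / Z + cx, f * Y / Z + cy

def sample(feature_map, u, v):
    """Nearest-neighbour lookup; returns None outside the feature map."""
    row, col = math.floor(v + 0.5), math.floor(u + 0.5)
    if 0 <= row < len(feature_map) and 0 <= col < len(feature_map[0]):
        return feature_map[row][col]
    return None

def consistency(point, left_feat, right_feat, f, cx, cy, baseline, sigma=1.0):
    """Gaussian RBF score in [0, 1]; 1.0 when both views give identical features."""
    fl = sample(left_feat, *project(point, f, cx, cy))
    fr = sample(right_feat, *project(point, f, cx, cy, x_offset=baseline))
    if fl is None or fr is None:
        return 0.0  # point is not visible in both views
    sq_dist = sum((a - b) ** 2 for a, b in zip(fl, fr))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Toy rectified pair: a surface at depth Z = 2 with f = 4 and baseline 1
# has disparity f * baseline / Z = 2 pixels.
f, cx, cy, baseline = 4.0, 0.0, 0.0, 1.0
left = [[[float(c)] for c in range(8)]]        # 1 row, 8 columns, 1 channel
right = [[[float(c + 2)] for c in range(8)]]   # left features shifted by the disparity

on_surface = consistency((2.0, 0.0, 2.0), left, right, f, cx, cy, baseline)
off_surface = consistency((4.0, 0.0, 4.0), left, right, f, cx, cy, baseline)
```

Both test points lie on the same left-camera ray, but only the one at the true depth lands on matching features in the right view, so `on_surface` is higher than `off_surface`. Evaluating such scores over a grid of candidate points is one way to picture how a consistency-based embedding can carry depth cues without depth supervision.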