Paper Title
CroMo: Cross-Modal Learning for Monocular Depth Estimation
Paper Authors
Paper Abstract
Learning-based depth estimation has witnessed recent progress in multiple directions, from self-supervision using monocular video to supervised methods offering the highest accuracy. Complementary to supervision, further boosts to performance and robustness are gained by combining information from multiple signals. In this paper, we systematically investigate key trade-offs associated with sensor and modality design choices as well as related model training strategies. Our study leads us to a new method capable of connecting modality-specific advantages from polarisation, Time-of-Flight and structured-light inputs. We propose a novel pipeline that estimates depth from monocular polarisation, for which we evaluate various training signals. The inversion of differentiable analytic models thereby connects scene geometry with polarisation and ToF signals and enables self-supervised and cross-modal learning. In the absence of existing multimodal datasets, we examine our approach with a custom-made multi-modal camera rig and collect CroMo, the first dataset to consist of synchronized stereo polarisation, indirect ToF and structured-light depth, captured at video rates. Extensive experiments on challenging video scenes confirm both qualitative and quantitative pipeline advantages, where we are able to outperform competitive monocular depth estimation methods.
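The abstract does not spell out the analytic models it inverts; as a minimal sketch, assuming the standard indirect-ToF relation φ = 4πfd/c between depth and modulation phase, the snippet below illustrates how a differentiable forward model could turn a measured phase map into a self-supervision signal for a depth network. The function names (`itof_phase_from_depth`, `itof_self_supervision_loss`) and the modulation-frequency argument are hypothetical illustrations, not taken from the paper.

```python
import torch

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def itof_phase_from_depth(depth_m: torch.Tensor, mod_freq_hz: float) -> torch.Tensor:
    """Differentiable forward model: iToF modulation phase implied by depth.

    Hypothetical sketch using the standard relation phi = 4*pi*f*d / c,
    wrapped to [0, 2*pi); the paper's exact formulation is not given in the abstract.
    """
    phase = 4.0 * torch.pi * mod_freq_hz * depth_m / SPEED_OF_LIGHT
    return torch.remainder(phase, 2.0 * torch.pi)


def itof_self_supervision_loss(pred_depth_m: torch.Tensor,
                               measured_phase: torch.Tensor,
                               mod_freq_hz: float) -> torch.Tensor:
    """Penalise disagreement between predicted-depth phase and measured phase.

    Because the forward model is differentiable, gradients flow back into the
    depth network, which is the mechanism that enables self-supervised and
    cross-modal training from raw sensor signals.
    """
    pred_phase = itof_phase_from_depth(pred_depth_m, mod_freq_hz)
    # Wrap-aware residual so errors near the 2*pi boundary are not over-penalised.
    diff = torch.atan2(torch.sin(pred_phase - measured_phase),
                       torch.cos(pred_phase - measured_phase))
    return diff.abs().mean()


if __name__ == "__main__":
    # Toy usage: a random depth prediction supervised by a simulated phase map.
    depth_pred = torch.rand(1, 1, 64, 64, requires_grad=True) * 5.0  # metres
    phase_meas = itof_phase_from_depth(torch.rand(1, 1, 64, 64) * 5.0, 20e6)
    loss = itof_self_supervision_loss(depth_pred, phase_meas, mod_freq_hz=20e6)
    loss.backward()
    print(loss.item())
```

An analogous differentiable model, for instance a Fresnel-based expression of the degree and angle of linear polarisation as a function of surface orientation, would presumably play the same role for the polarisation branch, though its exact form is likewise not specified in the abstract.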