Paper Title

Depth Field Networks for Generalizable Multi-view Scene Representation

Paper Authors

Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Greg Shakhnarovich, Matthew Walter, Adrien Gaidon

Paper Abstract

Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.
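Two ingredients the abstract highlights can be sketched concretely: feeding geometric priors (per-pixel camera rays) to the network as inputs rather than enforced constraints, and 3D data augmentation via virtual camera perturbations to increase view diversity. Below is a minimal, hypothetical PyTorch sketch of both ideas; the function names, noise scales, and structure are illustrative assumptions, not the authors' DeFiNe implementation.

```python
import torch

def pixel_rays(K: torch.Tensor, T_world_cam: torch.Tensor, H: int, W: int):
    """Per-pixel viewing rays in world coordinates — one common way to
    encode geometry as a network *input* instead of a hard constraint.

    K: (3, 3) intrinsics; T_world_cam: (4, 4) camera-to-world pose.
    Returns ray origins (H, W, 3) and unit directions (H, W, 3).
    """
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T          # unproject pixel centers
    dirs_world = dirs_cam @ T_world_cam[:3, :3].T   # rotate into world frame
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    origins = T_world_cam[:3, 3].expand(H, W, 3)    # camera center, broadcast
    return origins, dirs_world


def jitter_pose(T_world_cam: torch.Tensor, t_std: float = 0.05, r_std: float = 0.02):
    """Virtual-camera augmentation: perturb a pose with small random
    translation and rotation noise (noise scales are illustrative guesses)."""
    w = torch.randn(3) * r_std                      # random axis-angle vector
    skew = torch.tensor([[0.0, -w[2], w[1]],
                         [w[2], 0.0, -w[0]],
                         [-w[1], w[0], 0.0]])
    T = T_world_cam.clone()
    T[:3, :3] = T[:3, :3] @ torch.linalg.matrix_exp(skew)  # exact SO(3) exp map
    T[:3, 3] += torch.randn(3) * t_std
    return T


if __name__ == "__main__":
    K = torch.tensor([[500.0, 0.0, 320.0],
                      [0.0, 500.0, 240.0],
                      [0.0, 0.0, 1.0]])
    T = torch.eye(4)
    origins, dirs = pixel_rays(K, jitter_pose(T), H=480, W=640)
    print(origins.shape, dirs.shape)  # (480, 640, 3) each
```

In a model of this kind, such per-pixel ray encodings would be combined with image embeddings before the Transformer encoder, and the jittered virtual cameras would expand the diversity of viewpoints seen during training, in line with the augmentation strategy the abstract describes.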
