Paper Title

How Far Can I Go?: A Self-Supervised Approach for Deterministic Video Depth Forecasting

Paper Authors

Nag, Sauradip; Shah, Nisarg; Qi, Anran; Ramachandra, Raghavendra

Paper Abstract

In this paper, we present a novel self-supervised method to anticipate the depth estimate for a future, unobserved real-world urban scene. This work is the first to explore self-supervised learning for estimating the monocular depth of future, unobserved frames of a video. Existing works rely on a large number of annotated samples to generate probabilistic predictions of depth for unseen frames. However, the requirement for large amounts of annotated depth data makes them impractical. In addition, the probabilistic formulation, in which one past can have multiple future outcomes, often leads to incorrect depth estimates. Unlike previous methods, we model depth estimation for the unobserved frame as a view-synthesis problem, which treats the depth estimate of the unseen video frame as an auxiliary task while synthesizing back the views using learned pose. This approach is not only cost-effective, since we do not use any ground-truth depth for training (hence practical), but also deterministic (a sequence of past frames maps to an immediate future). To address this task, we first develop a novel depth forecasting network, DeFNet, which estimates the depth of the unobserved future by forecasting latent features. Second, we develop a channel-attention-based pose estimation network that estimates the pose of the unobserved frame. Using this learned pose, the estimated depth map is reconstructed back into the image domain, thus forming a self-supervised solution. Our proposed approach shows significant improvements in the Abs Rel metric compared to state-of-the-art alternatives in both short- and mid-term forecasting settings, benchmarked on KITTI and Cityscapes. Code is available at https://github.com/sauradip/depthForecasting
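The abstract describes a view-synthesis-style self-supervision: the depth predicted for an unobserved future frame, together with a learned relative camera pose, is used to reconstruct that frame from an observed past frame, and the reconstruction error supervises both networks without any ground-truth depth. Below is a minimal sketch of such an inverse-warping reconstruction loss, assuming PyTorch and known camera intrinsics; the function names and the plain L1 photometric term are illustrative only, not the authors' actual DeFNet implementation.

```python
# Sketch of depth + pose based view synthesis as self-supervision.
# Assumptions: PyTorch, known intrinsics K (B,3,3), relative pose pose_T (B,4,4).
import torch
import torch.nn.functional as F


def backproject(depth, K_inv):
    """Lift each pixel to a 3D camera-frame point using predicted depth.
    depth: (B,1,H,W), K_inv: (B,3,3). Returns (B,3,H*W)."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    rays = K_inv @ pix                      # (B,3,H*W) unit-depth rays
    return rays * depth.reshape(b, 1, -1)   # scale rays by predicted depth


def warp_source_to_target(src_img, tgt_depth, pose_T, K, K_inv):
    """Synthesize the (unobserved) target view from an observed source image
    via the predicted target depth and the learned relative pose."""
    b, _, h, w = src_img.shape
    pts = backproject(tgt_depth, K_inv)                       # 3D points in target frame
    pts = pose_T[:, :3, :3] @ pts + pose_T[:, :3, 3:4]        # rigid motion into source frame
    pix = K @ pts                                             # project to source pixels
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)            # perspective divide (naive clamp)
    grid_x = 2.0 * pix[:, 0] / (w - 1) - 1.0                  # normalize to [-1, 1]
    grid_y = 2.0 * pix[:, 1] / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)


def photometric_loss(reconstructed, target):
    """Plain L1 photometric error between the synthesized and real target frame."""
    return (reconstructed - target).abs().mean()
```

In practice, self-supervised depth pipelines usually combine this photometric term with an SSIM term and an edge-aware smoothness regularizer; the sketch only shows the core warping step that lets depth and pose be trained from images alone.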
