Paper Title

Less is More: Consistent Video Depth Estimation with Masked Frames Modeling

Paper Authors

Yiran Wang, Zhiyu Pan, Xingyi Li, Zhiguo Cao, Ke Xian, Jianming Zhang

Paper Abstract

Temporal consistency is the key challenge of video depth estimation. Previous works rely on additional optical flow or camera poses, which are time-consuming to compute. By contrast, we derive consistency from less information. Since videos inherently carry heavy temporal redundancy, a missing frame can be recovered from its neighboring ones. Inspired by this, we propose the frame masking network (FMNet), a spatial-temporal transformer network that predicts the depth of masked frames based on their neighboring frames. By reconstructing masked temporal features, the FMNet learns intrinsic inter-frame correlations, which leads to consistency. Experimental results demonstrate that, compared with prior arts, our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information. Our work provides a new perspective on consistent video depth estimation. Our official project page is https://github.com/RaymondWang987/FMNet.
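To make the masked-frame-modeling mechanism concrete, below is a minimal PyTorch sketch of the idea the abstract describes: per-frame features are randomly masked, and a temporal transformer must reconstruct them from neighboring frames before a depth head makes its prediction. This is an illustrative toy under assumed shapes and names (MaskedFrameModel, mask_token, depth_head are all invented here), not the paper's actual architecture; the official implementation is at the project page above.

```python
# Minimal sketch (NOT the official FMNet code) of masked-frame modeling.
# All names and shapes here are hypothetical illustrations; see
# https://github.com/RaymondWang987/FMNet for the real implementation.
import torch
import torch.nn as nn

class MaskedFrameModel(nn.Module):
    def __init__(self, dim=256, heads=8, layers=4):
        super().__init__()
        # Learnable token that stands in for the features of a masked frame.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Temporal transformer: lets each frame attend to its neighbors.
        self.temporal = nn.TransformerEncoder(block, layers)
        # Placeholder head; a real model would decode a full depth map.
        self.depth_head = nn.Linear(dim, 1)

    def forward(self, frame_feats, mask_ratio=0.5):
        # frame_feats: (B, T, dim) per-frame features from a spatial encoder.
        B, T, D = frame_feats.shape
        num_mask = int(T * mask_ratio)
        # Randomly choose which frames to hide in each clip.
        order = torch.rand(B, T, device=frame_feats.device).argsort(dim=1)
        keep = torch.ones(B, T, dtype=torch.bool, device=frame_feats.device)
        keep.scatter_(1, order[:, :num_mask], False)
        # Replace masked frames' features with the shared mask token.
        x = torch.where(keep.unsqueeze(-1), frame_feats,
                        self.mask_token.expand(B, T, D))
        # Masked frames can only be reconstructed from their neighbors,
        # which forces the model to learn inter-frame correlations.
        x = self.temporal(x)
        return self.depth_head(x), keep

feats = torch.randn(2, 8, 256)           # 2 clips, 8 frames each
depth, keep = MaskedFrameModel()(feats)   # depth: (2, 8, 1)
```

Masking would apply only at training time, with full clips fed at inference, mirroring the usual masked-modeling recipe: the reconstruction objective forces each frame's representation to be explainable by its neighbors, which is where the temporal consistency the abstract claims would come from.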
