Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Paper title

Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Authors

Agarwal, Ashutosh; Arora, Chetan

Abstract

Attention-based models such as transformers have shown outstanding performance on dense prediction tasks, such as semantic segmentation, owing to their capability of capturing long-range dependencies in an image. However, the benefit of transformers for monocular depth prediction has seldom been explored so far. This paper benchmarks various transformer-based models for the depth estimation task on the indoor NYUV2 dataset and the outdoor KITTI dataset. We propose a novel attention-based architecture, Depthformer, for monocular depth estimation that uses multi-head self-attention to produce multiscale feature maps, which are effectively combined by our proposed decoder network. We also propose a Transbins module that divides the depth range into bins whose center values are estimated adaptively per image. The final estimated depth is a linear combination of bin centers for each pixel. The Transbins module takes advantage of the global receptive field using the transformer module in the encoding stage. Experimental results on the NYUV2 and KITTI depth estimation benchmarks demonstrate that our proposed method improves the state of the art by 3.3% and 3.3%, respectively, in terms of Root Mean Squared Error (RMSE). Code is available at https://github.com/ashutosh1807/Depthformer.git.
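The abstract states that the depth range is divided into bins with per-image centers, and that each pixel's depth is a linear combination of those centers. A minimal NumPy sketch of that idea follows. It is not taken from the authors' repository: the function names `bin_centers` and `depth_from_bins` are hypothetical, and the way centers are derived (normalized bin widths, cumulative sum, midpoints of consecutive edges) is an assumption borrowed from the common adaptive-binning formulation, which the abstract does not confirm.

```python
import numpy as np

def bin_centers(widths, d_min, d_max):
    """Turn per-image bin widths into bin centers over [d_min, d_max].

    Assumption: widths are non-negative and are normalized to sum to 1;
    each center is the midpoint of its bin.
    """
    widths = np.asarray(widths, dtype=float)
    widths = widths / widths.sum()
    # Bin edges: d_min, then d_min plus the running sum of scaled widths.
    edges = d_min + (d_max - d_min) * np.concatenate(([0.0], np.cumsum(widths)))
    return 0.5 * (edges[:-1] + edges[1:])

def depth_from_bins(probs, centers):
    """Per-pixel depth as a linear combination of bin centers.

    probs:   array of shape (K, H, W), per-pixel weights over K bins
             (assumed to sum to 1 along the first axis).
    centers: array of shape (K,).
    Returns an (H, W) depth map.
    """
    return np.tensordot(centers, probs, axes=(0, 0))

# Hypothetical example: 3 bins over a 0 to 8 m range, a 1x2 image.
centers = bin_centers([1.0, 1.0, 2.0], d_min=0.0, d_max=8.0)
probs = np.zeros((3, 1, 2))
probs[0, 0, 0] = 1.0        # first pixel puts all weight on the nearest bin
probs[:, 0, 1] = 1.0 / 3.0  # second pixel weights all bins equally
depth = depth_from_bins(probs, centers)
```

In a real model the widths and the per-pixel weights would both be network outputs; here they are fixed by hand only to show the arithmetic.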

Paper title