Paper Title
SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation
Paper Authors
Paper Abstract
Since context modeling is critical for estimating depth from a single image, researchers have put tremendous effort into obtaining global context. Many global manipulations have been designed for traditional CNN-based architectures to overcome the locality of convolutions. Attention mechanisms or transformers, originally designed to capture long-range dependencies, might be a better choice, but they usually complicate architectures and can slow inference. In this work, we propose a pure transformer architecture called SideRT that attains excellent predictions in real time. To capture better global context, Cross-Scale Attention (CSA) and Multi-Scale Refinement (MSR) modules are designed to work collaboratively to fuse features of different scales efficiently. CSA modules focus on fusing features of high semantic similarity, while MSR modules fuse features at corresponding positions. These two modules contain few learnable parameters and no convolutions, and a lightweight yet effective model is built on them. This architecture achieves state-of-the-art performance in real time (51.3 FPS) and becomes much faster, with a reasonable performance drop, on the smaller Swin-T backbone (83.1 FPS). Furthermore, it surpasses the previous state-of-the-art by a large margin, improving the AbsRel metric by 6.9% on KITTI and 9.7% on NYU. To the best of our knowledge, this is the first work to show that transformer-based networks can attain state-of-the-art performance in real time for single image depth estimation. Code will be made available soon.
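The abstract does not give CSA's internals, so the following is only a minimal PyTorch sketch of what a convolution-free cross-scale attention fusion could look like, assuming fine-scale tokens query coarse-scale tokens so that features of high semantic similarity are fused. The class name, dimensions, and residual fusion rule are illustrative assumptions, not the authors' implementation.

```python
# A hypothetical sketch of cross-scale attention fusion; not the SideRT code.
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    """Fuse coarse-scale features into fine-scale features via attention.

    Queries come from the fine scale; keys/values come from the coarse
    scale, so each fine-scale token attends to semantically similar
    coarse tokens. Only linear projections are used -- no convolutions.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, N_fine, C) tokens from the higher-resolution scale
        # coarse: (B, N_coarse, C) tokens from the lower-resolution scale
        q = self.norm_q(fine)
        kv = self.norm_kv(coarse)
        fused, _ = self.attn(q, kv, kv)
        return fine + fused  # residual fusion keeps fine-scale detail

# Usage: fuse two Swin-style stage outputs of matching channel width.
if __name__ == "__main__":
    csa = CrossScaleAttention(dim=96)
    fine = torch.randn(2, 56 * 56, 96)    # e.g. stage-1 tokens
    coarse = torch.randn(2, 28 * 28, 96)  # e.g. projected stage-2 tokens
    print(csa(fine, coarse).shape)        # torch.Size([2, 3136, 96])
```

Under these assumptions, an MSR-style counterpart would instead combine features at corresponding spatial positions (e.g., after upsampling the coarse tokens), rather than letting every fine token attend over all coarse tokens.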