Paper Title


DepthFormer: Multimodal Positional Encodings and Cross-Input Attention for Transformer-Based Segmentation Networks

Paper Authors

Francesco Barbato, Giulia Rizzoli, Pietro Zanuttigh

Abstract


Most approaches to semantic segmentation use only information from color cameras to parse scenes, yet recent advances show that depth data can further improve performance. In this work, we focus on transformer-based deep learning architectures, which have achieved state-of-the-art performance on the segmentation task, and we propose to exploit depth information by embedding it in the positional encoding. This effectively extends the network to multimodal data without adding any parameters, in a natural way that leverages the strength of the transformers' self-attention modules. We also investigate performing cross-modality operations inside the attention module, swapping the key inputs between the depth and color branches. Our approach consistently improves performance on the Cityscapes benchmark.
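The two ideas in the abstract can be illustrated with a toy NumPy sketch: a sinusoidal positional encoding computed from per-patch depth values instead of token indices (adding no learnable parameters), and a single cross-input attention step where the color and depth branches exchange their key matrices. This is only an illustrative sketch under assumed shapes, not the authors' implementation; all names (`sinusoidal_encoding`, `attention`, the token tensors) are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(values, dim):
    """Sinusoidal encoding of scalar depth values (n,) -> (n, dim), dim even."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = values[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def attention(q, k, v):
    """Standard scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 6, 8                                   # n patch tokens, d channels
rgb_tokens = rng.standard_normal((n, d))      # color-branch tokens
depth_tokens = rng.standard_normal((n, d))    # depth-branch tokens
depth_values = rng.uniform(0.5, 20.0, n)      # per-patch depth (e.g. meters)

# Depth-aware positional encoding, added to both branches: multimodal
# information enters the network with zero extra parameters.
pe = sinusoidal_encoding(depth_values, d)
q_rgb = k_rgb = v_rgb = rgb_tokens + pe
q_dep = k_dep = v_dep = depth_tokens + pe

# Cross-input attention: the keys are swapped between the two branches,
# so each branch attends using the other modality's keys.
out_rgb = attention(q_rgb, k_dep, v_rgb)
out_dep = attention(q_dep, k_rgb, v_dep)
print(out_rgb.shape, out_dep.shape)
```

In a real network the encoding would be added to patch embeddings inside each transformer block and the key swap applied per attention head; the sketch only shows the data flow of the two mechanisms.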
