Paper Title
RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer
Paper Authors
Paper Abstract
Recently, transformer-based networks have shown impressive results in semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based approaches still dominate in this field due to the time-consuming computation mechanism of transformers. We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation, which achieves a better trade-off between performance and efficiency than CNN-based models. To achieve high inference efficiency on GPU-like devices, our RTFormer leverages GPU-Friendly Attention with linear complexity and discards the multi-head mechanism. Besides, we find that cross-resolution attention is more efficient at gathering global context information for the high-resolution branch by spreading the high-level knowledge learned from the low-resolution branch. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer: it achieves state-of-the-art performance on Cityscapes, CamVid and COCOStuff, and shows promising results on ADE20K. Code is available at PaddleSeg: https://github.com/PaddlePaddle/PaddleSeg.
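To illustrate why attention with learnable external keys/values can have linear rather than quadratic complexity in the number of tokens, here is a minimal NumPy sketch in the spirit of external attention with double normalization. This is an assumption-laden illustration, not the paper's exact GPU-Friendly Attention: the function name `linear_attention_sketch` and parameters `k_ext`/`v_ext` are hypothetical, and the grouping tricks RTFormer uses in place of multi-head attention are omitted.

```python
import numpy as np

def linear_attention_sketch(x, k_ext, v_ext):
    """Sketch of linear-complexity attention (illustrative, not RTFormer's
    exact GFA). x: (n, d) token features; k_ext, v_ext: (m, d) learnable
    external keys/values with a fixed, small m. All matrix products cost
    O(n * m * d), i.e. linear in the token count n, unlike the O(n^2 * d)
    of standard self-attention."""
    logits = x @ k_ext.T                                     # (n, m)
    # Double normalization as in external attention:
    # 1) softmax over the token axis (numerically stabilized),
    attn = np.exp(logits - logits.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)
    # 2) L1 normalization over the external-key axis.
    attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-6)
    return attn @ v_ext                                      # (n, d)

# Usage: 16 tokens of width 8 attend to m = 4 external slots.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
k_ext = rng.normal(size=(4, 8))
v_ext = rng.normal(size=(4, 8))
out = linear_attention_sketch(x, k_ext, v_ext)
print(out.shape)  # (16, 8)
```

Because `k_ext` and `v_ext` are learned parameters of fixed size rather than projections of the input, the cost grows linearly with the number of tokens, which is what makes this family of attention attractive for high-resolution, real-time segmentation.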