Paper Title
ReSTR: Convolution-free Referring Image Segmentation Using Transformers
Paper Authors
Paper Abstract
Referring image segmentation is an advanced semantic segmentation task in which the target is not a predefined class but is instead described in natural language. Most existing methods for this task rely heavily on convolutional neural networks, which, however, have trouble capturing long-range dependencies between entities in the language expression and are not flexible enough to model interactions between the two different modalities. To address these issues, we present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR. Since it extracts features of both modalities through transformer encoders, it can capture long-range dependencies between entities within each modality. Moreover, ReSTR fuses the features of the two modalities with a self-attention encoder, which enables flexible and adaptive interactions between the modalities during fusion. The fused features are fed to a segmentation module, which adapts to the image and language expression at hand. ReSTR is evaluated against previous work on all public benchmarks, where it outperforms all existing models.
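The core idea of the fusion step described above is that image patch tokens and word tokens are placed in one sequence and processed by a self-attention encoder, so every patch can attend to every word and vice versa. The following is a minimal single-head NumPy sketch of that mechanism, not the authors' implementation: the dimensions, token counts, and random projections are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over a token sequence (illustrative)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
d = 16                                     # shared embedding dim (assumed)
patch_tokens = rng.normal(size=(64, d))    # e.g. 8x8 grid of image patches
word_tokens = rng.normal(size=(7, d))      # tokens of the referring expression

# Fusion: concatenate both modalities so each token, visual or
# linguistic, attends to all others in one joint sequence.
tokens = np.concatenate([patch_tokens, word_tokens], axis=0)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
fused = self_attention(tokens, w_q, w_k, w_v)

# Per-patch features, now conditioned on the language expression;
# a segmentation head would decode these into a mask.
fused_patches = fused[:64]
print(fused_patches.shape)  # (64, 16)
```

In a real transformer encoder this block would be repeated in layers with multi-head attention, residual connections, and layer normalization; the sketch shows only the cross-modal attention that makes the fusion flexible.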