Language-driven Semantic Segmentation

Paper Title

Language-driven Semantic Segmentation

Authors

Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, René Ranftl

Abstract

We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample. We demonstrate that our approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided. Code and demo are available at https://github.com/isl-org/lang-seg.
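The matching step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (that is in the linked repository). It assumes the text and image encoders have already produced non-zero embedding vectors of equal length, and all function names, the toy vectors, and the temperature value of 0.07 are hypothetical choices made for illustration. Each pixel embedding is compared with every label embedding by cosine similarity, a softmax turns the similarities into label probabilities, and the most probable label is assigned to the pixel. The per-pixel cross-entropy shown is one plausible form of an objective that pulls a pixel embedding toward the text embedding of its class; the abstract itself does not specify the exact loss.

```python
import math

def normalize(vec):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def label_probabilities(pixel_vec, text_vecs, temperature=0.07):
    # Softmax over cosine similarities between one pixel and every label embedding.
    # temperature=0.07 is an assumed value, not taken from the abstract.
    p = normalize(pixel_vec)
    logits = [sum(a * b for a, b in zip(p, normalize(t))) / temperature
              for t in text_vecs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def segment(pixel_grid, text_vecs, temperature=0.07):
    # pixel_grid is an H x W grid of embedding vectors; returns an H x W grid of label indices.
    mask = []
    for row in pixel_grid:
        mask_row = []
        for vec in row:
            probs = label_probabilities(vec, text_vecs, temperature)
            mask_row.append(probs.index(max(probs)))
        mask.append(mask_row)
    return mask

def pixel_loss(pixel_vec, text_vecs, target, temperature=0.07):
    # Cross-entropy for one pixel: small when the pixel aligns with its target label.
    return -math.log(label_probabilities(pixel_vec, text_vecs, temperature)[target])

# Toy stand-ins for encoder outputs (hypothetical 3-dimensional embeddings).
labels = ["grass", "building"]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pixels = [
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]],
    [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]],
]
mask = segment(pixels, text_vecs)
```

Because the label set enters only through `text_vecs`, appending the embedding of a new label at test time changes the set of possible outputs without touching the pixel embeddings, which is the property the abstract relies on for zero-shot generalization.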
