Paper Title
Text-to-Image Generation Grounded by Fine-Grained User Attention
Paper Authors
Paper Abstract
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used to select and position masks to generate a fully covered segmentation canvas; the final image is produced by a segmentation-to-image generator using this canvas. This multi-step, retrieval-based approach outperforms existing direct text-to-image generation models on both automatic metrics and human evaluations: overall, its generated images are more photo-realistic and better match descriptions.
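To make the multi-step pipeline concrete, here is a minimal Python sketch of a TReCS-style generation loop: tag traced phrases with object labels, retrieve a mask per label, compose the masks into a segmentation canvas guided by the traces, then render the canvas. The `TracePhrase` structure and all helpers (`predict_labels`, `retrieve_mask`, `compose_canvas`, `seg_to_image`) are hypothetical stand-ins for the paper's trained components, not the actual implementation.

```python
# Hedged sketch of a TReCS-style pipeline; every component below is a
# hypothetical placeholder for a learned model or retrieval index.
from dataclasses import dataclass

@dataclass
class TracePhrase:
    phrase: str                 # phrase from the description
    trace: list                 # (x, y) mouse-trace points aligned to the phrase

def predict_labels(phrases):
    # Hypothetical: tag each traced phrase with an object label
    # (the paper uses a tagger over the description; here, the head noun).
    return [p.phrase.split()[-1] for p in phrases]

def retrieve_mask(label, description):
    # Hypothetical: retrieve a segmentation mask for `label` from a mask
    # database, ranked by compatibility with the full description.
    return {"label": label, "mask": f"<mask for {label}>"}

def compose_canvas(masks, phrases, size=(256, 256)):
    # Hypothetical: position each mask at the region its aligned mouse
    # trace covers, producing a fully covered segmentation canvas.
    canvas = {"size": size, "placements": []}
    for m, p in zip(masks, phrases):
        cx = sum(x for x, _ in p.trace) / len(p.trace)
        cy = sum(y for _, y in p.trace) / len(p.trace)
        canvas["placements"].append((m["label"], (cx, cy)))
    return canvas

def seg_to_image(canvas):
    # Hypothetical: a segmentation-to-image generator renders the
    # final image from the composed canvas.
    return f"<image rendered from {len(canvas['placements'])} placed masks>"

def trecs_generate(description, traced_phrases):
    labels = predict_labels(traced_phrases)
    masks = [retrieve_mask(lbl, description) for lbl in labels]
    canvas = compose_canvas(masks, traced_phrases)
    return seg_to_image(canvas)

# Toy usage with one traced phrase:
phrases = [TracePhrase("a brown dog", [(0.4, 0.6), (0.5, 0.65)])]
print(trecs_generate("a brown dog on the grass", phrases))
```

The design point this sketch reflects is the abstract's central claim: generation is decomposed into retrieval and composition steps grounded by traces, rather than mapping text to pixels in one shot.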