Paper Title
SpaText: Spatio-Textual Representation for Controllable Image Generation
Paper Authors
Paper Abstract
Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText, a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map where each region of interest is annotated by a free-form natural language description. Due to the lack of large-scale datasets with a detailed textual description for each region in an image, we choose to leverage current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, showing its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method, showing that it achieves state-of-the-art results on image generation with free-form textual scene control.
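The spatio-textual representation mentioned in the abstract can be pictured as a dense map that stamps each annotated region's CLIP embedding into the pixels that region covers. Below is a minimal sketch of that idea in Python; build_spatio_textual_map and clip_text_embed are hypothetical names, and the paper's actual construction (including how text embeddings are aligned with the embedding space used during training) may differ:

import numpy as np

def build_spatio_textual_map(segmentation, descriptions, clip_text_embed, embed_dim=512):
    """Rasterize per-region text embeddings into an (H, W, D) tensor.

    segmentation: (H, W) int array; 0 is background, k > 0 marks region k.
    descriptions: dict mapping region id -> free-form natural-language text.
    clip_text_embed: callable mapping a string to an (embed_dim,) vector.
    """
    h, w = segmentation.shape
    spatio_textual = np.zeros((h, w, embed_dim), dtype=np.float32)
    for region_id, text in descriptions.items():
        emb = clip_text_embed(text)  # CLIP embedding of this region's description
        spatio_textual[segmentation == region_id] = emb  # stamp it across the region's pixels
    return spatio_textual

# Toy usage (my_encoder is any CLIP-style text encoder):
# seg = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [2, 2, 0, 0], [2, 2, 0, 0]])
# m = build_spatio_textual_map(seg, {1: "a red balloon", 2: "a wooden bench"}, my_encoder)

Because the resulting tensor has the same spatial extent as the image, one common way to inject such a condition is to concatenate it to the diffusion model's input as extra channels; whether SpaText does exactly this is not specified in the abstract.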
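The abstract's extension of classifier-free guidance to the multi-conditional case can be illustrated as follows. Standard classifier-free guidance with a single condition c and guidance scale s combines a conditioned and an unconditioned noise prediction:

\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + s \, \big( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \big)

One natural generalization to N conditions c_1, \dots, c_N (here N = 2: the global text prompt and the spatio-textual map), with per-condition scales s_i, telescopes between partially conditioned predictions:

\hat{\epsilon}_\theta(x_t, c_1, \dots, c_N) = \epsilon_\theta(x_t, \varnothing) + \sum_{i=1}^{N} s_i \, \big( \epsilon_\theta(x_t, c_1, \dots, c_i) - \epsilon_\theta(x_t, c_1, \dots, c_{i-1}) \big)

This is only a sketch of the general idea; the paper's exact formulation, and the alternative accelerated inference algorithm it proposes, may differ in detail.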