Paper Title
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
Paper Authors
Paper Abstract
Recent text-to-image generation methods provide a simple yet exciting conversion capability between the text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain open, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512×512 pixels, significantly improving visual quality. Through scene controllability, we introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story we wrote.
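Point (iii) of the abstract, adapting classifier-free guidance to an autoregressive transformer, amounts to running two forward passes per decoding step (one conditioned on the text prompt, one on a "null" prompt) and extrapolating the conditional logits away from the unconditional ones. Below is a minimal PyTorch-style sketch of that idea; the names `model`, `null_text_tokens`, and `guidance_scale` are illustrative assumptions, not identifiers from the paper.

```python
import torch

def guided_next_token_logits(model, text_tokens, null_text_tokens,
                             scene_tokens, image_tokens, guidance_scale=4.0):
    """Classifier-free guidance for an autoregressive transformer (sketch).

    `model` is assumed to be a decoder-only transformer mapping a token
    sequence of shape [batch, seq] to logits of shape [batch, seq, vocab].
    """
    # Conditional pass: real text prompt + scene tokens + image tokens so far.
    cond_in = torch.cat([text_tokens, scene_tokens, image_tokens], dim=1)
    cond_logits = model(cond_in)[:, -1]  # logits for the next image token

    # Unconditional pass: the text prompt is replaced by padding/null tokens.
    uncond_in = torch.cat([null_text_tokens, scene_tokens, image_tokens], dim=1)
    uncond_logits = model(uncond_in)[:, -1]

    # Extrapolate away from the unconditional distribution; a scale of 1
    # recovers the purely conditional logits, larger values strengthen
    # adherence to the text prompt.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```

In this formulation, the next image token is sampled from the guided logits at every decoding step, so guidance strength trades sample diversity for text alignment.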