Paper Title

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Authors

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or

Abstract


Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available at: https://textual-inversion.github.io
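The core idea of the abstract, learning a single new "word" embedding while the rest of the text-to-image model stays frozen, can be illustrated with a minimal toy sketch. This is not the paper's implementation: the frozen generator is stood in by a fixed random linear map, the concept images by a hypothetical target feature vector, and the only trainable parameter is the new embedding, optimized by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen text-to-image model: a fixed linear map from
# token-embedding space to a feature space. In textual inversion the real
# (much larger) model is likewise kept frozen.
EMBED_DIM, FEAT_DIM = 8, 16
W = rng.normal(size=(FEAT_DIM, EMBED_DIM))

# Hypothetical features of the user's concept (derived, in the real
# method, from the 3-5 example images).
target = rng.normal(size=FEAT_DIM)

# The single trainable parameter: one new "word" embedding v*.
v = np.zeros(EMBED_DIM)

def loss(v):
    # Squared reconstruction error of the concept features.
    r = W @ v - target
    return float(r @ r)

lr = 0.01
initial = loss(v)
for _ in range(500):
    grad = 2.0 * W.T @ (W @ v - target)  # gradient of ||W v - target||^2
    v -= lr * grad  # only the embedding is updated; W never changes

final = loss(v)
print(f"loss: {initial:.3f} -> {final:.3f}")
```

After training, `v` plays the role of the learned pseudo-word: it can be dropped into any "sentence" (here, any input to the frozen map) without retraining the model itself.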
