Paper Title
Lafite2: Few-shot Text-to-Image Generation
Paper Authors
Paper Abstract
Text-to-image generation models have progressed considerably in recent years and can now produce impressively realistic images from arbitrary text. Most such models are trained on web-scale image-text paired datasets, which may not be affordable for many researchers. In this paper, we propose a novel method for pre-training text-to-image generation models on image-only datasets. It considers a retrieval-then-optimization procedure to synthesize pseudo text features: for a given image, relevant pseudo text features are first retrieved, then optimized for better alignment. The low requirements of the proposed method yield high flexibility and usability: it can benefit a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning, and it can be applied to different models, including generative adversarial networks (GANs) and diffusion models. Extensive experiments illustrate the effectiveness of the proposed method. On the MS-COCO dataset, our GAN model obtains a Fréchet Inception Distance (FID) of 6.78, a new state-of-the-art (SoTA) for GANs under the fully-supervised setting. Our diffusion model obtains FIDs of 8.42 and 4.28 in the zero-shot and supervised settings, respectively, which are competitive with SoTA diffusion models at a much smaller model size.
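To make the retrieval-then-optimization idea in the abstract concrete, the following PyTorch sketch retrieves the text features nearest to an image embedding from a candidate bank and then refines them by gradient descent on a cosine-alignment loss. This is only a minimal illustration under assumed conventions (CLIP-style L2-normalized embeddings); the function name, the bank, and all hyperparameters are hypothetical and not taken from the paper's implementation.

```python
# Minimal sketch of retrieval-then-optimization for pseudo text features.
# Assumptions (illustrative, not the paper's code): `image_feat` and `text_bank`
# are CLIP-style embeddings in a shared space; k, steps, lr are placeholder values.
import torch
import torch.nn.functional as F


def synthesize_pseudo_text_features(image_feat, text_bank, k=8, steps=100, lr=0.1):
    """Retrieve the k text features closest to the image, then optimize them for alignment."""
    image_feat = F.normalize(image_feat, dim=-1)          # (d,)
    text_bank = F.normalize(text_bank, dim=-1)            # (n, d)

    # Retrieval step: pick the k candidates with highest cosine similarity to the image.
    sims = text_bank @ image_feat                         # (n,)
    retrieved = text_bank[sims.topk(k).indices].clone()   # (k, d)

    # Optimization step: refine the retrieved features by minimizing the
    # negative cosine similarity between each feature and the image embedding.
    pseudo = retrieved.requires_grad_(True)
    opt = torch.optim.Adam([pseudo], lr=lr)
    for _ in range(steps):
        loss = -(F.normalize(pseudo, dim=-1) @ image_feat).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(pseudo.detach(), dim=-1)           # (k, d) pseudo text features


# Usage with random stand-ins for image/text embeddings:
if __name__ == "__main__":
    d, n = 512, 1000
    img = torch.randn(d)
    bank = torch.randn(n, d)
    feats = synthesize_pseudo_text_features(img, bank)
    print(feats.shape)  # torch.Size([8, 512])
```

In this sketch the retrieved features act as an initialization, so the optimized pseudo text features stay close to plausible captions while being pulled toward the specific image, which is the role pseudo text features play when training the generator without paired captions.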