Paper Title

Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation

Authors

Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang, Quan Bai

Abstract

Recently, diffusion models have been proven to perform remarkably well in text-to-image synthesis tasks in a number of studies, immediately presenting new study opportunities for image generation. Google's Imagen follows this research trend and outperforms DALLE2 as the best model for text-to-image generation. However, Imagen merely uses a T5 language model for text processing, which cannot ensure learning the semantic information of the text. Furthermore, the Efficient UNet leveraged by Imagen is not the best choice in image processing. To address these issues, we propose the Swinv2-Imagen, a novel text-to-image diffusion model based on a Hierarchical Vision Transformer and a Scene Graph incorporating a semantic layout. In the proposed model, the feature vectors of entities and relationships are extracted and involved in the diffusion model, effectively improving the quality of generated images. On top of that, we also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which can address the problems stemming from the CNN convolution operations. Extensive experiments are conducted to evaluate the performance of the proposed model by using three real-world datasets, i.e., MSCOCO, CUB and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.
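The abstract states that feature vectors of scene-graph entities and relationships are extracted and involved in the diffusion model alongside the text embedding. A minimal, hypothetical sketch of such conditioning is shown below; the function name, pooling strategy, and dimensions are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def build_conditioning(text_emb, entity_embs, relation_embs):
    """Combine a text embedding (e.g., from T5) with mean-pooled
    scene-graph entity and relation embeddings into a single
    conditioning vector for a diffusion model.

    NOTE: this is an illustrative assumption about the conditioning
    scheme, not the method described in the paper."""
    entity_ctx = entity_embs.mean(axis=0)      # pool over entities
    relation_ctx = relation_embs.mean(axis=0)  # pool over relations
    return np.concatenate([text_emb, entity_ctx, relation_ctx])

# Toy dimensions: an 8-d text embedding, 3 entities and 2 relations,
# each represented by a 4-d vector.
rng = np.random.default_rng(0)
cond = build_conditioning(rng.normal(size=8),
                          rng.normal(size=(3, 4)),
                          rng.normal(size=(2, 4)))
print(cond.shape)  # (16,)
```

The resulting vector would then serve as extra conditioning input to the denoising network, in the same way the text embedding alone conditions the baseline Imagen model.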
