Paper Title

All are Worth Words: A ViT Backbone for Diffusion Models

Authors

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu

Abstract

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods without accessing large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large-scale cross-modality datasets.
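To make the two ideas in the abstract concrete, the PyTorch sketch below treats the timestep, the class condition, and the noisy image patches as tokens in one sequence, and fuses shallow-block activations into deep blocks via long skip connections. This is a minimal illustration, not the authors' released implementation: the names (UViTSketch, Block) and all hyperparameters (dim=256, depth=7, patch_size=4, etc.) are placeholder choices, and a simple linear time embedding stands in for the sinusoidal embedding typically used in diffusion models.

```python
# Minimal sketch of the U-ViT idea: all inputs become tokens, and shallow
# transformer blocks feed deep ones through long skip connections.
# NOT the authors' code; all hyperparameters are illustrative placeholders.

import torch
import torch.nn as nn


class Block(nn.Module):
    """A standard pre-norm transformer block."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))


class UViTSketch(nn.Module):
    def __init__(self, img_size=32, patch_size=4, in_ch=3,
                 dim=256, depth=7, num_classes=10):
        super().__init__()
        self.patch_size = patch_size
        n_patches = (img_size // patch_size) ** 2
        # Noisy image patches, the timestep, and the class condition are all
        # embedded as tokens in the same sequence ("all are worth words").
        self.patch_embed = nn.Linear(patch_size * patch_size * in_ch, dim)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                        nn.Linear(dim, dim))
        self.cond_embed = nn.Embedding(num_classes, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 2, dim))

        half = depth // 2
        self.in_blocks = nn.ModuleList(Block(dim) for _ in range(half))
        self.mid_block = Block(dim)
        self.out_blocks = nn.ModuleList(Block(dim) for _ in range(half))
        # Long skips: each deep block receives a shallow block's output,
        # concatenated with its own input and projected back to `dim`.
        self.skip_projs = nn.ModuleList(nn.Linear(2 * dim, dim)
                                        for _ in range(half))
        self.head = nn.Linear(dim, patch_size * patch_size * in_ch)

    def forward(self, x, t, y):
        B, C, H, W = x.shape
        p = self.patch_size
        # (B, C, H, W) -> (B, n_patches, p*p*C)
        patches = (x.unfold(2, p, p).unfold(3, p, p)
                    .permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p))
        tokens = torch.cat([self.time_embed(t[:, None]).unsqueeze(1),
                            self.cond_embed(y).unsqueeze(1),
                            self.patch_embed(patches)], dim=1)
        tokens = tokens + self.pos_embed

        skips = []
        for blk in self.in_blocks:          # shallow half: stash activations
            tokens = blk(tokens)
            skips.append(tokens)
        tokens = self.mid_block(tokens)
        for blk, proj in zip(self.out_blocks, self.skip_projs):
            # long skip: fuse the matching shallow activation
            tokens = blk(proj(torch.cat([tokens, skips.pop()], dim=-1)))

        # Drop the time/condition tokens; predict noise for each patch.
        return self.head(tokens[:, 2:])


noise_pred = UViTSketch()(torch.randn(2, 3, 32, 32),   # noisy images
                          torch.rand(2),               # timesteps
                          torch.tensor([1, 7]))        # class labels
print(noise_pred.shape)  # torch.Size([2, 64, 48])
```

The output is one predicted-noise vector per patch (64 patches of 4x4x3 values here), which a full model would unpatchify back to image shape. Note that the sequence is never down- or up-sampled, matching the abstract's observation that U-Net's resampling operators are not always necessary; the concat-then-project fusion is one natural realization of the long skip connection the paper identifies as crucial.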
