Paper Title
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
Paper Authors
Paper Abstract
Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable crossmodal generality beyond images and text. Through extensive experiments, we demonstrate that VD successfully achieves the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics, dual- and multi-context blending, etc.; c) the success of our multi-flow multimodal framework over images and text may inspire further diffusion-based universal AI research. Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
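To make the multi-flow idea concrete, below is a minimal sketch of a diffusion block with one shared core and per-modality swappable layers, as the abstract describes. All class and parameter names (`MultiFlowBlock`, `data_layers`, `context_layers`, etc.) are hypothetical illustrations for this sketch, not VD's actual implementation or API.

```python
# Sketch of a multi-flow block: shared core layers plus swappable
# data-stream and context-stream layers selected per task. Names are
# illustrative assumptions, not taken from the VD codebase.
import torch
import torch.nn as nn

class MultiFlowBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Shared core: reused by every flow (the "sharable" modules).
        self.core = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        # Swappable data-stream layers, one per output modality.
        self.data_layers = nn.ModuleDict({
            "image": nn.Linear(dim, dim),
            "text": nn.Linear(dim, dim),
        })
        # Swappable context-stream layers, one per conditioning modality.
        self.context_layers = nn.ModuleDict({
            "image": nn.Linear(dim, dim),
            "text": nn.Linear(dim, dim),
        })

    def forward(self, x, context, data_type: str, context_type: str):
        # Route through the layers matching the current flow, e.g.
        # data_type="image", context_type="text" for text-to-image.
        h = self.data_layers[data_type](x)
        h = h + self.context_layers[context_type](context)
        return self.core(h)

block = MultiFlowBlock(dim=64)
x = torch.randn(2, 64)    # noisy latent being denoised
ctx = torch.randn(2, 64)  # conditioning embedding
t2i = block(x, ctx, data_type="image", context_type="text")   # text-to-image
var = block(x, ctx, data_type="image", context_type="image")  # image variation
```

Under this design, a single set of weights serves text-to-image, image-to-text, and image-variation flows by swapping which data and context layers are active, which is what lets one unified model cover all the base tasks.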