Paper Title
Spatially Multi-conditional Image Generation
Paper Authors
Paper Abstract
In most scenarios, conditional image generation can be thought of as an inversion of the image understanding process. Since generic image understanding involves solving multiple tasks, it is natural to aim at generating images via multi-conditioning. However, multi-conditional image generation is a very challenging problem due to the heterogeneity and the sparsity of the (in practice) available conditioning labels. In this work, we propose a novel neural architecture to address the problem of heterogeneity and sparsity of the spatially multi-conditional labels. Our choice of spatial conditioning, such as by semantics and depth, is driven by the promise it holds for better control of the image generation process. The proposed method uses a transformer-like architecture operating pixel-wise, which receives the available labels as input tokens and merges them in a learned homogeneous space of labels. The merged labels are then used for image generation via conditional generative adversarial training. In this process, the sparsity of the labels is handled by simply dropping the input tokens corresponding to the missing labels at the desired locations, which is possible thanks to the proposed pixel-wise operating architecture. Our experiments on three benchmark datasets demonstrate the clear superiority of our method over the state of the art and the compared baselines. The source code will be made publicly available.
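The abstract does not include a reference implementation, but the described mechanism (per-pixel tokens, one per available condition, merged by a transformer-like module, with tokens for missing labels simply dropped) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' code; all names (e.g. LabelMerger, d_model, type_token) and design details such as mean-pooling over condition tokens are assumptions.

```python
# Illustrative sketch of pixel-wise merging of heterogeneous spatial conditions.
# Missing conditions are handled by not passing them in, i.e. their tokens are dropped.
import torch
import torch.nn as nn


class LabelMerger(nn.Module):
    """Merge a variable subset of spatial condition maps (e.g. semantics, depth)
    into one homogeneous per-pixel feature map."""

    def __init__(self, cond_channels, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # one linear embedding per condition type, mapping it to the shared label space
        self.embed = nn.ModuleDict(
            {name: nn.Linear(c, d_model) for name, c in cond_channels.items()}
        )
        # learned per-condition token identifying the condition type
        self.type_token = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(d_model)) for name in cond_channels}
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)

    def forward(self, conds):
        # conds: dict name -> tensor of shape (B, C_name, H, W); any subset may be present
        name0 = next(iter(conds))
        B, _, H, W = conds[name0].shape
        tokens = []
        for name, x in conds.items():
            # one token per pixel and per available condition
            t = x.permute(0, 2, 3, 1).reshape(B * H * W, 1, -1)
            tokens.append(self.embed[name](t) + self.type_token[name])
        seq = torch.cat(tokens, dim=1)          # (B*H*W, n_available_conds, d_model)
        merged = self.encoder(seq).mean(dim=1)  # pool over the available condition tokens
        return merged.reshape(B, H, W, -1).permute(0, 3, 1, 2)  # (B, d_model, H, W)


# usage: if the depth map is missing, its token is simply never created
merger = LabelMerger({"semantics": 20, "depth": 1})
sem = torch.randn(2, 20, 32, 32)
merged = merger({"semantics": sem})  # works without the depth condition
print(merged.shape)                  # torch.Size([2, 64, 32, 32])
```

The merged per-pixel feature map would then serve as the conditioning input to a conditional GAN generator, as described in the abstract; that part is omitted here.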