Paper Title
Reduce Information Loss in Transformers for Pluralistic Image Inpainting
Paper Authors
Paper Abstract
Transformers have recently achieved great success in pluralistic image inpainting. However, we find that existing transformer-based solutions regard each pixel as a token and thus suffer from information loss in two ways: 1) They downsample the input image to a much lower resolution for efficiency, incurring information loss and extra misalignment at the boundaries of masked regions. 2) They quantize $256^3$ RGB pixels to a small number (such as 512) of quantized pixels, and the indices of these quantized pixels are used as the tokens for both the inputs and the prediction targets of the transformer. Although an extra CNN is used to upsample and refine the low-resolution results, it is difficult to recover the lost information. To preserve as much input information as possible, we propose a new transformer-based framework, "PUT". Specifically, to avoid input downsampling while maintaining computational efficiency, we design a patch-based auto-encoder, P-VQVAE, whose encoder converts the masked image into non-overlapping patch tokens and whose decoder recovers the masked regions from the inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by quantization, we apply an Un-Quantized Transformer (UQ-Transformer), which directly takes the features from the P-VQVAE encoder as input without quantization and regards the quantized tokens only as prediction targets. Extensive experiments show that PUT greatly outperforms state-of-the-art methods in image fidelity, especially for large masked regions and complex large-scale datasets. Code is available at https://github.com/liuqk3/PUT
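
Below is a minimal PyTorch sketch of the two core ideas described in the abstract: encoding the image as non-overlapping patch tokens instead of downsampled pixels, and feeding the transformer un-quantized features while using quantized code indices only as prediction targets. All module names, dimensions, and hyperparameters here are illustrative assumptions, and the P-VQVAE decoder and masking logic are omitted; this is not the authors' implementation (see the linked repository for that).

# Illustrative sketch only; names, dimensions, and hyperparameters are
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Encodes an image into non-overlapping patch features (no pixel downsampling)."""
    def __init__(self, patch=8, dim=256):
        super().__init__()
        # Kernel size and stride equal to the patch size: each output feature
        # corresponds to exactly one non-overlapping patch.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, H, W)
        f = self.proj(x)                       # (B, dim, H/patch, W/patch)
        return f.flatten(2).transpose(1, 2)    # (B, N, dim) patch tokens

class Codebook(nn.Module):
    """Vector quantizer, used here only to produce discrete prediction targets."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.embed = nn.Embedding(num_codes, dim)

    def token_ids(self, f):                    # f: (B, N, dim)
        # Nearest-neighbor lookup in the codebook.
        codes = self.embed.weight.unsqueeze(0).expand(f.size(0), -1, -1)
        return torch.cdist(f, codes).argmin(dim=-1)   # (B, N) indices

class UQTransformer(nn.Module):
    """Takes un-quantized encoder features as input; predicts code indices."""
    def __init__(self, dim=256, num_codes=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, f):                      # f: (B, N, dim), continuous
        return self.head(self.blocks(f))       # (B, N, num_codes) logits

# Toy forward pass: continuous features go to the transformer without
# quantization, while quantized indices serve only as training targets.
enc, book, tfm = PatchEncoder(), Codebook(), UQTransformer()
img = torch.randn(2, 3, 64, 64)                # stand-in for a masked input image
feats = enc(img)                               # (2, 64, 256) continuous tokens
logits = tfm(feats)                            # predictions over 512 codes
targets = book.token_ids(feats)                # (2, 64) quantized targets
loss = nn.functional.cross_entropy(logits.view(-1, 512), targets.view(-1))

This reflects the stated design choice: because the transformer never sees quantized inputs, the quantization step can no longer discard input information before prediction; the codebook's only role is to define a discrete output space.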