Paper Title
Towards Purely Unsupervised Disentanglement of Appearance and Shape for Person Images Generation
Paper Authors
Paper Abstract
There has been considerable research interest in exploring the disentanglement of appearance and shape from human images. Most existing endeavours pursue this goal either by using training images with annotations or by regulating the training process with external clues such as human skeletons, body segmentations, or cloth patches. In this paper, we aim to address this challenge in a more unsupervised manner: we require neither annotations nor any external task-specific clues. To this end, we formulate an encoder-decoder-like network that extracts both the shape and appearance features from input images at the same time, and we train the parameters with three losses: a feature adversarial loss, a color consistency loss, and a reconstruction loss. The feature adversarial loss enforces little to no mutual information between the extracted shape and appearance features, while the color consistency loss encourages the invariance of person appearance conditioned on different shapes. More importantly, our unsupervised framework (unsupervised learning has many interpretations in different tasks; to be clear, in this paper we refer to unsupervised learning as learning without task-specific human annotations, pairs, or any form of weak supervision) utilizes the learned shape features as masks, which are applied to the input itself in order to obtain clean appearance features. Without using a fixed input human skeleton, our network better preserves the conditional human posture while requiring less supervision. Experimental results on DeepFashion and Market1501 demonstrate that the proposed method achieves clean disentanglement and is able to synthesize novel images of comparable quality to state-of-the-art weakly-supervised or even supervised methods.
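The abstract's central mechanism — using the learned shape features as a mask over the input image to isolate appearance — can be sketched in a few lines. The sketch below is illustrative only, not the authors' implementation: the `shape_mask` stand-in (a simple intensity normalization) and the toy appearance descriptor are hypothetical placeholders for the paper's learned encoders, and only the L1 reconstruction term of the three-loss objective is shown.

```python
import numpy as np

def shape_mask(image):
    """Hypothetical stand-in for the learned shape encoder: produces a
    soft spatial mask in [0, 1] (here, normalized per-pixel intensity)."""
    gray = image.mean(axis=-1, keepdims=True)          # (H, W, 1)
    return (gray - gray.min()) / (np.ptp(gray) + 1e-8)

def appearance_features(image):
    """Appearance stream as described in the abstract: the shape mask is
    applied to the input itself so appearance statistics are pooled only
    over masked-in pixels (toy descriptor: mask-weighted mean color)."""
    mask = shape_mask(image)                           # (H, W, 1)
    masked = image * mask                              # broadcast to (H, W, 3)
    return masked.sum(axis=(0, 1)) / (mask.sum() + 1e-8)

def reconstruction_loss(image, recon):
    """L1 reconstruction loss, one of the three training losses
    (the adversarial and color-consistency terms are omitted here)."""
    return np.abs(image - recon).mean()

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
feat = appearance_features(img)                        # per-channel appearance
loss = reconstruction_loss(img, img)                   # perfect recon -> 0.0
print(feat.shape, loss)
```

In the actual framework, `shape_mask` and `appearance_features` would be trained jointly with a decoder, with the feature adversarial loss discouraging mutual information between the two feature streams.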