Paper Title
Self-Supervised Scene De-occlusion
Paper Authors
Paper Abstract
Natural scene understanding is a challenging task, particularly when encountering images of multiple objects that are partially occluded. This difficulty arises from varying object ordering and positioning. Existing scene understanding paradigms can parse only the visible parts, resulting in incomplete and unstructured scene interpretations. In this paper, we investigate the problem of scene de-occlusion, which aims to recover the underlying occlusion ordering and complete the invisible parts of occluded objects. We make the first attempt to address the problem with a novel and unified framework that recovers hidden scene structures without ordering or amodal annotations as supervision. This is achieved via two Partial Completion Networks, PCNet-mask (M) and PCNet-content (C), which learn to recover fractions of object masks and contents, respectively, in a self-supervised manner. Based on PCNet-M and PCNet-C, we devise a novel inference scheme that accomplishes scene de-occlusion through progressive ordering recovery, amodal completion, and content completion. Extensive experiments on real-world scenes demonstrate the superior performance of our approach over alternatives. Remarkably, our approach, trained in a self-supervised manner, achieves results comparable to fully supervised methods. The proposed scene de-occlusion framework benefits many applications, including high-quality, controllable image manipulation and scene recomposition (see Fig. 1), as well as the conversion of existing modal mask annotations to amodal mask annotations.
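To make the three-stage inference scheme concrete, below is a minimal sketch of ordering recovery, amodal completion, and content completion. It is an illustration under stated assumptions, not the authors' released implementation: `pcnet_m` and `pcnet_c` stand in for the two trained networks with hypothetical `(mask, eraser, image)` and `(amodal mask, image)` interfaces, the pairwise ordering rule (an object is judged occluded by a neighbor if PCNet-M grows its modal mask into that neighbor's region) paraphrases the abstract, and amodal completion here erases only an object's direct occluders, a simplification of the progressive scheme.

```python
import numpy as np
from scipy.ndimage import binary_dilation
from typing import Callable, List

# Hypothetical interfaces for the two trained networks (illustrative only):
#   pcnet_m(modal_mask, eraser_mask, image) -> soft completed mask in [0, 1]
#   pcnet_c(amodal_mask, image)             -> RGB content for the full object
MaskNet = Callable[[np.ndarray, np.ndarray, np.ndarray], np.ndarray]
ContentNet = Callable[[np.ndarray, np.ndarray], np.ndarray]

def neighbors(a: np.ndarray, b: np.ndarray) -> bool:
    """Two modal masks are neighbors if one touches the other's border."""
    return bool(np.logical_and(binary_dilation(a), b).any())

def recover_order(masks: List[np.ndarray], image: np.ndarray,
                  pcnet_m: MaskNet) -> np.ndarray:
    """Pairwise ordering recovery: occluded_by[i, j] is True when object j
    is inferred to lie on top of object i."""
    n = len(masks)
    occluded_by = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j or not neighbors(masks[i], masks[j]):
                continue
            # If completing mask i against mask j as the "eraser" makes it
            # grow into j's region, object j occludes object i.
            completed = pcnet_m(masks[i], masks[j], image) > 0.5
            occluded_by[i, j] = bool(np.logical_and(completed, masks[j]).any())
    return occluded_by

def amodal_completion(masks: List[np.ndarray], image: np.ndarray,
                      occluded_by: np.ndarray,
                      pcnet_m: MaskNet) -> List[np.ndarray]:
    """Complete each modal mask against the union of its direct occluders."""
    amodal = []
    for i, m in enumerate(masks):
        occluders = [masks[j] for j in np.flatnonzero(occluded_by[i])]
        if not occluders:
            amodal.append(m)  # unoccluded: modal mask is already amodal
            continue
        eraser = np.logical_or.reduce(occluders)
        amodal.append(pcnet_m(m, eraser, image) > 0.5)
    return amodal

def content_completion(amodal: List[np.ndarray], image: np.ndarray,
                       pcnet_c: ContentNet) -> List[np.ndarray]:
    """Fill in RGB appearance for the recovered invisible regions."""
    return [pcnet_c(m, image) for m in amodal]
```

Given modal masks and an image, chaining `recover_order`, `amodal_completion`, and `content_completion` yields an occlusion graph, completed amodal masks, and completed object appearances, which is what enables the manipulation and recomposition applications mentioned above.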