Paper Title
Learning Joint Spatial-Temporal Transformations for Video Inpainting
Paper Authors
Paper Abstract
High-quality video inpainting that completes missing regions in video frames is a promising yet challenging task. State-of-the-art approaches adopt attention models to complete a frame by searching missing contents from reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos. In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss. To show the superiority of the proposed model, we conduct both quantitative and qualitative evaluations by using standard stationary masks and more realistic moving object masks. Demo videos are available at https://github.com/researchmm/STTN.
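The core idea described above, filling holes in all input frames at once by letting patches from every frame attend to every other patch across space and time, can be illustrated with a minimal NumPy sketch. This is not the authors' STTN implementation (which uses learned multi-scale patch projections inside a transformer and is trained with a spatial-temporal adversarial loss); the function name, patch size, and identity query/key/value projections below are illustrative assumptions only.

```python
import numpy as np

def joint_spatial_temporal_attention(frames, patch=8):
    """Toy sketch: patches from ALL frames attend to each other, so a
    missing patch can borrow content from any location in any frame.
    frames: (T, H, W, C) float array; H and W divisible by `patch`."""
    T, H, W, C = frames.shape
    ph, pw = H // patch, W // patch
    # Split every frame into non-overlapping patches and flatten them
    # into one joint spatial-temporal token sequence of length T*ph*pw.
    tokens = (frames
              .reshape(T, ph, patch, pw, patch, C)
              .transpose(0, 1, 3, 2, 4, 5)
              .reshape(T * ph * pw, patch * patch * C))
    # Identity projections stand in for learned query/key/value layers.
    q, k, v = tokens, tokens, tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (N, N) patch affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all patches
    out = weights @ v                                # attention-averaged patches
    # Fold the attended tokens back into video frames.
    return (out.reshape(T, ph, pw, patch, patch, C)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(T, H, W, C))

# Example: 4 frames of a 32x32 RGB clip.
video = np.random.rand(4, 32, 32, 3).astype(np.float32)
print(joint_spatial_temporal_attention(video).shape)  # (4, 32, 32, 3)
```

Because the attention weights are computed jointly over every frame and every spatial location, the filled content for a hole is consistent across frames, which is the property the paper contrasts with frame-by-frame completion.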