TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation

Paper Title

TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation

Authors

Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

Abstract

Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames. State-of-the-art approaches usually adopt a two-step solution, which includes 1) generating locally-warped pixels by flow-based motion estimations, 2) blending the warped pixels to form a full frame through deep neural synthesis networks. However, due to the inconsistent warping from the two consecutive frames, the warped features for new frames are usually not aligned, which leads to distorted and blurred frames, especially when large and complex motions occur. To solve this issue, in this paper we propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI). In particular, we formulate the warped features with inconsistent motions as query tokens, and formulate relevant regions in a motion trajectory from two original consecutive frames into keys and values. Self-attention is learned on relevant tokens along the trajectory to blend the pristine features into intermediate frames through end-to-end training. Experimental results demonstrate that our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks. Both code and pre-trained models will be released soon.
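
The attention step described in the abstract can be illustrated with a minimal sketch. This is generic scaled dot-product attention in plain Python, not the authors' implementation: the function name `attend_along_trajectory`, the two-dimensional toy features, and the two-token trajectory are all assumptions made for illustration. A warped feature plays the role of the query, and features sampled along a motion trajectory in the two original frames play the role of keys and values.

```python
import math


def attend_along_trajectory(query, keys, values):
    """Blend `values` with softmax attention weights of `query` against `keys`.

    Hypothetical helper: `query` stands in for one warped feature token,
    `keys`/`values` for tokens taken along a motion trajectory.
    """
    dim = len(query)
    # Scaled dot-product similarity between the query and every key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    # Numerically stable softmax over the trajectory tokens.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value tokens, one output channel at a time.
    blended = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return blended, weights


# Toy data (assumed): the query matches the first trajectory token best.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
blended, weights = attend_along_trajectory(query, keys, values)
```

In this toy case the first trajectory token receives the larger weight, so the blended feature leans toward its value. A real model would learn the projections that produce queries, keys, and values and would operate on feature maps rather than short lists.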
