Paper Title
Scalable Neural Video Representations with Learnable Positional Features
Paper Authors
Paper Abstract
Succinct representation of complex signals using coordinate-based neural representations (CNRs) has seen great progress, and several recent efforts focus on extending them to videos. Here, the main challenge is how to (a) alleviate the compute inefficiency of training CNRs, (b) achieve high-quality video encoding, and (c) maintain parameter efficiency. To meet requirements (a), (b), and (c) simultaneously, we propose neural video representations with learnable positional features (NVP), a novel CNR that introduces "learnable positional features" to effectively amortize a video into latent codes. Specifically, we first present a CNR architecture built on 2D latent keyframes that learn the common video content along each spatio-temporal axis, which dramatically improves all three requirements. We then propose to utilize existing powerful image and video codecs as a compute-/memory-efficient compression procedure for the latent codes. We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also improves encoding quality from 34.07 to 34.57 (measured in PSNR), even while using $>$8 times fewer parameters. We also show intriguing properties of NVP, e.g., video inpainting and video frame interpolation.
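To make the abstract's core idea concrete, below is a minimal toy sketch (not the authors' implementation) of a coordinate-based representation with 2D latent keyframes: one learnable feature plane per spatio-temporal axis pair (xy, xt, yt), bilinearly interpolated at a query coordinate and decoded to RGB by a small MLP. All names, grid sizes, and the NumPy forward pass are illustrative assumptions; the actual NVP architecture also includes components (e.g., sparse positional features and codec-based latent compression) omitted here.

```python
import numpy as np

def bilinear_sample(grid, u, v):
    """Bilinearly interpolate an (H, W, C) feature grid at normalized (u, v) in [0, 1]."""
    H, W, _ = grid.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * grid[y0, x0] + wx * (1 - wy) * grid[y0, x1]
            + (1 - wx) * wy * grid[y1, x0] + wx * wy * grid[y1, x1])

class TinyKeyframeCNR:
    """Toy NVP-style model: three learnable 2D latent keyframes (xy, xt, yt).

    A pixel query (x, y, t) gathers one interpolated feature from each plane,
    concatenates them, and decodes to RGB with a 2-layer MLP. Hypothetical
    sizes; in practice these parameters would be trained to fit a video.
    """
    def __init__(self, res=16, feat_dim=4, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.kf_xy = rng.normal(0, 0.1, (res, res, feat_dim))  # shared across time
        self.kf_xt = rng.normal(0, 0.1, (res, res, feat_dim))  # shared across y
        self.kf_yt = rng.normal(0, 0.1, (res, res, feat_dim))  # shared across x
        self.W1 = rng.normal(0, 0.1, (3 * feat_dim, hidden))
        self.W2 = rng.normal(0, 0.1, (hidden, 3))

    def __call__(self, x, y, t):
        feats = np.concatenate([
            bilinear_sample(self.kf_xy, x, y),
            bilinear_sample(self.kf_xt, x, t),
            bilinear_sample(self.kf_yt, y, t),
        ])
        h = np.maximum(feats @ self.W1, 0.0)  # ReLU hidden layer
        return h @ self.W2                    # predicted RGB value

model = TinyKeyframeCNR()
rgb = model(0.3, 0.7, 0.5)  # query pixel at normalized (x, y, t)
print(rgb.shape)
```

Because each keyframe is a plain 2D feature image, it can in principle be compressed with off-the-shelf image codecs, which is the parameter-efficiency angle the abstract highlights.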