Paper Title
Scalable Neural Video Representations with Learnable Positional Features
Paper Authors
Paper Abstract
Succinct representation of complex signals using coordinate-based neural representations (CNRs) has seen great progress, and several recent efforts focus on extending them to videos. Here, the main challenge is how to (a) alleviate the compute inefficiency of training CNRs, (b) achieve high-quality video encoding, and (c) maintain parameter efficiency. To meet requirements (a), (b), and (c) simultaneously, we propose neural video representations with learnable positional features (NVP), a novel CNR that introduces "learnable positional features" to effectively amortize a video into latent codes. Specifically, we first present a CNR architecture built on 2D latent keyframes that learn the common video content along each spatio-temporal axis, which dramatically improves all three requirements. We then propose to utilize existing powerful image and video codecs as a compute-/memory-efficient compression procedure for the latent codes. We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also improves encoding quality from 34.07 to 34.57 (measured in PSNR), even while using $>$8 times fewer parameters. We also show intriguing properties of NVP, e.g., video inpainting and video frame interpolation.
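To make the abstract's core idea concrete, below is a minimal toy sketch (not the authors' implementation) of a coordinate-based representation with 2D latent keyframes: one learnable feature plane per spatio-temporal axis pair (xy, xt, yt), bilinearly interpolated at a query coordinate and decoded to RGB by a small MLP. All names, grid sizes, and the NumPy forward pass are illustrative assumptions; the actual NVP architecture also includes components (e.g., sparse positional features and codec-based latent compression) omitted here.

```python
import numpy as np

def bilinear_sample(grid, u, v):
    """Bilinearly interpolate an (H, W, C) feature grid at normalized (u, v) in [0, 1]."""
    H, W, _ = grid.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * grid[y0, x0] + wx * (1 - wy) * grid[y0, x1]
            + (1 - wx) * wy * grid[y1, x0] + wx * wy * grid[y1, x1])

class TinyKeyframeCNR:
    """Toy NVP-style model: three learnable 2D latent keyframes (xy, xt, yt).

    A pixel query (x, y, t) gathers one interpolated feature from each plane,
    concatenates them, and decodes to RGB with a 2-layer MLP. Hypothetical
    sizes; in practice these parameters would be trained to fit a video.
    """
    def __init__(self, res=16, feat_dim=4, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.kf_xy = rng.normal(0, 0.1, (res, res, feat_dim))  # shared across time
        self.kf_xt = rng.normal(0, 0.1, (res, res, feat_dim))  # shared across y
        self.kf_yt = rng.normal(0, 0.1, (res, res, feat_dim))  # shared across x
        self.W1 = rng.normal(0, 0.1, (3 * feat_dim, hidden))
        self.W2 = rng.normal(0, 0.1, (hidden, 3))

    def __call__(self, x, y, t):
        feats = np.concatenate([
            bilinear_sample(self.kf_xy, x, y),
            bilinear_sample(self.kf_xt, x, t),
            bilinear_sample(self.kf_yt, y, t),
        ])
        h = np.maximum(feats @ self.W1, 0.0)  # ReLU hidden layer
        return h @ self.W2                    # predicted RGB value

model = TinyKeyframeCNR()
rgb = model(0.3, 0.7, 0.5)  # query pixel at normalized (x, y, t)
print(rgb.shape)
```

Because each keyframe is a plain 2D feature image, it can in principle be compressed with off-the-shelf image codecs, which is the parameter-efficiency angle the abstract highlights.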