Paper Title
Temporally Consistent Transformers for Video Generation
Paper Authors
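Wilson Yan, Danijar Hafner, Stephen James, Pieter Abbeel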
Paper Abstract
To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world. Current algorithms enable accurate predictions over short horizons but tend to suffer from temporal inconsistencies. When generated content goes out of view and is later revisited, the model invents different content instead. Despite this severe limitation, no established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies. In this paper, we curate 3 challenging video datasets with long-range dependencies by rendering walks through 3D scenes of procedural mazes, Minecraft worlds, and indoor scans. We perform a comprehensive evaluation of current models and observe their limitations in temporal consistency. Moreover, we introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time. By compressing its input sequence into fewer embeddings, applying a temporal transformer, and expanding back using a spatial MaskGit, TECO outperforms existing models across many metrics. Videos are available on the website: https://wilson1yan.github.io/teco
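The abstract's last sentence summarizes the TECO pipeline: compress each frame's tokens into fewer embeddings, model dynamics with a temporal transformer, and expand back to per-frame tokens with a spatial MaskGit. The sketch below illustrates that data flow only; every concrete choice here (PyTorch, the class name TECOSketch, linear pooling for compression, a single spatial layer standing in for MaskGit's iterative masked decoding, and all sizes) is an assumption made for illustration, not the paper's implementation.

import torch
import torch.nn as nn

class TECOSketch(nn.Module):
    """Illustrative only: compress -> temporal transformer -> spatial expand."""

    def __init__(self, dim=256, frame_tokens=64, compressed_tokens=4, heads=8):
        super().__init__()
        self.k = compressed_tokens
        # Compress each frame's token grid into a few embeddings
        # (assumed here: a learned linear pooling over the token axis).
        self.compress = nn.Linear(frame_tokens, compressed_tokens)
        # Causal temporal transformer over the compressed sequence.
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        # Expand back to per-frame tokens, then refine with one spatial
        # attention layer; the real model would instead run a MaskGit
        # decoder that predicts masked tokens in parallel over iterations.
        self.expand = nn.Linear(compressed_tokens, frame_tokens)
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, frame_embs):
        # frame_embs: (batch, time, frame_tokens, dim)
        B, T, N, D = frame_embs.shape
        z = self.compress(frame_embs.transpose(2, 3)).transpose(2, 3)  # (B, T, K, D)
        z = z.reshape(B, T * self.k, D)
        mask = nn.Transformer.generate_square_subsequent_mask(T * self.k)
        z = self.temporal(z, mask=mask)                  # causal attention over time
        z = z.reshape(B, T, self.k, D)
        x = self.expand(z.transpose(2, 3)).transpose(2, 3)             # (B, T, N, D)
        x = self.spatial(x.reshape(B * T, N, D)).reshape(B, T, N, D)
        return x

model = TECOSketch()
video_tokens = torch.randn(2, 16, 64, 256)  # two dummy 16-frame token sequences
out = model(video_tokens)                   # (2, 16, 64, 256)

A block-causal mask, in which all tokens of a frame attend to each other and to past frames, would be more faithful than the plain token-level causal mask used above; the simpler mask keeps the sketch short.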