Paper Title

Compositional Video Synthesis with Action Graphs

Paper Authors

Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson

Paper Abstract


Videos of actions are complex signals containing rich compositional structure in space and time. Current video generation methods lack the ability to condition generation on multiple coordinated and potentially simultaneous timed actions. To address this challenge, we propose to represent actions in a graph structure called an Action Graph and present the new "Action Graph To Video" synthesis task. Our generative model for this task (AG2Vid) disentangles motion and appearance features and, by incorporating a scheduling mechanism for actions, facilitates timely and coordinated video generation. We train and evaluate AG2Vid on the CATER and Something-Something V2 datasets, and show that the resulting videos have better visual quality and semantic consistency compared to baselines. Finally, our model demonstrates zero-shot abilities by synthesizing novel compositions of the learned actions. For code and pretrained models, see the project page: https://roeiherz.github.io/AG2Video
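To make the Action Graph idea concrete, here is a minimal illustrative sketch. The abstract only states that actions are represented as a graph with timed, potentially overlapping actions; the exact schema, class names, and action labels below (`ActionEdge`, `slide`, `rotate`, frame indices) are assumptions for illustration, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: we assume graph nodes are objects and edges are
# timed actions, so several actions can be scheduled simultaneously.

@dataclass(frozen=True)
class ActionEdge:
    action: str   # action label, e.g. "slide" (hypothetical)
    source: str   # acting object node
    target: str   # object node acted upon (may equal source)
    start: int    # start frame of the action
    end: int      # end frame of the action

@dataclass
class ActionGraph:
    objects: list[str] = field(default_factory=list)
    actions: list[ActionEdge] = field(default_factory=list)

    def active_at(self, t: int) -> list[ActionEdge]:
        """Actions scheduled to be in progress at frame t."""
        return [a for a in self.actions if a.start <= t < a.end]

g = ActionGraph(objects=["cone", "ball"])
g.actions.append(ActionEdge("slide", "cone", "cone", 0, 10))
g.actions.append(ActionEdge("rotate", "ball", "ball", 5, 15))
print([a.action for a in g.active_at(7)])  # → ['slide', 'rotate']
```

The `active_at` query mirrors what a scheduling mechanism needs at generation time: for each frame, which actions should be driving the motion, allowing overlapping actions to be coordinated rather than generated one at a time.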
