Paper Title

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Paper Authors

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang

Abstract

Large-scale pretrained transformers have created milestones in text generation (GPT-3) and text-to-image generation (DALL-E and CogView). Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak text-video relevance of existing datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in both machine and human evaluations.
