Paper Title
AutoTransition: Learning to Recommend Video Transition Effects
Paper Authors
Paper Abstract
Video transition effects are widely used in video editing to connect shots and create cohesive, visually appealing videos. However, it is challenging for non-professionals to choose the best transitions due to a lack of cinematographic knowledge and design skills. In this paper, we present the first work on automatic video transition recommendation (VTR): given a sequence of raw video shots and accompanying audio, recommend a video transition for each pair of neighboring shots. To solve this task, we collect a large-scale video transition dataset from publicly available video templates on editing software. We then formulate VTR as a multi-modal retrieval problem from vision/audio to video transitions and propose a novel multi-modal matching framework consisting of two parts. First, we learn embeddings of video transitions through a video transition classification task. Then, we propose a model that learns the matching correspondence from vision/audio inputs to video transitions. Specifically, the proposed model employs a multi-modal transformer to fuse vision and audio information and to capture context cues in the sequential transition outputs. Through both quantitative and qualitative experiments, we clearly demonstrate the effectiveness of our method. Notably, in a comprehensive user study, our method receives scores comparable to those of professional editors while improving video editing efficiency by 300×. We hope our work inspires other researchers to take on this new task. The dataset and code are publicly available at https://github.com/acherstyx/AutoTransition.
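The retrieval formulation described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the learned transition embeddings are replaced by a random table, and the multi-modal transformer is replaced by a simple linear fusion of concatenated vision/audio features. All names, dimensions, and weights here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_TRANSITIONS = 30   # size of the transition vocabulary (illustrative)
DIM = 64             # shared embedding dimension (illustrative)

# Stand-in for step 1: transition embeddings. In the paper these are
# learned via a transition classification task; here, a random table.
transition_emb = rng.normal(size=(N_TRANSITIONS, DIM))
transition_emb /= np.linalg.norm(transition_emb, axis=1, keepdims=True)

def fuse(vision_feat, audio_feat, w):
    """Stand-in for the multi-modal transformer: a linear projection of
    concatenated vision and audio features into the transition space."""
    x = np.concatenate([vision_feat, audio_feat])
    z = w @ x
    return z / np.linalg.norm(z)

def recommend(vision_feat, audio_feat, w, k=3):
    """Rank transitions by cosine similarity to the fused query and
    return the indices of the top-k candidates."""
    q = fuse(vision_feat, audio_feat, w)
    scores = transition_emb @ q   # cosine similarity (unit vectors)
    return np.argsort(-scores)[:k]

# One pair of neighboring shots: vision + audio features (illustrative).
vision = rng.normal(size=128)
audio = rng.normal(size=32)
w = rng.normal(size=(DIM, 128 + 32))  # untrained projection weights

top_k = recommend(vision, audio, w)
print(top_k)  # indices of the top-3 recommended transitions
```

In the actual framework, the fusion step is a multi-modal transformer operating over the whole shot sequence, so each recommendation can also depend on the transitions chosen for neighboring cuts; the sketch above only captures the per-pair retrieval view.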