Progressive Video Summarization via Multimodal Self-supervised Learning

Paper title

Progressive Video Summarization via Multimodal Self-supervised Learning

Authors

Haopeng Li, Qiuhong Ke, Mingming Gong, Tom Drummond

Abstract

Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, which easily leads to over-fitting of the deep models. Considering that the annotation of large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task. Specifically, the self-supervised learning is conducted by exploring the semantic consistency between the videos and text in both coarse-grained and fine-grained fashions, as well as by recovering masked frames in the videos. The multimodal framework is trained on a newly collected dataset that consists of video-text pairs. Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. Extensive experiments demonstrate the effectiveness and superiority of our method in terms of rank correlation coefficients and F-score.
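The two evaluation measures named in the abstract can be illustrated with a minimal sketch. The abstract does not say which rank correlation coefficients are used; Kendall's tau is assumed here as one common choice, in its simplest tau-a form. The function names `kendall_tau` and `f_score` and the toy inputs are hypothetical, and the paper's actual evaluation protocol may differ.

```python
from itertools import combinations


def kendall_tau(pred, ref):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(pred) != len(ref) or len(pred) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        # Positive product: both lists order frames i and j the same way.
        sign = (pred[i] - pred[j]) * (ref[i] - ref[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


def f_score(pred_frames, ref_frames):
    """F-score (harmonic mean of precision and recall) between two sets
    of selected frame indices."""
    pred_frames, ref_frames = set(pred_frames), set(ref_frames)
    overlap = len(pred_frames & ref_frames)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_frames)
    recall = overlap / len(ref_frames)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predicted importance scores vs. human annotations for 4 frames.
tau = kendall_tau([0.1, 0.4, 0.3, 0.9], [1, 3, 2, 4])
# Hypothetical selected frames: predicted summary vs. reference summary.
f1 = f_score({1, 2, 3}, {2, 3, 4})
```

Rank correlation compares the predicted importance ordering of frames against human annotations directly, while the F-score measures overlap between the selected summary and a reference summary.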
