Paper Title

Multi-modal Summarization for Video-containing Documents

Authors

Xiyan Fu, Jun Wang, Zhenglu Yang

Abstract


Summarization of multimedia data is becoming increasingly significant as it underpins many real-world applications, such as question answering and Web search. However, most existing multi-modal summarization works have used visual complementary features extracted from images rather than videos, thereby losing abundant information. Hence, we propose a novel multi-modal summarization task that summarizes from a document and its associated video. In this work, we also build a baseline general model with effective strategies, i.e., bi-hop attention and an improved late fusion mechanism to bridge the gap between different modalities, and a bi-stream summarization strategy to perform text and video summarization simultaneously. Comprehensive experiments show that the proposed model is beneficial for multi-modal summarization and superior to existing methods. Moreover, we collect a novel dataset, which provides a new resource for future research on summarizing from documents and videos.
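The abstract names two of the model's strategies: bi-hop attention over the other modality and late fusion of the resulting representations. As a rough illustration only, the sketch below shows one generic way such a mechanism can be wired up (two successive dot-product attention hops over video frame features conditioned on a text vector, then a weighted combination); the function names, shapes, and the fixed fusion weight `alpha` are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys):
    """Dot-product attention: weight `keys` by their similarity to `query`."""
    scores = keys @ query                    # (num_frames,)
    weights = softmax(scores)
    return weights @ keys                    # (dim,) weighted sum of keys

def bi_hop_fuse(text_vec, video_feats, alpha=0.5):
    """Hypothetical bi-hop attention + late fusion sketch.

    Hop 1 attends over video features with the text vector as query;
    hop 2 re-queries the video features with the hop-1 result. The two
    modalities are then fused late with a fixed weight `alpha`.
    The paper's actual formulation may differ.
    """
    hop1 = attend(text_vec, video_feats)     # first attention hop
    hop2 = attend(hop1, video_feats)         # second hop refines the query
    return alpha * text_vec + (1 - alpha) * hop2

rng = np.random.default_rng(0)
text_vec = rng.standard_normal(8)            # one text representation
video_feats = rng.standard_normal((5, 8))    # 5 frame-level features
fused = bi_hop_fuse(text_vec, video_feats)
print(fused.shape)                           # (8,)
```

In a trained model the attention would use learned projections and the fusion weight would typically be learned (e.g., a gate) rather than fixed; this sketch only conveys the two-hop-then-fuse control flow.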
