Paper Title
TLDW: Extreme Multimodal Summarisation of News Videos
Paper Authors
Paper Abstract
Multimodal summarisation with multimodal output is drawing increasing attention due to the rapid growth of multimedia data. While several methods have been proposed to summarise visual-text content, their multimodal outputs are not succinct enough at an extreme level to address the information-overload issue. Toward extreme multimodal summarisation, we introduce a new task, eXtreme Multimodal Summarisation with Multimodal Output (XMSMO), for the scenario of TL;DW (Too Long; Didn't Watch), akin to TL;DR. XMSMO aims to summarise a video-document pair into an extremely short summary consisting of one cover frame as the visual summary and one sentence as the textual summary. We propose a novel unsupervised Hierarchical Optimal Transport Network (HOT-Net) consisting of three components: hierarchical multimodal encoders, hierarchical multimodal fusion decoders, and optimal transport solvers. Our method is trained, without reference summaries, by optimising visual and textual coverage measured as the distance between semantic distributions under optimal transport plans. To facilitate research on this task, we collect XMSMO-News, a large-scale dataset of 4,891 video-document pairs. Experimental results show that our method achieves promising performance in terms of ROUGE and IoU metrics.
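To make the coverage objective concrete, below is a minimal PyTorch sketch of an entropic-regularised (Sinkhorn) optimal transport distance between two sets of embeddings, used as an unsupervised coverage loss. This is not the authors' HOT-Net implementation: the cosine cost, uniform marginals, Sinkhorn solver, and all shapes and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of an OT-based coverage loss; assumptions noted above.
import torch
import torch.nn.functional as F


def sinkhorn_coverage_loss(source, summary, eps=0.1, n_iters=50):
    """Entropic-regularised OT distance between two embedding sets.

    source:  (n, d) embeddings of the full video/document units
    summary: (m, d) embeddings of the candidate summary units
    Returns a scalar transport cost; a lower cost means the summary's
    semantic distribution covers the source's more closely.
    """
    # Pairwise cosine cost between source and summary embeddings, (n, m).
    cost = 1.0 - F.normalize(source, dim=-1) @ F.normalize(summary, dim=-1).T

    n, m = cost.shape
    log_a = torch.full((n,), 1.0 / n).log()  # uniform source marginal
    log_b = torch.full((m,), 1.0 / m).log()  # uniform summary marginal

    # Log-domain Sinkhorn iterations for numerical stability.
    f = torch.zeros(n)
    g = torch.zeros(m)
    for _ in range(n_iters):
        f = eps * (log_a - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f[:, None] - cost) / eps, dim=0))

    # Optimal transport plan and the total cost it incurs.
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps)
    return (plan * cost).sum()


# Toy usage: document-unit embeddings vs. a one-unit (extreme) summary.
doc_emb = torch.randn(120, 256)
sum_emb = torch.randn(1, 256)
loss = sinkhorn_coverage_loss(doc_emb, sum_emb)  # scalar to minimise
```

Because the loss compares distributions of embeddings rather than matching against a reference summary, it can, as the abstract describes, be minimised without ground-truth summaries.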