Paper Title
Multi-modal Dense Video Captioning
Paper Authors
Paper Abstract
Dense video captioning is the task of localizing interesting events in an untrimmed video and producing a textual description (caption) for each localized event. Most previous works on dense video captioning are based solely on visual information and completely ignore the audio track. However, audio, and speech in particular, are vital cues for a human observer in understanding an environment. In this paper, we present a new dense video captioning approach that is able to utilize any number of modalities for event description. Specifically, we show how audio and speech modalities may improve a dense video captioning model. We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track. We formulate the captioning task as a machine translation problem and utilize the recently proposed Transformer architecture to convert multi-modal input data into textual descriptions. We demonstrate the performance of our model on the ActivityNet Captions dataset. The ablation studies indicate a considerable contribution from the audio and speech components, suggesting that these modalities contain substantial information complementary to the video frames. Furthermore, we provide an in-depth analysis of the ActivityNet Captions results by leveraging the category tags obtained from the original YouTube videos. Code is publicly available: github.com/v-iashin/MDVC
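To make the multi-modal "translation" idea concrete, below is a minimal sketch of a captioner with one Transformer encoder per modality (video features, audio features, ASR tokens) and a shared decoder that generates caption tokens. All class names, feature dimensions, layer counts, and the concatenation-based fusion are illustrative assumptions for this sketch, not the authors' architecture; the actual implementation is at github.com/v-iashin/MDVC.

```
# Hypothetical sketch: per-modality Transformer encoders + shared caption decoder.
import torch
import torch.nn as nn


class MultiModalCaptioner(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000,
                 video_dim=1024, audio_dim=128, speech_vocab=10000):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.speech_emb = nn.Embedding(speech_vocab, d_model)  # ASR (subtitle) tokens
        self.caption_emb = nn.Embedding(vocab_size, d_model)

        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # One encoder per modality; a single decoder attends to the fused memory.
        self.video_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.audio_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.speech_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, audio_feats, speech_tokens, caption_tokens):
        # Encode each modality independently for the proposed event segment.
        v = self.video_enc(self.video_proj(video_feats))     # (B, Tv, d)
        a = self.audio_enc(self.audio_proj(audio_feats))     # (B, Ta, d)
        s = self.speech_enc(self.speech_emb(speech_tokens))  # (B, Ts, d)
        # Fuse by concatenating along time (one simple choice among many).
        memory = torch.cat([v, a, s], dim=1)
        tgt = self.caption_emb(caption_tokens)
        # Causal mask: each caption position attends only to earlier tokens.
        tc = tgt.size(1)
        mask = torch.triu(torch.full((tc, tc), float("-inf")), diagonal=1)
        dec = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(dec)                                  # (B, Tc, vocab)


# Toy usage with random features for a single event proposal.
model = MultiModalCaptioner()
logits = model(torch.randn(1, 32, 1024),          # video frame features
               torch.randn(1, 48, 128),           # audio features
               torch.randint(0, 10000, (1, 20)),  # ASR (speech) tokens
               torch.randint(0, 10000, (1, 15)))  # caption tokens so far
print(logits.shape)  # torch.Size([1, 15, 10000])
```

In this sketch the modalities are fused by simply concatenating the encoder outputs before decoding; other fusion schemes (e.g. separate cross-attention streams per modality) are equally compatible with the same overall encoder-decoder formulation described in the abstract.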