Paper Title
MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
Paper Authors
Paper Abstract
Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on. The richness ensures we are making progress along the core challenges. To this end, we present MUGEN, a large-scale video-audio-text dataset collected using the open-source platform game CoinRun [11]. We make substantial modifications to enrich the game, introducing audio and enabling new interactions. We train RL agents with different objectives to navigate the game and interact with 13 objects and characters, which lets us automatically extract a large collection of diverse videos with associated audio. We sample 375K video clips (3.2s each) and collect text descriptions from human annotators. Each video carries additional annotations extracted automatically from the game engine, such as accurate per-frame semantic maps and templated textual descriptions. Altogether, MUGEN can help progress research on many tasks in multimodal understanding and generation. We benchmark representative approaches on tasks involving video-audio-text retrieval and generation. Our dataset and code are released at: https://mugen-org.github.io/.
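To make the annotation structure described above concrete, the following is a minimal, hypothetical sketch of what one dataset record could look like: a 3.2s video clip paired with audio, a human-written caption, an engine-generated templated caption, and per-frame semantic maps. The class name `MugenClip`, the field names, and the array shapes are illustrative assumptions, not the released dataset's actual schema or API.

```python
# Hypothetical sketch of a MUGEN-style record (not the official MUGEN API).
# Shapes and field names are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class MugenClip:
    video: np.ndarray          # (T, H, W, 3) uint8 RGB frames, ~3.2s of gameplay
    audio: np.ndarray          # (S,) float32 mono waveform for the same window
    human_caption: str         # text description collected from a human annotator
    templated_caption: str     # description auto-generated by the game engine
    semantic_maps: np.ndarray  # (T, H, W) integer labels over objects/characters

# Illustrative dummy sample: 32 frames at 256x256 with 16 kHz audio
# (frame count, resolution, and sample rate are assumed values).
clip = MugenClip(
    video=np.zeros((32, 256, 256, 3), dtype=np.uint8),
    audio=np.zeros(int(3.2 * 16000), dtype=np.float32),
    human_caption="Mugen jumps over a gem and collects a coin.",
    templated_caption="Mugen collects a coin.",
    semantic_maps=np.zeros((32, 256, 256), dtype=np.int64),
)
print(clip.video.shape, clip.audio.shape)
```

A record of this shape supports all three benchmark families in the abstract: the (video, audio, caption) triples drive retrieval and generation across modalities, and the semantic maps provide dense supervision for understanding tasks.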