Paper Title
Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities
Paper Authors
Paper Abstract
Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer whether events across textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist because the same events are referred to on many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and "airplane shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method, which demonstrates improved performance on this task and highlights opportunities for future research.
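To make the hierarchical, cross-modal relation concrete, the following is a minimal sketch of how a parent event with subevents from different modalities might be represented. This is purely illustrative: the Event class, its fields, and the add_subevent helper are hypothetical and are not the paper's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical representation of a cross-modal event hierarchy,
# based on the "war" example from the abstract. Not the authors' schema.

@dataclass
class Event:
    mention: str                      # surface form, e.g. "war", "tanks firing"
    modality: str                     # "text" or "video"
    parent: Optional["Event"] = None
    children: List["Event"] = field(default_factory=list)

    def add_subevent(self, child: "Event") -> None:
        # Link a lower-semantic-level subevent to this (more abstract) event.
        child.parent = self
        self.children.append(child)

# The abstract "war" event manifests through subevents in both modalities.
war = Event("war", modality="text")
war.add_subevent(Event("tanks firing", modality="video"))
war.add_subevent(Event("airplane shot", modality="text"))

for sub in war.children:
    print(f"{war.mention} -> {sub.mention} ({sub.modality})")
```

Note that, unlike grounding (which only links identical events across modalities), this structure records parent-child relations between events at different semantic levels, which is what the proposed extraction task targets.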