Paper Title
Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos
Paper Authors
Paper Abstract
In videos that contain actions performed unintentionally, agents do not achieve their desired goals. In such videos, it is challenging for computer vision systems to understand high-level concepts such as goal-directed behavior, an ability present in humans from a very early age. Instilling this ability in artificially intelligent agents would make them better social learners by allowing them to evaluate human action under a teleological lens. To validate the ability of deep learning models to perform this task, we curate the W-Oops dataset, built upon the Oops dataset [15]. W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations. Since temporal segment annotation is expensive, we propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions in a video, leveraging only video-level labels. In particular, we employ an attention-based strategy that predicts the temporal regions contributing most to the classification task. Meanwhile, our overlap regularization encourages the model to focus on distinct portions of the video when inferring the goal-directed and unintentional activities, while guaranteeing their temporal ordering. Extensive quantitative experiments verify the effectiveness of our localization method. We further conduct a video captioning experiment which demonstrates that the proposed localization module does indeed assist teleological action understanding.
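To make the described approach more concrete, below is a minimal PyTorch sketch of what an attention-based weakly supervised localizer with an overlap regularizer might look like. All module names, feature dimensions, and the specific form of the regularization terms are illustrative assumptions for exposition, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeaklySupervisedLocalizer(nn.Module):
    """Sketch: two attention branches score per-segment features, one for
    goal-directed and one for unintentional activity, and attention-weighted
    pooling yields video-level logits trained with only video-level labels."""

    def __init__(self, feat_dim=1024, n_goal_classes=44, n_unint_classes=30):
        super().__init__()
        self.goal_attn = nn.Linear(feat_dim, 1)    # attention over segments (goal-directed)
        self.unint_attn = nn.Linear(feat_dim, 1)   # attention over segments (unintentional)
        self.goal_cls = nn.Linear(feat_dim, n_goal_classes)
        self.unint_cls = nn.Linear(feat_dim, n_unint_classes)

    def forward(self, feats):
        # feats: (T, D) per-segment features from a pretrained video backbone.
        a_g = torch.softmax(self.goal_attn(feats).squeeze(-1), dim=0)   # (T,)
        a_u = torch.softmax(self.unint_attn(feats).squeeze(-1), dim=0)  # (T,)
        # Attention-weighted pooling -> one video-level logit vector per task.
        goal_logits = self.goal_cls((a_g.unsqueeze(-1) * feats).sum(dim=0))
        unint_logits = self.unint_cls((a_u.unsqueeze(-1) * feats).sum(dim=0))
        return goal_logits, unint_logits, a_g, a_u


def overlap_regularizer(a_g, a_u):
    """Hypothetical overlap regularization: penalize attention mass shared by
    the two branches, and penalize orderings where the unintentional
    attention's center of mass precedes the goal-directed one."""
    overlap = (a_g * a_u).sum()
    t = torch.arange(a_g.numel(), dtype=a_g.dtype, device=a_g.device)
    ordering = F.relu((t * a_g).sum() - (t * a_u).sum())
    return overlap + ordering
```

In such a setup, the total training loss would presumably combine the two video-level classification losses with the regularizer above, and at inference the attention weights could be thresholded to recover the localized goal-directed and unintentional temporal regions.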