Paper Title

Learning to Localize Actions from Moments

Paper Authors

Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, Tao Mei

Paper Abstract

With the knowledge of action moments (i.e., trimmed video clips, each containing an action instance), humans can routinely localize an action temporally in an untrimmed video. Nevertheless, most practical methods still require all training videos to be labeled with temporal annotations (action category and temporal boundary) and develop the models in a fully-supervised manner, despite the expensive labeling effort and the inapplicability to new categories. In this paper, we introduce a new transfer-learning design to learn action localization for a large set of action categories, using only action moments from the categories of interest and temporal annotations of untrimmed videos from a small set of action classes. Specifically, we present Action Herald Networks (AherNet), which integrate this design into a one-stage action localization framework. Technically, a weight transfer function is uniquely devised to build the transformation between the classification of action moments or foreground video segments and action localization in synthetic contextual moments or untrimmed videos. The context of each moment is learned through an adversarial mechanism that differentiates the generated features from those of the background in untrimmed videos. Extensive experiments are conducted on transfer learning both across the splits of ActivityNet v1.3 and from THUMOS14 to ActivityNet v1.3. Our AherNet demonstrates superiority even when compared to most fully-supervised action localization methods. More remarkably, we train AherNet to localize actions from 600 categories by leveraging action moments in Kinetics-600 and temporal annotations from 200 classes in ActivityNet v1.3. Source code and data are available at \url{https://github.com/FuchenUSTC/AherNet}.
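To make the two core ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: (1) a weight transfer function that maps per-class classification weights, learnable from trimmed action moments, into localization-head weights, and (2) an adversarial discriminator that distinguishes synthesized "context" features of a moment from real background features in untrimmed videos. All module names, feature sizes, and the wiring shown here are assumptions for illustration only; consult the released code at the URL above for the actual architecture.

```python
# Hedged sketch only: illustrates weight transfer and a context discriminator.
# Dimensions, module names, and wiring are assumptions, not AherNet's exact design.
import torch
import torch.nn as nn


class WeightTransfer(nn.Module):
    """Maps a class's classification weight vector to localization-head weights."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # A small MLP as the transfer function; the paper's function may differ.
        self.transfer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, cls_weight: torch.Tensor) -> torch.Tensor:
        # cls_weight: (num_classes, feat_dim) classifier weights from moment-level training.
        return self.transfer(cls_weight)  # (num_classes, feat_dim) localization weights


class ContextDiscriminator(nn.Module):
    """Separates synthesized context features from real background features."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),  # logit: real background vs. generated context
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.net(feat)


# Toy usage with random tensors: score temporal segments of an untrimmed video
# using weights transferred from a moment-level classifier.
feat_dim, num_classes, num_segments = 512, 600, 128
segment_feat = torch.randn(num_segments, feat_dim)   # per-segment video features
cls_weight = torch.randn(num_classes, feat_dim)      # moment classifier weights

transfer = WeightTransfer(feat_dim)
loc_weight = transfer(cls_weight)                    # transferred localization weights
class_scores = segment_feat @ loc_weight.t()         # (num_segments, num_classes)

disc = ContextDiscriminator(feat_dim)
fake_context = torch.randn(num_segments, feat_dim)   # stand-in for synthesized context
adv_logit = disc(fake_context)                       # fed to an adversarial loss in training
```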
