Title
SF-Net: Single-Frame Supervision for Temporal Action Localization
Authors
Abstract
In this paper, we study an intermediate form of supervision, i.e., single-frame supervision, for temporal action localization (TAL). To obtain the single-frame supervision, the annotators are asked to identify only a single frame within the temporal window of an action. This can significantly reduce the labor cost of obtaining full supervision, which requires annotating the action boundary. Compared to the weak supervision that only annotates the video-level label, the single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead. To make full use of such single-frame supervision, we propose a unified system called SF-Net. First, we propose to predict an actionness score for each video frame. Along with a typical category score, the actionness score can provide comprehensive information about the occurrence of a potential action and aid the temporal boundary refinement during inference. Second, we mine pseudo action and background frames based on the single-frame annotations. We identify pseudo action frames by adaptively expanding each annotated single frame to its nearby, contextual frames, and we mine pseudo background frames from all the unannotated frames across multiple videos. Together with the ground-truth labeled frames, these pseudo-labeled frames are further used for training the classifier. In extensive experiments on THUMOS14, GTEA, and BEOID, SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization. Notably, SF-Net achieves comparable results to its fully-supervised counterpart, which requires much more resource-intensive annotations. The code is available at https://github.com/Flowerfan/SF-Net.
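The pseudo-labeling idea described above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function name, thresholds, and the use of per-frame class probabilities as a stand-in for the learned actionness score are all assumptions for illustration. It expands each annotated frame to neighboring frames that still score highly for the annotated class, and marks the lowest-scoring unannotated frames as pseudo background.

```python
import numpy as np

def mine_pseudo_frames(class_scores, annotations, expand_thresh=0.5, bg_ratio=0.05):
    """Illustrative sketch of single-frame pseudo-label mining.

    class_scores: (T, C) array of per-frame class probabilities (hypothetical input).
    annotations:  list of (frame_idx, class_idx) single-frame labels.
    Returns two dicts mapping frame index -> pseudo label (-1 denotes background).
    """
    T, _ = class_scores.shape
    pseudo_action = {}
    for t0, c in annotations:
        pseudo_action[t0] = c
        # Expand left and right while the neighbor still scores high for class c.
        for step in (-1, 1):
            t = t0 + step
            while 0 <= t < T and class_scores[t, c] >= expand_thresh:
                pseudo_action[t] = c
                t += step
    # Mine pseudo background: unannotated frames with the lowest maximum
    # class score (a crude proxy for low "actionness").
    unlabeled = [t for t in range(T) if t not in pseudo_action]
    unlabeled.sort(key=lambda t: class_scores[t].max())
    k = max(1, int(bg_ratio * T))
    pseudo_background = {t: -1 for t in unlabeled[:k]}
    return pseudo_action, pseudo_background
```

In the actual system, the expansion is adaptive and the mined frames are fed back to train the classifier jointly with the ground-truth labeled frames; this sketch only shows the frame-selection logic under fixed thresholds.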