Paper Title
Structured Attention Composition for Temporal Action Localization
Paper Authors
Paper Abstract
Temporal action localization aims at localizing action instances from untrimmed videos. Existing works have designed various effective modules to precisely localize action instances based on appearance and motion features. However, by treating these two kinds of features with equal importance, previous works cannot take full advantage of each modality feature, making the learned model still sub-optimal. To tackle this issue, we make an early effort to study temporal action localization from the perspective of multi-modality feature learning, based on the observation that different actions exhibit specific preferences for the appearance or motion modality. Specifically, we build a novel structured attention composition module. Unlike conventional attention, the proposed module does not infer frame attention and modality attention independently. Instead, by casting the relationship between the modality attention and the frame attention as an attention assignment process, the structured attention composition module learns to encode the frame-modality structure and uses it to regularize the inferred frame attention and modality attention, respectively, based on optimal transport theory. The final frame-modality attention is obtained by the composition of the two individual attentions. The proposed structured attention composition module can be deployed as a plug-and-play module into existing action localization frameworks. Extensive experiments on two widely used benchmarks show that the proposed structured attention composition consistently improves four state-of-the-art temporal action localization methods and builds new state-of-the-art performance on THUMOS14. Code is available at https://github.com/VividLe/Structured-Attention-Composition.
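To make the described mechanism concrete, below is a minimal, self-contained PyTorch sketch of the idea as the abstract presents it: frame attention and modality attention are inferred from concatenated appearance and motion features, coupled through an entropy-regularized optimal transport (Sinkhorn) assignment whose marginals are the two attention distributions, and the resulting transport plan serves as the composed frame-modality attention. All names here (`StructuredAttentionComposition`, `sinkhorn`, the linear attention heads, and the choice of assignment cost) are illustrative assumptions, not the authors' exact formulation; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sinkhorn(cost, row_marginal, col_marginal, eps=0.05, n_iters=50):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost:          (T, M) assignment cost between T frames and M modalities.
    row_marginal:  (T,) frame-attention distribution (sums to 1).
    col_marginal:  (M,) modality-attention distribution (sums to 1).
    Returns a transport plan of shape (T, M) whose marginals approximately
    match the two attention distributions.
    """
    K = torch.exp(-cost / eps)  # Gibbs kernel
    u = torch.ones_like(row_marginal)
    v = torch.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (K @ v + 1e-8)
        v = col_marginal / (K.t() @ u + 1e-8)
    return u.unsqueeze(1) * K * v.unsqueeze(0)


class StructuredAttentionComposition(nn.Module):
    """Sketch: compose frame attention and modality attention via OT.

    Takes per-frame appearance and motion features of shape (T, D) each,
    infers the two individual attentions, couples them through an OT
    assignment, and uses the transport plan as the joint frame-modality
    attention to reweight and fuse the two feature streams.
    """

    def __init__(self, dim):
        super().__init__()
        self.frame_att = nn.Linear(2 * dim, 1)     # per-frame score
        self.modality_att = nn.Linear(2 * dim, 2)  # appearance/motion scores

    def forward(self, app, mot):                   # app, mot: (T, D)
        x = torch.cat([app, mot], dim=-1)          # (T, 2D)
        frame_a = F.softmax(self.frame_att(x).squeeze(-1), dim=0)    # (T,)
        modality_a = F.softmax(self.modality_att(x).mean(0), dim=0)  # (2,)

        # Couple the two attentions: use negative per-frame modality scores
        # as the assignment cost, and solve an OT problem whose marginals
        # are the two attention distributions. The plan is a joint
        # distribution over frame-modality pairs.
        cost = -self.modality_att(x)                # (T, 2)
        plan = sinkhorn(cost, frame_a, modality_a)  # (T, 2)

        feats = torch.stack([app, mot], dim=1)      # (T, 2, D)
        return (plan.unsqueeze(-1) * feats).sum(1)  # (T, D) fused features


# Example usage: 100 frames, 256-d appearance and motion features.
sac = StructuredAttentionComposition(dim=256)
fused = sac(torch.randn(100, 256), torch.randn(100, 256))  # (100, 256)
```

The key design point the abstract emphasizes is that the two attentions are not predicted independently: the OT marginal constraints force the joint frame-modality attention to stay consistent with both the frame attention and the modality attention, which is what allows the module to act as a plug-and-play replacement for conventional attention in existing localization frameworks.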