Paper Title
Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization
Paper Authors
Paper Abstract
We target the task of weakly-supervised action localization (WSAL), where only video-level action labels are available during model training. Despite recent progress, existing methods mainly embrace a localization-by-classification paradigm and overlook the fruitful fine-grained temporal distinctions between video sequences, thus suffering from severe ambiguity in classification learning and in classification-to-localization adaptation. This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in WSAL and helps identify coherent action instances. Specifically, under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed: Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting, where the former considers the relations of various action/background proposals by using match, insert, and delete operators, and the latter mines the longest common subsequences between two videos. The two contrasting modules enhance each other and jointly enjoy the merits of discriminative action-background separation and an alleviated task gap between classification and localization. Extensive experiments show that our method achieves state-of-the-art performance on two popular benchmarks. Our code is available at https://github.com/MengyuanChen21/CVPR2022-FTCL.
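The abstract hinges on a differentiable dynamic-programming formulation. The sketch below is purely illustrative and is not the authors' released implementation: it assumes NumPy, a Euclidean cost between snippet features, log-sum-exp smoothing with a temperature gamma, and a constant gap penalty (none of these choices are specified in the abstract), and shows how a soft sequence distance built from match/insert/delete transitions (the FSD idea) and a soft longest-common-subsequence-style score (the LCS idea) can be computed in a differentiable way.

```python
# Illustrative sketch only; assumes NumPy features, Euclidean costs, log-sum-exp
# smoothing (temperature gamma), and a constant gap penalty. Not the FTCL code.
import numpy as np

def soft_min(values, gamma=0.1):
    """Smoothed minimum via log-sum-exp, keeping the DP recursion differentiable."""
    values = np.asarray(values, dtype=float)
    return -gamma * np.log(np.sum(np.exp(-values / gamma)))

def soft_max(values, gamma=0.1):
    """Smoothed maximum via log-sum-exp, used in the LCS-style recursion."""
    values = np.asarray(values, dtype=float)
    return gamma * np.log(np.sum(np.exp(values / gamma)))

def fsd_distance(x, y, gap=1.0, gamma=0.1):
    """Soft sequence distance between feature sequences x (n, d) and y (m, d),
    analogous to a smoothed edit distance with match/insert/delete transitions."""
    n, m = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # pairwise snippet costs
    D = np.zeros((n + 1, m + 1))
    D[1:, 0] = gap * np.arange(1, n + 1)  # deleting a prefix of x
    D[0, 1:] = gap * np.arange(1, m + 1)  # inserting a prefix of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = soft_min(
                [D[i - 1, j - 1] + cost[i - 1, j - 1],  # match
                 D[i - 1, j] + gap,                     # delete
                 D[i, j - 1] + gap],                    # insert
                gamma)
    return D[n, m]

def lcs_score(sim, gamma=0.1):
    """Soft longest-common-subsequence-style score from a pairwise similarity
    matrix sim (n, m); with 0/1 similarities this reduces to the classic LCS recursion."""
    n, m = sim.shape
    S = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i, j] = soft_max(
                [S[i - 1, j - 1] + sim[i - 1, j - 1],  # extend the common subsequence
                 S[i - 1, j],                          # skip a snippet of x
                 S[i, j - 1]],                         # skip a snippet of y
                gamma)
    return S[n, m]

# Toy usage with hypothetical proposal features.
x = np.random.rand(8, 16)  # an 8-snippet proposal with 16-d features
y = np.random.rand(6, 16)  # a 6-snippet proposal
print(fsd_distance(x, y), lcs_score(x @ y.T))
```

In a contrastive setup, proposals of the same action class would be encouraged to have a small FSD and a large LCS score relative to action-background pairs; the actual training objectives are defined in the linked repository.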