Paper Title
Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding
Authors
Abstract
Query-based video grounding is an important yet challenging task in video understanding that aims to localize the target segment in an untrimmed video according to a sentence query. Most previous works have made significant progress by addressing this task in a fully-supervised manner with segment-level labels, which incur a high labeling cost. Although some recent efforts develop weakly-supervised methods that need only video-level knowledge, they generally match multiple pre-defined segment proposals against the query and select the best one, and thus lack the fine-grained frame-level details needed to distinguish frames with high repeatability and similarity within the entire video. To alleviate these limitations, we propose a self-contrastive learning framework that addresses the query-based video grounding task under a weakly-supervised setting. First, instead of utilizing redundant segment proposals, we propose a new grounding scheme that learns frame-wise matching scores with respect to the query semantics and predicts the possible foreground frames using only video-level annotations. Second, since some predicted frames (i.e., boundary frames) are relatively coarse and exhibit a similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm that learns more discriminative frame-wise representations for distinguishing false positive frames. In particular, we iteratively explore multi-scale hard negative samples that are close to positive samples in the representation space to distinguish fine-grained frame-wise details, thus enforcing more accurate segment grounding. Extensive experiments on two challenging benchmarks demonstrate the superiority of our proposed method over state-of-the-art methods.
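The abstract gives no formulas, so the following is only an illustrative sketch of the two ideas it names: frame-wise matching scores against a query, and hard negative mining for a contrastive loss. Everything in it is an assumption made for illustration rather than the paper's method: the cosine-similarity scoring, the fixed threshold `pos_thresh`, the top-`k` selection rule, the InfoNCE-style loss, and all function names. It performs a single mining round and omits the iterative multi-scale, coarse-to-fine schedule.

```python
import numpy as np


def frame_scores(frames, query):
    """Cosine similarity between each frame feature and the query feature.

    Assumption: cosine similarity stands in for the learned matching score.
    """
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return f @ q


def mine_hard_negatives(frames, scores, pos_thresh=0.5, k=2):
    """Return (positive indices, hard negative indices).

    Positives are frames whose score reaches pos_thresh (hypothetical rule).
    Hard negatives are the k remaining frames closest to the positive
    centroid in the normalized feature space.
    """
    pos_mask = scores >= pos_thresh
    if not pos_mask.any():
        # Fall back to the single best-scoring frame as the positive.
        pos_mask[np.argmax(scores)] = True
    pos_idx = np.flatnonzero(pos_mask)
    neg_idx = np.flatnonzero(~pos_mask)

    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    centroid = f[pos_idx].mean(axis=0)
    closeness = f[neg_idx] @ centroid
    hard_idx = neg_idx[np.argsort(-closeness)[:k]]
    return pos_idx, hard_idx


def contrastive_loss(frames, query, pos_idx, neg_idx, tau=0.1):
    """InfoNCE-style loss: each positive is contrasted with all negatives."""
    logits = frame_scores(frames, query) / tau
    pos = logits[pos_idx]
    neg_sum = np.exp(logits[neg_idx]).sum()
    # Per positive: -log(exp(pos) / (exp(pos) + sum(exp(neg))))
    return float(np.mean(np.log(np.exp(pos) + neg_sum) - pos))


# Toy 2-D features: frames 2 and 3 align with the query, frames 1 and 4
# are "boundary-like" frames that lie near the positives.
frames = np.array([[0.0, 1.0], [0.6, 0.8], [1.0, 0.1],
                   [1.0, 0.0], [0.7, 0.7], [-1.0, 0.0]])
query = np.array([1.0, 0.0])

scores = frame_scores(frames, query)
pos_idx, hard_idx = mine_hard_negatives(frames, scores, pos_thresh=0.8, k=2)
print("positives:", pos_idx, "hard negatives:", hard_idx)
print("loss:", contrastive_loss(frames, query, pos_idx, hard_idx))
```

In this toy setup, negatives that sit close to the positives contribute more to the loss than distant ones, which is the intuition behind mining hard negatives to sharpen boundary frames.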