Paper Title
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Paper Authors
Paper Abstract
Text-video retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, enabling HCMI to perform a comprehensive semantic comparison between the video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
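To make the hierarchical objective concrete, the sketch below shows, in PyTorch, contrastive learning applied at the three granularities the abstract names (frame-word, clip-phrase, video-sentence). This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the assumption that each level has been pooled to one embedding per sample, the temperature value, and the equal level weights are all hypothetical.

```python
# Minimal sketch of hierarchical contrastive learning across three
# granularity levels. Assumes each level is already pooled to a single
# (B, D) embedding per sample; names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F


def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (B, D) embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    # Average the video-to-text and text-to-video retrieval directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))


def hierarchical_loss(frame, clip, video, word, phrase, sentence,
                      weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Sum contrastive losses at the frame-word, clip-phrase, and
    video-sentence granularities (each argument: pooled (B, D) embeddings)."""
    w_fw, w_cp, w_vs = weights
    return (w_fw * info_nce(frame, word) +
            w_cp * info_nce(clip, phrase) +
            w_vs * info_nce(video, sentence))
```

Summing per-level InfoNCE terms in this way lets matched pairs be pulled together at every granularity simultaneously, which is one plausible reading of the "comprehensive semantic comparison" the abstract describes; the paper's actual loss weighting, denoising, and sample-enhancement details go beyond this sketch.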