Paper Title
SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries
Paper Authors
Paper Abstract
Retrieving unlabeled videos by textual queries, known as Ad-hoc Video Search (AVS), is a core theme in multimedia data management and retrieval. The success of AVS counts on cross-modal representation learning that encodes both query sentences and videos into common spaces for semantic similarity computation. Inspired by the initial success of a few previous works in combining multiple sentence encoders, this paper takes a step forward by developing a new and general method for effectively exploiting diverse sentence encoders. The novelty of the proposed method, which we term Sentence Encoder Assembly (SEA), is two-fold. First, unlike prior art that uses only a single common space, SEA supports text-video matching in multiple encoder-specific common spaces. This property prevents the matching from being dominated by a specific encoder that produces an encoding vector much longer than those of the other encoders. Second, in order to explore complementarities among the individual common spaces, we propose multi-space multi-loss learning. As extensive experiments on four benchmarks (MSR-VTT, TRECVID AVS 2016-2019, TGIF and MSVD) show, SEA surpasses the state-of-the-art. In addition, SEA is extremely easy to implement. All this makes SEA an appealing solution for AVS and promising for continuously advancing the task by harvesting new sentence encoders.
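The multi-space matching described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration (not the paper's implementation): each sentence encoder gets its own common space via a text projection and a matching video projection, cosine similarity is computed per space, and the scores are summed. All encoder names, dimensionalities, and projection matrices here are hypothetical placeholders chosen only to make the mechanism concrete; normalizing inside each space is what keeps a longer encoding vector from dominating.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two sentence encoders with different output sizes
# (names and dimensions are illustrative, not from the paper).
enc_dims = {"bow": 64, "gru": 128}
common_dim = 32        # dimensionality of each encoder-specific common space
video_feat_dim = 96
n_videos = 5

# One text projection per encoder maps its sentence encoding into its own
# common space; a paired video projection maps video features there too.
W_text = {k: rng.normal(size=(d, common_dim)) for k, d in enc_dims.items()}
W_video = {k: rng.normal(size=(video_feat_dim, common_dim)) for k in enc_dims}

def l2norm(x):
    """L2-normalize the last axis so each space contributes a cosine score."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sea_similarity(text_encodings, video_feats):
    """Sum cosine similarities over the encoder-specific common spaces."""
    sim = np.zeros(len(video_feats))
    for k, t in text_encodings.items():
        q = l2norm(t @ W_text[k])             # query in space k
        v = l2norm(video_feats @ W_video[k])  # videos in space k
        sim += v @ q                          # cosine similarity per video
    return sim

# Toy query encodings and video features.
text_encodings = {k: rng.normal(size=d) for k, d in enc_dims.items()}
video_feats = rng.normal(size=(n_videos, video_feat_dim))

scores = sea_similarity(text_encodings, video_feats)
ranking = np.argsort(-scores)  # videos ranked by the combined similarity
```

During training, the multi-loss part of the method would attach a ranking loss to each space's similarity separately, so every encoder-specific space is directly supervised rather than only the summed score.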