Paper Title

Dual Encoding for Video Retrieval by Text

Authors

Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, Meng Wang

Abstract

This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method.
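The core matching idea described in the abstract — encode each modality into a dense vector, project both into a common space, and rank by similarity — can be sketched in a few lines. The following is a minimal pure-Python illustration, not the paper's network: the mean-pooling encoders stand in for its multi-level encoders, and the projection matrices are random toy stand-ins for learned parameters; all dimensions and inputs are hypothetical.

```python
import random

random.seed(0)

def mean_pool(vectors):
    """Toy single-level encoder: average the per-frame / per-word vectors.
    (The paper's dual network uses richer coarse-to-fine encoding.)"""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def project(vec, matrix):
    """Linear projection into the common space (matrix rows = output dims)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def cosine(a, b):
    """Similarity used to rank videos against a text query."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Hypothetical toy inputs: a 3-frame video with 4-d frame features,
# and a 5-word query with 3-d word embeddings.
frames = [[random.random() for _ in range(4)] for _ in range(3)]
words = [[random.random() for _ in range(3)] for _ in range(5)]

# Random stand-ins for the learned projection matrices (4->2 and 3->2).
W_video = [[random.random() for _ in range(4)] for _ in range(2)]
W_text = [[random.random() for _ in range(3)] for _ in range(2)]

# Both modalities land in the same 2-d common space and become comparable.
video_emb = project(mean_pool(frames), W_video)
text_emb = project(mean_pool(words), W_text)
similarity = cosine(video_emb, text_emb)  # retrieval ranking score
```

In the actual model these projections are trained end-to-end so that matching video–sentence pairs score higher than mismatched ones; the sketch only shows where the two encoding branches meet.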
