Paper Title

Exploiting Visual Semantic Reasoning for Video-Text Retrieval

Paper Authors

Zerun Feng, Zhimin Zeng, Caili Guo, Zheng Li

Abstract

Video retrieval is a challenging research topic bridging the vision and language areas, and it has attracted broad attention in recent years. Previous works represent videos by directly encoding frame-level features. In fact, videos contain abundant semantic relations to which existing methods pay little attention. To address this issue, we propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions. Specifically, we treat frame regions as vertices and construct a fully-connected semantic correlation graph. We then perform reasoning with novel random walk rule-based graph convolutional networks to generate region features enriched with semantic relations. With the benefit of reasoning, semantic interactions between regions are taken into account while the impact of redundancy is suppressed. Finally, the region features are aggregated into frame-level features and further encoded to measure video-text similarity. Extensive experiments on two public benchmark datasets validate the effectiveness of our method, which achieves state-of-the-art performance due to the powerful semantic reasoning.
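The abstract outlines a three-step pipeline: region features become the vertices of a fully-connected semantic correlation graph, a random-walk-style graph convolution propagates semantics between regions, and the reasoned region features are pooled into a frame-level feature. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the module and layer names (`SemanticReasoning`, `phi`, `theta`, `gcn`) are hypothetical, and the softmax row-normalization stands in for the paper's specific random walk rule, which the abstract does not spell out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticReasoning(nn.Module):
    """Hypothetical sketch of graph-based region reasoning (not ViSERN's code).

    Regions of a frame are vertices of a fully-connected graph; edge weights
    come from pairwise affinities of embedded region features, normalized into
    a row-stochastic (random-walk-like) transition matrix, and a graph
    convolution step propagates semantics between regions.
    """

    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)    # first affinity embedding
        self.theta = nn.Linear(dim, dim)  # second affinity embedding
        self.gcn = nn.Linear(dim, dim)    # graph-convolution weights

    def forward(self, regions):
        # regions: (num_regions, dim) features of one frame
        affinity = self.phi(regions) @ self.theta(regions).t()  # fully-connected graph
        walk = F.softmax(affinity, dim=-1)  # row-stochastic "random walk" matrix
        reasoned = F.relu(self.gcn(walk @ regions))  # propagate, then transform
        regions = regions + reasoned  # residual keeps original appearance features
        # aggregate regions into a single frame-level feature (mean pooling here)
        return regions.mean(dim=0)
```

Under these assumptions, a frame with, say, 36 detected regions of dimension 2048 reduces to one 2048-d frame feature, e.g. `SemanticReasoning(2048)(torch.randn(36, 2048))`; the frame features would then be encoded further and matched against text embeddings.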
