MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

Paper title

MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

Authors

Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei Ivaniuta

Abstract

In this work we present a new state-of-the-art result on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model. Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs. A careful analysis of available pre-trained networks helps to choose those that provide the best prior knowledge. We introduce a three-stage training procedure that provides high knowledge-transfer efficiency and allows noisy datasets to be used during training without degrading prior knowledge. Additionally, double positional encoding is used for better fusion of different modalities, and a simple method for processing non-square inputs is suggested.
