Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

Paper Title

Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

Authors

Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

Abstract

This technical report describes the SViT approach for the Ego4D Point of No Return (PNR) Temporal Localization Challenge. We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights. First, as both images and videos contain structured information, we enrich a transformer model with a set of "object tokens" that can be used across images and videos. Second, the scene representations of individual frames in video should "align" with those of still images. This is achieved via a "Frame-Clip Consistency" loss, which ensures the flow of structured information between images and videos. SViT obtains strong performance on the challenge test set with 0.656 absolute temporal localization error.
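To make the reported metric concrete, here is a minimal sketch of how an absolute temporal localization error could be computed. It is not the challenge's official evaluation code. It assumes the error is the mean absolute gap, in seconds, between the predicted and annotated PNR frames; the function name, the frame-index inputs, and the `fps` value are hypothetical choices made for illustration.

```python
from typing import Sequence


def mean_abs_localization_error(
    pred_frames: Sequence[int],
    true_frames: Sequence[int],
    fps: float = 30.0,  # assumed frame rate, not stated in the abstract
) -> float:
    """Mean absolute gap in seconds between predicted and annotated PNR frames."""
    if len(pred_frames) != len(true_frames):
        raise ValueError("pred_frames and true_frames must have the same length")
    if len(pred_frames) == 0:
        raise ValueError("at least one clip is required")
    if fps <= 0:
        raise ValueError("fps must be positive")

    # Sum the per-clip absolute frame offsets, then convert frames to seconds.
    total_frame_gap = sum(abs(p - t) for p, t in zip(pred_frames, true_frames))
    return total_frame_gap / (len(pred_frames) * fps)


if __name__ == "__main__":
    # Two hypothetical clips: one prediction is 15 frames off, the other exact.
    print(mean_abs_localization_error([10, 40], [25, 40], fps=30.0))
```

Under these assumptions, lower is better, and a score of 0.656 would mean predictions land about two thirds of a second from the annotated PNR frame on average.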
