Paper title
Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022
Paper authors
Paper abstract
This technical report describes the SViT approach for the Ego4D Point of No Return (PNR) Temporal Localization Challenge. We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights. First, as both images and videos contain structured information, we enrich a transformer model with a set of \emph{object tokens} that can be used across images and videos. Second, the scene representations of individual frames in video should "align" with those of still images. This is achieved via a "Frame-Clip Consistency" loss, which ensures the flow of structured information between images and videos. SViT obtains strong performance on the challenge test set with 0.656 absolute temporal localization error.
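To make the reported metric concrete, the following is a minimal sketch of how an absolute temporal localization error could be computed. It assumes the error is the mean absolute difference between predicted and ground-truth PNR frame indices, converted to seconds; the function name, the `fps` parameter, and its default of 30 are illustrative assumptions and are not taken from the report or the challenge toolkit.

```python
def mean_absolute_temporal_error(pred_frames, true_frames, fps=30.0):
    """Mean absolute PNR localization error in seconds.

    Assumption: one predicted and one ground-truth PNR frame index per clip,
    and a constant frame rate `fps` (hypothetical default, not from the report).
    """
    if len(pred_frames) != len(true_frames):
        raise ValueError("pred_frames and true_frames must have the same length")
    if not pred_frames:
        raise ValueError("at least one clip is required")
    # Sum of per-clip absolute frame offsets.
    total_frame_error = sum(abs(p - t) for p, t in zip(pred_frames, true_frames))
    # Average over clips, then convert frames to seconds.
    return total_frame_error / (len(pred_frames) * fps)


if __name__ == "__main__":
    # Two clips: the first prediction is 15 frames off, the second is exact.
    print(mean_absolute_temporal_error([10, 40], [25, 40], fps=30.0))
```

Under these assumptions, lower values are better, and an error of 0.656 would mean predictions land on average about two thirds of a second from the annotated PNR moment.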