Title

FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks

Authors

Santiago Castro, Fabian Caba Heilbron

Abstract


Large-scale pretrained image-text models have shown incredible zero-shot performance on a handful of tasks, including video ones such as action recognition and text-to-video retrieval. However, these models have not been adapted to video, mainly because they do not account for the time dimension, but also because video frames differ from typical images (e.g., they contain motion blur and are less sharp). In this paper, we present a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks. We show that by carefully adapting these models we obtain considerable improvements on two zero-shot action recognition tasks and three zero-shot text-to-video retrieval tasks. The code is available at https://github.com/bryant1410/fitclip
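The zero-shot setting the abstract refers to, applying an image-text model to video without video-specific training, is commonly instantiated by mean-pooling per-frame image embeddings and matching them against class-prompt text embeddings. The following is a minimal sketch of that baseline setup; the function name, the mean-pooling choice, and the pre-computed embeddings are illustrative assumptions, not the paper's specific method.

```python
import numpy as np

def zero_shot_video_classify(frame_embs: np.ndarray, text_embs: np.ndarray) -> int:
    """Score a video against class prompts (illustrative sketch, not FitCLIP itself).

    frame_embs: (T, D) array of per-frame image embeddings from an image-text model.
    text_embs:  (C, D) array of text embeddings, one per class prompt
                (e.g., "a video of a person <action>").
    Returns the index of the best-matching class prompt.
    """
    # Collapse the time dimension by mean-pooling frame embeddings
    # (a common baseline; it ignores temporal order, which is one of
    # the limitations the abstract points out).
    video = frame_embs.mean(axis=0)
    video = video / np.linalg.norm(video)  # L2-normalize the video embedding

    # L2-normalize each text embedding so dot products are cosine similarities.
    text = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    sims = text @ video  # (C,) cosine similarity per class
    return int(np.argmax(sims))
```

The same similarity scores, computed between one text query and many pooled video embeddings instead, give the ranking used for zero-shot text-to-video retrieval.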
