Paper Title

REST: REtrieve & Self-Train for generative action recognition

Paper Authors

Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos

Paper Abstract

This work is on training a generative action/video recognition model whose output is a free-form action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages like producing more fine-grained and human-readable output, and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While recently there have been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We firstly show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and Self-training, i.e. without using any action-specific labels; (b) a Retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition where we show that our approach is very competitive when compared to contrastive learning-based methods. Code will be made available.
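
The abstract names its two ingredients only at a high level. Below is a minimal Python sketch of how CLIP-based pseudo-caption retrieval and a self-training update could fit together, assuming stand-in CLIP embeddings, a toy captioner interface, and arbitrary hyper-parameters; it is an illustration of the idea, not the authors' implementation.

```python
# Illustrative sketch (not the REST authors' code) of the two components from
# the abstract: (a) retrieving diverse pseudo-captions for a video by CLIP-style
# similarity, and (b) a self-training step that fine-tunes a generative
# captioner on a retrieved pseudo-caption. All embeddings, the captioner
# interface, and hyper-parameters are stand-in assumptions.
import torch
import torch.nn.functional as F


def retrieve_pseudo_captions(video_emb, caption_embs, captions, k=5):
    """Rank a caption corpus by cosine similarity to a video embedding and
    keep the top-k entries as pseudo-captions (no action labels involved)."""
    sims = F.cosine_similarity(video_emb.unsqueeze(0), caption_embs, dim=-1)
    topk = sims.topk(k).indices.tolist()
    return [captions[i] for i in topk]


def self_training_step(captioner, optimizer, video_feats, token_ids):
    """One self-training update: teacher-forced cross-entropy on a retrieved
    pseudo-caption, conditioned on the video features."""
    optimizer.zero_grad()
    logits = captioner(video_feats, token_ids[:, :-1])        # (1, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-ins for CLIP text/video embeddings of a small caption corpus.
    corpus = ["a person playing an acoustic guitar",
              "someone slicing vegetables in a kitchen",
              "a dog catching a frisbee in a park",
              "a man doing push ups on a mat"]
    caption_embs = torch.randn(len(corpus), 512)
    video_emb = torch.randn(512)
    print(retrieve_pseudo_captions(video_emb, caption_embs, corpus, k=2))

    class ToyCaptioner(torch.nn.Module):
        """Hypothetical captioner: token embeddings conditioned on a pooled
        video feature, followed by a vocabulary projection."""
        def __init__(self, vocab=100, dim=512):
            super().__init__()
            self.tok = torch.nn.Embedding(vocab, dim)
            self.head = torch.nn.Linear(dim, vocab)

        def forward(self, video_feats, tokens):
            h = self.tok(tokens) + video_feats.unsqueeze(1)
            return self.head(h)

    captioner = ToyCaptioner()
    optimizer = torch.optim.AdamW(captioner.parameters(), lr=1e-4)
    token_ids = torch.randint(0, 100, (1, 8))   # fake tokenised pseudo-caption
    video_feats = torch.randn(1, 512)
    print(self_training_step(captioner, optimizer, video_feats, token_ids))
```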
