Paper Title

REST: REtrieve & Self-Train for generative action recognition

Paper Authors

Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos

Paper Abstract

This work is on training a generative action/video recognition model whose output is a free-form action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages like producing more fine-grained and human-readable output, and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While recently there have been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We firstly show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and Self-training, i.e. without using any action-specific labels; (b) a Retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition where we show that our approach is very competitive when compared to contrastive learning-based methods. Code will be made available.
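
The abstract names its two ingredients only at a high level. Below is a minimal Python sketch of how CLIP-based pseudo-caption retrieval and a self-training update could fit together, assuming stand-in CLIP embeddings, a toy captioner interface, and arbitrary hyper-parameters; it is an illustration of the idea, not the authors' implementation.

```python
# Illustrative sketch (not the REST authors' code) of the two components from
# the abstract: (a) retrieving diverse pseudo-captions for a video by CLIP-style
# similarity, and (b) a self-training step that fine-tunes a generative
# captioner on a retrieved pseudo-caption. All embeddings, the captioner
# interface, and hyper-parameters are stand-in assumptions.
import torch
import torch.nn.functional as F


def retrieve_pseudo_captions(video_emb, caption_embs, captions, k=5):
    """Rank a caption corpus by cosine similarity to a video embedding and
    keep the top-k entries as pseudo-captions (no action labels involved)."""
    sims = F.cosine_similarity(video_emb.unsqueeze(0), caption_embs, dim=-1)
    topk = sims.topk(k).indices.tolist()
    return [captions[i] for i in topk]


def self_training_step(captioner, optimizer, video_feats, token_ids):
    """One self-training update: teacher-forced cross-entropy on a retrieved
    pseudo-caption, conditioned on the video features."""
    optimizer.zero_grad()
    logits = captioner(video_feats, token_ids[:, :-1])        # (1, T-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-ins for CLIP text/video embeddings of a small caption corpus.
    corpus = ["a person playing an acoustic guitar",
              "someone slicing vegetables in a kitchen",
              "a dog catching a frisbee in a park",
              "a man doing push ups on a mat"]
    caption_embs = torch.randn(len(corpus), 512)
    video_emb = torch.randn(512)
    print(retrieve_pseudo_captions(video_emb, caption_embs, corpus, k=2))

    class ToyCaptioner(torch.nn.Module):
        """Hypothetical captioner: token embeddings conditioned on a pooled
        video feature, followed by a vocabulary projection."""
        def __init__(self, vocab=100, dim=512):
            super().__init__()
            self.tok = torch.nn.Embedding(vocab, dim)
            self.head = torch.nn.Linear(dim, vocab)

        def forward(self, video_feats, tokens):
            h = self.tok(tokens) + video_feats.unsqueeze(1)
            return self.head(h)

    captioner = ToyCaptioner()
    optimizer = torch.optim.AdamW(captioner.parameters(), lr=1e-4)
    token_ids = torch.randint(0, 100, (1, 8))   # fake tokenised pseudo-caption
    video_feats = torch.randn(1, 512)
    print(self_training_step(captioner, optimizer, video_feats, token_ids))
```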
