Paper Title
Parameter Efficient Multimodal Transformers for Video Representation Learning
Paper Authors
Paper Abstract
The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces the parameters of the Transformers by up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.
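To make the parameter-sharing idea in the abstract concrete, below is a minimal, illustrative sketch (written in PyTorch, which the abstract does not specify) of one way to combine a weight that is shared across layers and modalities with a low-rank, modality-specific correction. The module and argument names (`SharedLowRankLinear`, `rank`, `n_modalities`) are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: one possible reading of "parameters shared across
# layers and modalities, with a low-rank approximation for the parts that differ".
# Not the paper's actual code; names and shapes are assumptions.
import torch
import torch.nn as nn


class SharedLowRankLinear(nn.Module):
    """A linear projection whose full-rank weight is reused by every Transformer
    layer and modality, plus a small low-rank term per modality."""

    def __init__(self, dim: int, rank: int, n_modalities: int):
        super().__init__()
        # Full-rank weight shared by all layers/modalities that hold this module.
        self.shared = nn.Linear(dim, dim, bias=True)
        # Low-rank, modality-specific factors: correction_m = U_m @ V_m,
        # with U_m of shape (dim, rank) and V_m of shape (rank, dim).
        self.U = nn.Parameter(torch.randn(n_modalities, dim, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(n_modalities, rank, dim) * 0.02)

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        correction = x @ self.V[modality].t() @ self.U[modality].t()
        return self.shared(x) + correction


if __name__ == "__main__":
    layer = SharedLowRankLinear(dim=256, rank=8, n_modalities=2)
    video_tokens = torch.randn(4, 480, 256)   # e.g. 480 frames of a 30-second clip
    audio_tokens = torch.randn(4, 128, 256)
    v = layer(video_tokens, modality=0)
    a = layer(audio_tokens, modality=1)
    print(v.shape, a.shape)  # torch.Size([4, 480, 256]) torch.Size([4, 128, 256])
```

Under this reading, every Transformer layer reuses the same `shared` weight and adds only 2 x dim x rank modality-specific parameters, so the parameter count grows with the chosen rank rather than with the number of layers, which is the kind of reduction the abstract describes.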