Paper Title

MultiMAE: Multi-modal Multi-task Masked Autoencoders

Paper Authors

Roman Bachmann, David Mizrahi, Andrei Atanov, Amir Zamir

Paper Abstract

We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE). It differs from standard Masked Autoencoding in two key aspects: (i) it can optionally accept additional modalities of information in the input besides the RGB image (hence "multi-modal"), and (ii) its training objective accordingly includes predicting multiple outputs besides the RGB image (hence "multi-task"). We make use of masking (across image patches and input modalities) to make training MultiMAE tractable, as well as to ensure that cross-modality predictive coding is indeed learned by the network. We show this pre-training strategy leads to a flexible, simple, and efficient framework with improved transfer results to downstream tasks. In particular, the exact same pre-trained network can be flexibly used whether additional information besides RGB images is available or no information other than RGB is available, in all configurations yielding competitive to or significantly better results than the baselines. To avoid needing training datasets with multiple modalities and tasks, we train MultiMAE entirely using pseudo labeling, which makes the framework widely applicable to any RGB dataset. The experiments are performed on multiple transfer tasks (image classification, semantic segmentation, depth estimation) and datasets (ImageNet, ADE20K, Taskonomy, Hypersim, NYUv2). The results show an intriguingly impressive capability by the model in cross-modal/task predictive coding and transfer.
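
To make the masking-across-modalities idea in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation. The module names, dimensions, channel counts, and the uniform joint sampling of visible tokens across modalities are illustrative assumptions; the actual MultiMAE uses per-modality tokenizers with positional and modality embeddings, and shallow per-task decoders with cross-attention, and computes the reconstruction loss only on masked tokens.

```python
# A hypothetical sketch of MultiMAE-style pre-training (assumptions noted above).
# It shows: per-modality patch tokenization, masking a random subset of tokens
# jointly across modalities, encoding only the visible tokens with a shared
# Transformer, and decoding every modality from the shared representation.
import torch
import torch.nn as nn


class MultiModalMaskedAutoencoder(nn.Module):
    def __init__(self, modalities=("rgb", "depth", "semseg"),
                 in_chans={"rgb": 3, "depth": 1, "semseg": 1},
                 patch=16, dim=256, depth_enc=4, depth_dec=1, heads=8):
        super().__init__()
        self.patch, self.dim = patch, dim
        # One linear patch-embedding ("tokenizer") per input modality.
        # (Positional and modality embeddings are omitted here for brevity.)
        self.embed = nn.ModuleDict({
            m: nn.Conv2d(in_chans[m], dim, kernel_size=patch, stride=patch)
            for m in modalities})
        # Shared encoder over the visible tokens of all modalities.
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth_enc)
        # One shallow decoder per output task, predicting per-patch values.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.decoders = nn.ModuleDict({
            m: nn.Sequential(nn.TransformerEncoder(dec_layer, depth_dec),
                             nn.Linear(dim, in_chans[m] * patch * patch))
            for m in modalities})

    def forward(self, inputs, num_visible=98):
        # inputs: dict modality -> (B, C, H, W); all modalities share H and W.
        tokens = {m: self.embed[m](x).flatten(2).transpose(1, 2)  # (B, N, dim)
                  for m, x in inputs.items()}
        B = next(iter(tokens.values())).shape[0]
        all_tokens = torch.cat(list(tokens.values()), dim=1)       # (B, M*N, dim)
        total = all_tokens.shape[1]
        # Keep a small random subset of tokens, sampled jointly across modalities,
        # so reconstructing the rest requires cross-modal predictive coding.
        perm = torch.rand(B, total).argsort(dim=1)
        visible_idx = perm[:, :num_visible]                        # (B, num_visible)
        gather = visible_idx.unsqueeze(-1).expand(-1, -1, self.dim)
        visible = torch.gather(all_tokens, 1, gather)
        encoded = self.encoder(visible)
        # Scatter encoded tokens back into a full sequence of mask tokens,
        # then decode every modality from the shared representation.
        full = self.mask_token.expand(B, total, -1).clone()
        full.scatter_(1, gather, encoded)
        return {m: self.decoders[m](full) for m in tokens}


if __name__ == "__main__":
    model = MultiModalMaskedAutoencoder()
    batch = {"rgb": torch.randn(2, 3, 224, 224),
             "depth": torch.randn(2, 1, 224, 224),
             "semseg": torch.randn(2, 1, 224, 224)}
    preds = model(batch)
    print({m: tuple(p.shape) for m, p in preds.items()})
```

In a full setup one would add positional and modality embeddings, use pseudo-labeled depth and semantic maps as the extra modalities (as the abstract describes), and apply the per-task reconstruction losses only on the masked tokens.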
