Paper Title

APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

Paper Authors

Elan Rosenfeld, Preetum Nakkiran, Hadi Pouransari, Oncel Tuzel, Fartash Faghri

Paper Abstract

Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment data relevant to the downstream task of interest. We study a natural approach to aligning existing encoders via small auxiliary functions, and we find that this method is competitive with (or outperforms) state of the art in many settings while being less prone to overfitting, less costly to train, and more robust to distribution shift. With a properly chosen alignment distribution, our method surpasses prior state of the art for ImageNet zero-shot classification on public data while using two orders of magnitude less time and data and training 77% fewer parameters.
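The core idea of aligning existing encoders "via small auxiliary functions" can be sketched as follows: freeze the pretrained unimodal encoders, attach one small projection per modality, and train only those projections with a CLIP-style symmetric contrastive loss on paired data. This is an illustrative sketch, not the paper's implementation; the dimensions, the linear form of the auxiliary functions, and the temperature value are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the two pretrained encoders output different dimensions.
d_img, d_txt, d_joint, batch = 768, 512, 256, 8

# Stand-ins for frozen pretrained encoder outputs on a batch of paired data.
img_feats = rng.standard_normal((batch, d_img))
txt_feats = rng.standard_normal((batch, d_txt))

# Small auxiliary functions: one trainable linear projection per modality
# (everything upstream stays frozen, so only these parameters are trained).
W_img = rng.standard_normal((d_img, d_joint)) / np.sqrt(d_img)
W_txt = rng.standard_normal((d_txt, d_joint)) / np.sqrt(d_txt)

def project(x, W):
    """Map frozen features into the shared space and unit-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def clip_loss(z_img, z_txt, temperature=0.07):
    """Symmetric contrastive loss: matched pairs sit on the diagonal."""
    logits = z_img @ z_txt.T / temperature   # (batch, batch) cosine similarities
    idx = np.arange(len(logits))

    def xent(l):
        # cross-entropy of each row against its diagonal (matched) entry
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

loss = clip_loss(project(img_feats, W_img), project(txt_feats, W_txt))
print(f"contrastive alignment loss: {loss:.3f}")
```

Because only the two projections carry gradients, training is far cheaper than fitting a full dual-encoder model, which is consistent with the abstract's claim of training 77% fewer parameters.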
