Paper Title
Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks
Paper Authors
Paper Abstract
For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then, for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional ``supervised learning'' (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps downstream tasks. To answer these questions, we first theoretically show that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then, since the pretraining dataset is of huge size and high diversity and thus covers most features in the downstream dataset, the pretrained encoder captures as many features as possible from the downstream dataset in the fine-tuning phase and, with theoretical guarantees, does not lose them. In contrast, SL captures only a random subset of features, as suggested by the lottery ticket hypothesis. Hence MRP provably achieves better performance than SL on classification tasks. Experimental results verify our data assumptions and theoretical implications.
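To make the MRP setup described in the abstract concrete, below is a minimal PyTorch sketch, not the paper's actual implementation: it assumes a toy two-layer convolutional encoder, a one-layer convolutional decoder, random patch masking, and a pixel-reconstruction loss computed only on the masked patches. All names (ToyMRP, patch_size, mask_ratio) are illustrative placeholders.

```python
# Minimal sketch of mask-reconstruction pretraining (MRP), assuming a toy
# two-layer conv encoder / one-layer conv decoder and pixel reconstruction
# of randomly masked patches. Illustrative only, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMRP(nn.Module):
    def __init__(self, in_ch=3, hidden=64, patch_size=8):
        super().__init__()
        self.patch_size = patch_size
        # Two-layer convolutional encoder (matching the analyzed architecture).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # One-layer convolutional decoder that reconstructs pixels.
        self.decoder = nn.Conv2d(hidden, in_ch, kernel_size=3, padding=1)

    def random_patch_mask(self, x, mask_ratio=0.75):
        """Zero out a random subset of non-overlapping patches."""
        b, _, h, w = x.shape
        p = self.patch_size
        gh, gw = h // p, w // p
        keep = (torch.rand(b, 1, gh, gw, device=x.device) > mask_ratio).float()
        mask = keep.repeat_interleave(p, dim=2).repeat_interleave(p, dim=3)
        return x * mask, 1.0 - mask  # masked input, indicator of masked pixels

    def forward(self, x, mask_ratio=0.75):
        x_masked, masked_ind = self.random_patch_mask(x, mask_ratio)
        recon = self.decoder(self.encoder(x_masked))
        # Mean squared reconstruction error over the masked pixels only.
        per_pixel = F.mse_loss(recon, x, reduction="none") * masked_ind
        denom = (masked_ind.sum() * x.size(1)).clamp(min=1.0)
        return per_pixel.sum() / denom


if __name__ == "__main__":
    model = ToyMRP()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    images = torch.randn(4, 3, 32, 32)  # stand-in for a pretraining batch
    loss = model(images)
    loss.backward()
    opt.step()
    print(f"pretraining reconstruction loss: {loss.item():.4f}")
```

After such pretraining, the encoder would be kept and fine-tuned with a supervised head on the downstream classification task, which is the comparison against training from scratch that the abstract discusses.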