Paper Title

Self-Distillation for Further Pre-training of Transformers

Paper Authors

Seanie Lee, Minki Kang, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi

Paper Abstract

Pre-training a large transformer model on a massive amount of unlabeled data and fine-tuning it on labeled datasets for diverse downstream tasks has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, direct fine-tuning of the pre-trained model may be suboptimal if there exist large discrepancies across data domains for pre-training and fine-tuning. To tackle this issue, several previous studies have proposed further pre-training strategies, where we continue to pre-train the model on the target unlabeled dataset before fine-tuning. However, all of them solely focus on language models and we empirically find that a Vision Transformer is vulnerable to overfitting as we continue to pre-train the model on target unlabeled data. In order to tackle this limitation, we propose self-distillation as a regularization for a further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and then consider it as a teacher for self-distillation. Then we take the same initial pre-trained model as a student and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks. Experimentally, we show that our proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially help improve the performance of the downstream tasks.
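The abstract outlines a concrete training recipe: a teacher obtained by further pre-training the initial model on target unlabeled data, and a student restarted from the same initial weights, trained with a masked auto-encoding objective plus a term pulling its hidden representations toward the teacher's. Below is a minimal PyTorch-style sketch of such a loop. The stand-in Transformer encoder, the linear reconstruction head, the MSE form of both losses, and the names make_encoder, train_step, mask_ratio, and lam are illustrative assumptions, not the paper's actual implementation (which builds on ViT and language-model backbones).

```python
# Minimal sketch of self-distillation for further pre-training (assumptions noted above).
# Teacher: a copy of the pre-trained encoder that has already been further pre-trained
# on the target unlabeled data. Student: restarts from the initial pre-trained weights
# and is trained with masked auto-encoding plus a hidden-representation distillation loss.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(dim=256, depth=4, heads=4):
    # Stand-in encoder; the paper uses pre-trained ViT / language-model backbones.
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

dim, seq_len, mask_ratio = 256, 64, 0.5

initial_encoder = make_encoder(dim)       # pre-trained weights would be loaded here
teacher = copy.deepcopy(initial_encoder)  # assumed already further pre-trained on target data
student = copy.deepcopy(initial_encoder)  # restarts from the initial pre-trained weights
decoder = nn.Linear(dim, dim)             # lightweight reconstruction head (assumption)

for p in teacher.parameters():            # teacher is frozen during self-distillation
    p.requires_grad_(False)

opt = torch.optim.AdamW(list(student.parameters()) + list(decoder.parameters()), lr=1e-4)
lam = 1.0                                 # weight of the distillation term (hyperparameter)

def train_step(tokens):
    # tokens: (batch, seq_len, dim) embedded patches/tokens
    mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked positions

    student_hidden = student(corrupted)
    recon = decoder(student_hidden)
    # Masked auto-encoding objective: reconstruct the masked positions.
    mae_loss = F.mse_loss(recon[mask], tokens[mask])

    with torch.no_grad():
        teacher_hidden = teacher(corrupted)
    # Self-distillation: keep the student's hidden representations close to the teacher's.
    distill_loss = F.mse_loss(student_hidden, teacher_hidden)

    loss = mae_loss + lam * distill_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random embeddings standing in for real patch/token embeddings.
dummy = torch.randn(8, seq_len, dim)
print(train_step(dummy))
```

Freezing the teacher and weighting the distillation term with a single scalar keeps the sketch close to the abstract's high-level description; which hidden layers are matched and how the two terms are weighted in the actual method may differ.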
