Paper Title

Self-Distillation for Further Pre-training of Transformers

Paper Authors

Seanie Lee, Minki Kang, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi

Paper Abstract

Pre-training a large transformer model on a massive amount of unlabeled data and fine-tuning it on labeled datasets for diverse downstream tasks has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, direct fine-tuning of the pre-trained model may be suboptimal if there exist large discrepancies across data domains for pre-training and fine-tuning. To tackle this issue, several previous studies have proposed further pre-training strategies, where we continue to pre-train the model on the target unlabeled dataset before fine-tuning. However, all of them solely focus on language models and we empirically find that a Vision Transformer is vulnerable to overfitting as we continue to pre-train the model on target unlabeled data. In order to tackle this limitation, we propose self-distillation as a regularization for a further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and then consider it as a teacher for self-distillation. Then we take the same initial pre-trained model as a student and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks. Experimentally, we show that our proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially help improve the performance of the downstream tasks.
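The abstract outlines a concrete training recipe: a teacher obtained by further pre-training the initial model on target unlabeled data, and a student restarted from the same initial weights, trained with a masked auto-encoding objective plus a term pulling its hidden representations toward the teacher's. Below is a minimal PyTorch-style sketch of such a loop. The stand-in Transformer encoder, the linear reconstruction head, the MSE form of both losses, and the names make_encoder, train_step, mask_ratio, and lam are illustrative assumptions, not the paper's actual implementation (which builds on ViT and language-model backbones).

```python
# Minimal sketch of self-distillation for further pre-training (assumptions noted above).
# Teacher: a copy of the pre-trained encoder that has already been further pre-trained
# on the target unlabeled data. Student: restarts from the initial pre-trained weights
# and is trained with masked auto-encoding plus a hidden-representation distillation loss.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(dim=256, depth=4, heads=4):
    # Stand-in encoder; the paper uses pre-trained ViT / language-model backbones.
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

dim, seq_len, mask_ratio = 256, 64, 0.5

initial_encoder = make_encoder(dim)       # pre-trained weights would be loaded here
teacher = copy.deepcopy(initial_encoder)  # assumed already further pre-trained on target data
student = copy.deepcopy(initial_encoder)  # restarts from the initial pre-trained weights
decoder = nn.Linear(dim, dim)             # lightweight reconstruction head (assumption)

for p in teacher.parameters():            # teacher is frozen during self-distillation
    p.requires_grad_(False)

opt = torch.optim.AdamW(list(student.parameters()) + list(decoder.parameters()), lr=1e-4)
lam = 1.0                                 # weight of the distillation term (hyperparameter)

def train_step(tokens):
    # tokens: (batch, seq_len, dim) embedded patches/tokens
    mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked positions

    student_hidden = student(corrupted)
    recon = decoder(student_hidden)
    # Masked auto-encoding objective: reconstruct the masked positions.
    mae_loss = F.mse_loss(recon[mask], tokens[mask])

    with torch.no_grad():
        teacher_hidden = teacher(corrupted)
    # Self-distillation: keep the student's hidden representations close to the teacher's.
    distill_loss = F.mse_loss(student_hidden, teacher_hidden)

    loss = mae_loss + lam * distill_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random embeddings standing in for real patch/token embeddings.
dummy = torch.randn(8, seq_len, dim)
print(train_step(dummy))
```

Freezing the teacher and weighting the distillation term with a single scalar keeps the sketch close to the abstract's high-level description; which hidden layers are matched and how the two terms are weighted in the actual method may differ.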
