Paper Title


A Study on Transformer Configuration and Training Objective

Paper Authors

Fuzhao Xue, Jianghai Chen, Aixin Sun, Xiaozhe Ren, Zangwei Zheng, Xiaoxin He, Yongming Chen, Xin Jiang, Yang You

Paper Abstract


Transformer-based models have delivered impressive results on many tasks, particularly vision and language tasks. In many model training situations, conventional configurations are typically adopted. For example, we often set the base model with a hidden dimension (i.e., model width) of 768 and a number of transformer layers (i.e., model depth) of 12. In this paper, we revisit these conventional configurations. Through theoretical analysis and experimental evaluation, we show that the masked autoencoder is effective in alleviating the over-smoothing issue in deep transformer training. Based on this finding, we propose Bamboo, the idea of using deeper and narrower transformer configurations for masked autoencoder training. On ImageNet, with such a simple change in configuration, the re-designed model achieves 87.1% top-1 accuracy and outperforms SoTA models such as MAE and BEiT. On language tasks, the re-designed model outperforms BERT with the default configuration by 1.1 points on average on the GLUE datasets.
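The abstract contrasts the conventional base configuration (hidden dimension 768, depth 12) with a deeper-and-narrower alternative. The minimal sketch below illustrates what such a shape change means in terms of raw parameter count; the `TransformerConfig` class, the 48-layer / 384-dimension numbers, and the rough per-layer counting formula are illustrative assumptions chosen to keep the two models roughly the same size, not the paper's actual Bamboo configuration.

```python
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    """Minimal transformer shape description, used here only for rough parameter counting."""
    num_layers: int   # model depth
    hidden_dim: int   # model width
    num_heads: int
    ffn_dim: int

    def approx_params(self) -> int:
        # Rough per-layer count: attention projections (4 * d^2) + feed-forward (2 * d * ffn).
        per_layer = 4 * self.hidden_dim ** 2 + 2 * self.hidden_dim * self.ffn_dim
        return self.num_layers * per_layer


# Conventional base configuration mentioned in the abstract: width 768, depth 12.
default_base = TransformerConfig(num_layers=12, hidden_dim=768, num_heads=12, ffn_dim=3072)

# A hypothetical deeper-and-narrower variant in the spirit of Bamboo; the exact
# dimensions are not given in the abstract, so these numbers are illustrative only.
deeper_narrower = TransformerConfig(num_layers=48, hidden_dim=384, num_heads=6, ffn_dim=1536)

if __name__ == "__main__":
    print(f"default base      : ~{default_base.approx_params() / 1e6:.1f}M params")
    print(f"deeper & narrower : ~{deeper_narrower.approx_params() / 1e6:.1f}M params")
```

Both configurations come out at roughly 85M (non-embedding) parameters under this crude estimate, which is the point of the comparison: the paper argues for reallocating a fixed budget toward depth rather than width when training with a masked autoencoder objective.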
