Paper Title

FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?

Paper Authors

Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, Niraj K. Jha

Paper Abstract

The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. Training such models and exploring their hyperparameter space, however, is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, analysis has been limited to homogeneous models that use fixed dimensionality throughout the network. This leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves an 8.9% higher GLUE score. A FlexiBERT model with performance equivalent to that of the best homogeneous model is 2.6x smaller. FlexiBERT-Large, another proposed model, achieves state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.
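The abstract describes two ideas that a concrete illustration can make tangible: a heterogeneous design space in which every encoder layer independently chooses its operation type and hidden dimension, and a surrogate-guided search loop (BOSHNAS) that selects the next architecture to train based on predicted performance and uncertainty. The sketch below is a minimal, self-contained Python illustration of that loop; the operation names, dimension grid, evaluate_architecture stub, and k-nearest-neighbor surrogate are all illustrative assumptions, not the paper's actual design space or implementation (BOSHNAS uses a neural surrogate with Bayesian uncertainty estimates and second-order optimization).

```python
import random
import statistics

# Hypothetical heterogeneous design space (illustrative; not the paper's exact grid).
OPERATIONS = ["self_attention", "linear_transform", "convolution"]
HIDDEN_DIMS = [128, 256]
NUM_LAYERS = 4


def sample_architecture(rng):
    """Sample a heterogeneous architecture: every layer independently
    picks an operation type and a hidden dimension."""
    return tuple((rng.choice(OPERATIONS), rng.choice(HIDDEN_DIMS))
                 for _ in range(NUM_LAYERS))


def embed(arch):
    """Toy stand-in for the paper's graph-similarity-based embedding:
    a flat numeric vector per architecture."""
    return ([float(OPERATIONS.index(op)) for op, _ in arch]
            + [dim / max(HIDDEN_DIMS) for _, dim in arch])


def evaluate_architecture(arch, rng):
    """Stub for the expensive train-and-score step (fake GLUE-like score)."""
    capacity = sum(dim for _, dim in arch) / (NUM_LAYERS * max(HIDDEN_DIMS))
    return capacity + 0.1 * rng.random()


def surrogate_predict(x, history, k=3):
    """k-nearest-neighbor surrogate over embeddings; the spread of neighbor
    scores stands in for predictive uncertainty. (BOSHNAS itself trains a
    neural surrogate with Bayesian uncertainty and second-order steps.)"""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, embed(arch))), score)
        for arch, score in history
    )
    neighbors = [score for _, score in dists[:k]]
    mean = statistics.fmean(neighbors)
    std = statistics.pstdev(neighbors) if len(neighbors) > 1 else 1.0
    return mean, std


def search(budget=20, pool_size=50, seed=0):
    rng = random.Random(seed)
    # Seed the surrogate with a few random evaluations.
    history = [(a, evaluate_architecture(a, rng))
               for a in (sample_architecture(rng) for _ in range(3))]
    for _ in range(budget):
        pool = [sample_architecture(rng) for _ in range(pool_size)]

        def ucb(arch):
            mean, std = surrogate_predict(embed(arch), history)
            return mean + std  # optimism under uncertainty drives exploration

        # Acquisition: train only the most promising candidate this round.
        best = max(pool, key=ucb)
        history.append((best, evaluate_architecture(best, rng)))
    return max(history, key=lambda pair: pair[1])


if __name__ == "__main__":
    arch, score = search()
    print("best architecture:", arch)
    print("score:", round(score, 3))
```

The key shape the toy preserves is the explore-exploit loop: only the architecture with the best optimistic prediction is sent to the expensive training step, which is what makes surrogate-guided NAS tractable over a design space too large to enumerate.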
