Paper Title
Undivided Attention: Are Intermediate Layers Necessary for BERT?
Paper Authors
Paper Abstract
In recent times, BERT-based models have been extremely successful in solving a variety of natural language processing (NLP) tasks such as reading comprehension, natural language inference, and sentiment analysis. All BERT-based architectures have a self-attention block followed by a block of intermediate layers as their basic building component. However, a strong justification for the inclusion of these intermediate layers remains missing in the literature. In this work, we investigate the importance of intermediate layers for the overall network performance on downstream tasks. We show that reducing the number of intermediate layers and modifying the architecture of BERT-BASE results in minimal loss in fine-tuning accuracy on downstream tasks while decreasing the model's parameter count and training time. Additionally, we use centered kernel alignment and probing linear classifiers to gain insight into our architectural modifications, and show that the removal of intermediate layers has little impact on fine-tuned accuracy.
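The abstract mentions centered kernel alignment (CKA) as a tool for comparing layer representations. As a rough illustration of the idea, here is a minimal sketch of the linear variant of CKA; the exact CKA formulation and layer activations used in the paper are not given in this excerpt, so the function below and its inputs are assumptions for illustration only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices X and Y of shape (n_examples, n_features).

    Returns a similarity score in [0, 1]; 1 means the two
    representations are identical up to rotation and scaling.
    """
    # Center each feature dimension (the "centered" in CKA).
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)
```

In a study like this one, X and Y would hold activations of two layers (e.g., the same layer in the original and the modified network) over the same batch of inputs; a high CKA score indicates the two layers encode similar information.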