Paper Title
BoostingBERT:Integrating Multi-Class Boosting into BERT for NLP Tasks
Paper Authors
Paper Abstract
As a pre-trained Transformer model, BERT (Bidirectional Encoder Representations from Transformers) has achieved ground-breaking performance on multiple NLP tasks. On the other hand, Boosting is a popular ensemble learning technique which combines many base classifiers and has been demonstrated to yield better generalization performance in many machine learning tasks. Some works have indicated that an ensemble of BERT models can further improve application performance. However, current ensemble approaches focus on bagging or stacking, and little effort has been devoted to exploring boosting. In this work, we propose a novel BoostingBERT model that integrates multi-class boosting into BERT. Our proposed model uses pre-trained Transformers as base classifiers, selecting harder training examples for fine-tuning, and thereby gains the benefits of both pre-trained language knowledge and boosting ensembles in NLP tasks. We evaluate the proposed model on the GLUE benchmark and 3 popular Chinese NLU benchmarks. Experimental results demonstrate that our proposed model significantly outperforms BERT on all datasets, confirming its effectiveness on many NLP tasks. Replacing BERT-base with RoBERTa as the base classifier, BoostingBERT achieves new state-of-the-art results on several NLP tasks. We also use knowledge distillation within the "teacher-student" framework to reduce the computational overhead and model storage of BoostingBERT while preserving its performance for practical applications.
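The abstract describes a multi-class boosting loop in which each base classifier is a fine-tuned pre-trained Transformer and misclassified ("harder") examples are up-weighted for the next round. The sketch below is a minimal illustration of such a loop in the style of SAMME multi-class boosting; it is not the paper's exact procedure. The factory `make_base_classifier` and its `fit(X, y, sample_weight)` / `predict(X)` interface are hypothetical stand-ins for a BERT fine-tuner.

```python
import numpy as np

def boosting_ensemble(make_base_classifier, X, y, n_classes, n_rounds=3):
    """SAMME-style multi-class boosting sketch.

    make_base_classifier() is assumed to return a fresh classifier
    (e.g. a BERT fine-tuner) exposing fit(X, y, sample_weight) and
    predict(X); the paper's actual reweighting scheme may differ.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                   # start with uniform sample weights
    classifiers, alphas = [], []

    for _ in range(n_rounds):
        clf = make_base_classifier()
        clf.fit(X, y, sample_weight=w)        # fine-tune on the weighted data
        pred = clf.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 1.0 - 1.0 / n_classes:      # no better than random guessing: stop
            break
        # SAMME classifier weight for the multi-class case
        alpha = np.log((1.0 - err) / max(err, 1e-10)) + np.log(n_classes - 1)
        # up-weight the misclassified ("harder") examples for the next round
        w *= np.exp(alpha * (pred != y))
        w /= w.sum()
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def ensemble_predict(classifiers, alphas, X, n_classes):
    """Combine the boosted base classifiers by a weighted vote."""
    votes = np.zeros((len(X), n_classes))
    for clf, alpha in zip(classifiers, alphas):
        votes[np.arange(len(X)), clf.predict(X)] += alpha
    return votes.argmax(axis=1)
```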
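The abstract also mentions compressing the boosted ensemble via knowledge distillation in a teacher-student setup. A common formulation of the distillation objective, shown below as a hedged sketch, mixes a soft-target KL term at temperature `T` with the hard-label cross-entropy; the temperature and mixing weight `alpha` are illustrative assumptions, not values from the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Teacher-student distillation loss sketch: weighted sum of the
    temperature-scaled KL divergence to the teacher's soft targets and
    the standard cross-entropy on the gold labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```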