Paper Title
Better Pre-Training by Reducing Representation Confusion
Paper Authors
Paper Abstract
In this work, we revisit Transformer-based pre-trained language models and identify two distinct types of information confusion, in position encoding and in model representations, respectively. First, we show that in relative position encoding, jointly modeling relative distances and directions confuses these two heterogeneous kinds of information. This may prevent the model from capturing the associative semantics of the same distance in opposite directions, which in turn hurts performance on downstream tasks. Second, we observe that BERT pre-trained with the Masked Language Modeling (MLM) objective outputs similar token representations (the last hidden states of different tokens) and similar head representations (the attention weights of different heads), which may limit the diversity of information expressed by different tokens and heads. Motivated by these observations, we propose two novel techniques to improve pre-trained language models: Decoupled Directional Relative Position (DDRP) encoding and the MTH pre-training objective. DDRP decouples the relative distance features from the directional features in classical relative position encoding. MTH adds two novel auxiliary regularizers to MLM that enlarge the dissimilarities between (a) the last hidden states of different tokens and (b) the attention weights of different heads. These designs allow the model to capture different categories of information more distinctly, alleviating information confusion in representation learning and leading to better optimization. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of our proposed methods.
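To make the DDRP idea more concrete, below is a minimal PyTorch sketch of one way relative distance and direction could be decoupled into separate embedding tables that each contribute to an additive attention bias. The module name, the additive combination, and parameters such as `max_distance` are illustrative assumptions; the abstract only states that the two features are decoupled, not how they are recombined.

```python
# Hypothetical sketch of decoupled relative position features, assuming a
# Shaw/T5-style additive attention bias; the combination rule and the
# "max_distance" clipping are assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn


class DecoupledRelativePositionBias(nn.Module):
    def __init__(self, num_heads: int, max_distance: int = 128):
        super().__init__()
        # Separate tables: one keyed only by |i - j|, one keyed only by sign(i - j).
        self.distance_emb = nn.Embedding(max_distance + 1, num_heads)
        self.direction_emb = nn.Embedding(3, num_heads)  # left / same / right
        self.max_distance = max_distance

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                 # (L, L) signed offsets
        dist = rel.abs().clamp(max=self.max_distance)     # distance-only feature
        direction = torch.sign(rel).long() + 1            # {-1, 0, 1} -> {0, 1, 2}
        # Decoupled modeling: the two heterogeneous features get their own
        # parameters and are only merged at the end (here by simple addition).
        bias = self.distance_emb(dist) + self.direction_emb(direction)
        return bias.permute(2, 0, 1)                      # (num_heads, L, L)


# Usage: add the bias to raw attention scores before the softmax.
bias = DecoupledRelativePositionBias(num_heads=12)(seq_len=16)
print(bias.shape)  # torch.Size([12, 16, 16])
```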
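Similarly, here is a hedged sketch of the two MTH auxiliary regularizers: one term enlarging the dissimilarity between the last hidden states of different tokens, and one between the attention weights of different heads. The cosine-similarity penalty, the per-example averaging, and the weights `lambda_t` / `lambda_h` are assumptions for illustration only; the paper's exact loss form may differ.

```python
# Minimal sketch of MTH-style diversity regularizers, assuming they can be
# approximated by penalizing average pairwise cosine similarity.
import torch
import torch.nn.functional as F


def pairwise_similarity_penalty(x: torch.Tensor) -> torch.Tensor:
    """Mean off-diagonal cosine similarity of the rows of x (shape: [n, d]).

    Minimizing this term pushes the n representations apart,
    i.e. enlarges their mutual dissimilarity.
    """
    x = F.normalize(x, dim=-1)
    sim = x @ x.t()                                   # (n, n) cosine similarities
    n = sim.size(0)
    off_diag = sim - torch.eye(n, device=sim.device)  # drop self-similarity
    return off_diag.sum() / (n * (n - 1))


def mth_auxiliary_losses(last_hidden, attn_weights):
    # last_hidden:  (batch, seq_len, hidden)          last hidden states of tokens
    # attn_weights: (batch, heads, seq_len, seq_len)  attention maps of heads
    token_loss = torch.stack(
        [pairwise_similarity_penalty(h) for h in last_hidden]            # per example
    ).mean()
    head_loss = torch.stack(
        [pairwise_similarity_penalty(a.flatten(1)) for a in attn_weights]
    ).mean()
    return token_loss, head_loss


# Usage with random tensors standing in for model outputs:
h = torch.randn(2, 16, 768)
a = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
t_loss, h_loss = mth_auxiliary_losses(h, a)
# total_loss = mlm_loss + lambda_t * t_loss + lambda_h * h_loss  (weights assumed)
```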