Paper Title
Revisiting and Advancing Chinese Natural Language Understanding with Accelerated Heterogeneous Knowledge Pre-training
Paper Authors
Paper Abstract
Recently, knowledge-enhanced pre-trained language models (KEPLMs) have improved context-aware representations by learning from structured relations in knowledge graphs and/or linguistic knowledge from syntactic or dependency analysis. Unlike for English, the natural language processing (NLP) community lacks high-performing open-source Chinese KEPLMs to support various language understanding applications. In this paper, we revisit and advance the development of Chinese natural language understanding with a series of novel Chinese KEPLMs released in various parameter sizes, namely CKBERT (Chinese knowledge-enhanced BERT). Specifically, both relational and linguistic knowledge is effectively injected into CKBERT via two novel pre-training tasks, i.e., linguistic-aware masked language modeling and contrastive multi-hop relation modeling. Based on these two pre-training paradigms and our in-house TorchAccelerator implementation, we have pre-trained the base (110M), large (345M), and huge (1.3B) versions of CKBERT efficiently on GPU clusters. Experiments demonstrate that CKBERT outperforms strong Chinese baselines across various benchmark NLP tasks and different model sizes.
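As a rough illustration of what a contrastive relation objective of this kind can look like, the sketch below implements an InfoNCE-style loss over a pooled entity representation, one encoding of a correct multi-hop relation path, and several encodings of corrupted paths. This is a minimal sketch under stated assumptions: the function name `contrastive_relation_loss`, the temperature value, and the input shapes are illustrative and are not taken from the CKBERT paper.

```python
# Minimal sketch (assumption): an InfoNCE-style contrastive loss as one plausible
# form of "contrastive multi-hop relation modeling"; not the exact CKBERT objective.
import torch
import torch.nn.functional as F


def contrastive_relation_loss(anchor, positive, negatives, temperature=0.05):
    """
    anchor:    [B, H]    pooled representation of the target entity mention
    positive:  [B, H]    representation of a correct multi-hop relation path
    negatives: [B, K, H] representations of K corrupted relation paths
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarities scaled by a temperature.
    pos_sim = (anchor * positive).sum(-1, keepdim=True) / temperature      # [B, 1]
    neg_sim = torch.einsum("bh,bkh->bk", anchor, negatives) / temperature  # [B, K]

    # Cross-entropy with the positive path placed at index 0.
    logits = torch.cat([pos_sim, neg_sim], dim=-1)                         # [B, 1+K]
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B, K, H = 4, 8, 768  # toy batch size, negatives per sample, hidden size
    loss = contrastive_relation_loss(
        torch.randn(B, H), torch.randn(B, H), torch.randn(B, K, H)
    )
    print(loss.item())
```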