Paper Title

BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model

Paper Authors

Hongyi Yuan, Zheng Yuan, Ruyi Gan, Jiaxing Zhang, Yutao Xie, Sheng Yu

Abstract

Pretrained language models have served as important backbones for natural language processing. Recently, in-domain pretraining has been shown to benefit various domain-specific downstream tasks. In the biomedical domain, natural language generation (NLG) tasks are of critical importance, yet understudied. Approaching natural language understanding (NLU) tasks as NLG achieves satisfactory performance in the general domain through constrained language generation or language prompting. We emphasize the lack of in-domain generative language models and of systematic generative downstream benchmarks in the biomedical domain, both of which hinder the development of the research community. In this work, we introduce the generative language model BioBART, which adapts BART to the biomedical domain. We collate various biomedical language generation tasks, including dialogue, summarization, entity linking, and named entity recognition. BioBART, pretrained on PubMed abstracts, achieves improved performance compared to BART and sets strong baselines on several tasks. Furthermore, we conduct ablation studies on BioBART's pretraining tasks and find that sentence permutation has negative effects on downstream tasks.
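
To make the ablation concrete: BART is pretrained as a denoising autoencoder, and the two noising functions at issue are text infilling (random spans replaced by a single mask token, with span lengths drawn from a Poisson distribution in the original BART) and sentence permutation (shuffling sentence order). The toy sketch below is our own illustration, not the authors' pretraining code; the function names and the naive sentence splitter are ours. Per the abstract, sentence permutation is the noise found to hurt downstream biomedical tasks.

```python
# Toy illustration of BART's two noising functions (not the authors' code).
import random
import re

import numpy as np


def sentence_permutation(text: str) -> str:
    """Shuffle sentence order: the noise BioBART's ablation finds harmful."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    random.shuffle(sentences)
    return " ".join(sentences)


def text_infilling(tokens: list[str], mask_prob: float = 0.3) -> list[str]:
    """Replace random spans with a single <mask> token.

    Span lengths follow Poisson(lambda=3) as in the original BART;
    zero-length spans (pure mask insertions) are omitted for simplicity.
    """
    noised, i = [], 0
    while i < len(tokens):
        if random.random() < mask_prob:
            span = max(1, np.random.poisson(lam=3.0))
            noised.append("<mask>")
            i += span
        else:
            noised.append(tokens[i])
            i += 1
    return noised


if __name__ == "__main__":
    text = ("BART corrupts input text with noise. The decoder then "
            "reconstructs the original. BioBART applies this to PubMed.")
    print(sentence_permutation(text))
    print(" ".join(text_infilling(text.split())))
```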

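For readers who want to try the model, a minimal inference sketch with Hugging Face Transformers follows. The checkpoint identifier "GanjinZero/biobart-base" is an assumption about where the released weights are hosted, not something stated in the abstract; substitute the actual name if it differs. Note that the raw pretrained model only learns to reconstruct noised text, so task-specific fine-tuning (e.g., for summarization or dialogue) is needed before its generations are useful.

```python
# Minimal sketch: loading a BioBART-style checkpoint for seq2seq generation.
# "GanjinZero/biobart-base" is an ASSUMED Hub identifier, not confirmed here.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "GanjinZero/biobart-base"  # assumption; replace with the real ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

passage = (
    "In-domain pretraining has been shown to benefit various "
    "domain-specific downstream tasks in the biomedical domain."
)
inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=1024)
output_ids = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```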