Paper Title
Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation
Paper Authors
Paper Abstract
Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we use a state-of-the-art English-Vietnamese translation model to translate and produce both pretraining and supervised data in the biomedical domain. Thanks to this large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubmedT5 achieves state-of-the-art results on two different biomedical benchmarks, in summarization and acronym disambiguation. Further, we release ViMedNLI, a new NLP task in Vietnamese translated from MedNLI using the recently released En-Vi translation model and carefully refined by human experts, and we evaluate existing methods against ViPubmedT5.