Paper Title

HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Authors

Haneul Yoo, Jiho Jin, Juhee Son, JinYeong Bak, Kyunghyun Cho, Alice Oh

Abstract

Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this time period have been analyzing the documents, but that process is very difficult and time-consuming, and language models would significantly speed it up. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models with continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show that training on the two corpora yields significant improvements. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by historians, and not at all by the NLP community.
