Paper Title
hmBERT: Historical Multilingual Language Models for Named Entity Recognition
Paper Authors
Paper Abstract
Compared to standard Named Entity Recognition (NER), identifying persons, locations, and organizations in historical texts constitutes a major challenge. To obtain machine-readable corpora, historical texts are usually scanned and Optical Character Recognition (OCR) must be performed; as a result, the historical corpora contain errors. Additionally, entities such as locations or organizations can change over time, which poses another challenge. Overall, historical texts come with several peculiarities that differ greatly from those of modern texts, and large labeled corpora for training a neural tagger are hardly available for this domain. In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models. We circumvent the need for large amounts of labeled data by pretraining a language model on unlabeled data. We propose hmBERT, a historical multilingual BERT-based language model, and release the model in several versions of different sizes. Furthermore, we evaluate the capability of hmBERT by solving downstream NER as part of this year's HIPE-2022 shared task and provide detailed analysis and insights. For the Multilingual Classical Commentary coarse-grained NER challenge, our tagger HISTeria outperforms the other teams' models for two out of three languages.
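
As a rough illustration of the workflow the abstract describes (a pretrained multilingual BERT checkpoint adapted to token-level NER), the following is a minimal sketch using the Hugging Face transformers library. The model ID and label set here are assumptions for illustration only; consult the authors' release for the actual hmBERT checkpoint names and sizes.

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Assumed checkpoint name for illustration; the released hmBERT
    # models may be published under different IDs on the Hub.
    model_id = "dbmdz/bert-base-historic-multilingual-cased"

    # BIO label scheme for the three coarse entity types mentioned
    # in the abstract (persons, locations, organizations).
    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=dict(enumerate(labels)),
        label2id={label: i for i, label in enumerate(labels)},
    )
    # The classification head above is randomly initialized; fine-tuning
    # on a labeled historical NER corpus (e.g. the HIPE-2022 datasets)
    # would follow before the model can produce useful predictions.

This mirrors the two-stage approach in the abstract: the expensive step (pretraining on unlabeled historical text) is done once and released, while downstream users only fine-tune a lightweight task head on comparatively small labeled data.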