ANEC: An Amharic Named Entity Corpus and Transformer Based Recognizer

Paper title

ANEC: An Amharic Named Entity Corpus and Transformer Based Recognizer

Authors

Jibril, Ebrahim Chekol; Tantuğ, A. Cüneyd

Abstract

Named Entity Recognition is an information extraction task that serves as a preprocessing step for other natural language processing tasks, such as machine translation, information retrieval, and question answering. Named entity recognition enables the identification of proper names as well as temporal and numeric expressions in an open domain text. For Semitic languages such as Arabic, Amharic, and Hebrew, the named entity recognition task is more challenging due to the heavily inflected structure of these languages. In this paper, we present an Amharic named entity recognition system based on bidirectional long short-term memory with a conditional random fields layer. We annotate a new Amharic named entity recognition dataset (8,070 sentences, which has 182,691 tokens) and apply Synthetic Minority Over-sampling Technique to our dataset to mitigate the imbalanced classification problem. Our named entity recognition system achieves an F_1 score of 93%, which is the new state-of-the-art result for Amharic named entity recognition.
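The abstract names the Synthetic Minority Over-sampling Technique but does not say how it was applied to the corpus. As a rough illustration of the general idea only, the sketch below creates synthetic minority-class points by interpolating between a sample and one of its nearest minority neighbours. The function name `smote`, its parameters `k` and `seed`, and the toy 2-D feature vectors are all assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Toy SMOTE: interpolate between minority samples (hypothetical helper).

    `minority` is a list of equal-length numeric tuples with at least two
    entries. This is an illustrative sketch, not the authors' implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to the chosen neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Assumed toy data: four minority-class feature vectors
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote(minority, n_synthetic=6)
```

Each synthetic point lies on a line segment between two real minority samples, so the minority class grows without exact duplication. In practice a maintained implementation is usually preferable to a hand-written one.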
