AsNER -- Annotated Dataset and Baseline for Assamese Named Entity Recognition

Paper title

AsNER -- Annotated Dataset and Baseline for Assamese Named Entity Recognition

Authors

Dhrubajyoti Pathak, Sukumar Nandi, Priyankoo Sarmah

Abstract

We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens comprising the speech of the Prime Minister of India and Assamese play text. It also contains person names, location names, and addresses. The proposed NER dataset is likely to be a significant resource for deep-neural-network-based Assamese language processing. We benchmark the dataset by training and evaluating NER models using state-of-the-art architectures for supervised named entity recognition (NER), such as FastText, BERT, XLM-R, FLAIR, and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The highest F1-score among all baselines is 80.69%, obtained when using MuRIL as the word embedding method. The annotated dataset and the top-performing model are made publicly available.
