Paper Title

HiNER: A Large Hindi Named Entity Recognition Dataset

Paper Authors

Rudra Murthy, Pallab Bhattacharjee, Rahul Sharnagat, Jyotsana Khatri, Diptesh Kanojia, Pushpak Bhattacharyya

Paper Abstract

Named Entity Recognition (NER) is a foundational NLP task that aims to provide class labels like Person, Location, Organisation, Time, and Number to words in free text. Named Entities can also be multi-word expressions where the additional I-O-B annotation information helps label them during the NER annotation process. While English and European languages have considerable annotated data for the NER task, Indian languages lack on that front -- both in terms of quantity and following annotation standards. This paper releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags. We discuss the dataset statistics in all their essential detail and provide an in-depth analysis of the NER tag-set used with our data. The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation. Since the proof of resource-effectiveness is in building models with the resource and testing the model on benchmark data and against the leader-board entries in shared tasks, we do the same with the aforesaid data. We use different language models to perform the sequence labelling task for NER and show the efficacy of our data by performing a comparative evaluation with models trained on another dataset available for the Hindi NER task. Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper. To the best of our knowledge, no available dataset meets the standards of volume (amount) and variability (diversity), as far as Hindi NER is concerned. We fill this gap through this work, which we hope will significantly help NLP for Hindi. We release this dataset with our code and models at https://github.com/cfiltnlp/HiNER
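
Two points in the abstract are easy to make concrete: multi-word entities are handled with I-O-B annotation, and models are scored with a weighted F1. The sketch below illustrates both on a toy sentence; the example sentence, the specific tag names, and the use of scikit-learn are illustrative assumptions, not the paper's data or evaluation code.

```python
# A minimal sketch (not the paper's evaluation script) of how I-O-B tags
# mark a multi-word entity and how a token-level weighted F1 can be
# computed. The sentence, tag names, and the use of scikit-learn are
# illustrative assumptions.
from sklearn.metrics import f1_score

# A two-token Person entity gets B-PERSON on its first token and
# I-PERSON on the continuation; single-token entities get only a B- tag.
tokens    = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
gold_tags = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION"]

# A hypothetical model output that misses the entity's second token.
pred_tags = ["B-PERSON", "O", "O", "O", "O", "B-LOCATION"]

# Weighted F1 averages per-tag F1 scores, weighting each tag by its
# support, i.e. how often it occurs in the gold annotation.
print(f1_score(gold_tags, pred_tags, average="weighted", zero_division=0))
```

Note that span-level NER evaluation (e.g. with seqeval) would instead score whole entities, so a B-PERSON without its matching I-PERSON counts as a miss; the token-level weighted average above is only one possible reading of the reported metric.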
