Paper Title

Named Entity Recognition without Labelled Data: A Weak Supervision Approach

Paper Authors

Pierre Lison, Aliaksandr Hubin, Jeremy Barnes, Samia Touileb

Paper Abstract

Named Entity Recognition (NER) performance often degrades rapidly when applied to target domains that differ from the texts observed during training. When in-domain labelled data is available, transfer learning techniques can be used to adapt existing NER models to the target domain. But what should one do when there is no hand-labelled data for the target domain? This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision. The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain. These annotations are then merged together using a hidden Markov model which captures the varying accuracies and confusions of the labelling functions. A sequence labelling model can finally be trained on the basis of this unified annotation. We evaluate the approach on two English datasets (CoNLL 2003 and news articles from Reuters and Bloomberg) and demonstrate an improvement of about 7 percentage points in entity-level $F_1$ scores compared to an out-of-domain neural NER model.
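
As a rough illustration of the pipeline the abstract describes, the sketch below applies two hypothetical labelling functions to a tokenised sentence and merges their token-level outputs. The paper performs this merging with a hidden Markov model that learns each function's accuracy and confusion matrix; for brevity the sketch substitutes a simple majority vote, and all function names, the gazetteer, and the label set are assumptions for illustration only, not the authors' code.

```python
# Minimal sketch of the weak-supervision pipeline described in the abstract.
# NOTE: the paper aggregates labelling-function outputs with an HMM; this
# sketch uses a majority vote as a simplified stand-in, and every name below
# (GAZETTEER, lf_*, the label set) is an illustrative assumption.

from collections import Counter
from typing import Callable, List

Tokens = List[str]
Labels = List[str]  # one BIO-style tag per token, e.g. "B-ORG", "O"

# --- Hypothetical labelling functions --------------------------------------

GAZETTEER = {"Reuters": "ORG", "Bloomberg": "ORG", "London": "LOC"}

def lf_gazetteer(tokens: Tokens) -> Labels:
    """Tag tokens found in a small gazetteer; everything else is 'O'."""
    return [f"B-{GAZETTEER[t]}" if t in GAZETTEER else "O" for t in tokens]

def lf_capitalised(tokens: Tokens) -> Labels:
    """Crude heuristic: capitalised non-initial tokens are tagged as MISC."""
    return ["B-MISC" if i > 0 and t[:1].isupper() else "O"
            for i, t in enumerate(tokens)]

LABELLING_FUNCTIONS: List[Callable[[Tokens], Labels]] = [
    lf_gazetteer,
    lf_capitalised,
]

# --- Aggregation (majority-vote stand-in for the paper's HMM) ---------------

def aggregate(tokens: Tokens) -> Labels:
    """Merge the labelling functions' outputs token by token.

    Ties are broken in favour of the function listed first, since
    Counter.most_common preserves first-encountered order for equal counts.
    """
    all_outputs = [lf(tokens) for lf in LABELLING_FUNCTIONS]
    merged = []
    for position in range(len(tokens)):
        votes = Counter(output[position] for output in all_outputs)
        merged.append(votes.most_common(1)[0][0])
    return merged

if __name__ == "__main__":
    sentence = "Reuters reported from London on Tuesday".split()
    # The merged sequence would then serve as training data for a
    # sequence labelling model, as in the final step of the paper.
    print(list(zip(sentence, aggregate(sentence))))
```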
