Paper Title

Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction

Authors

Stefan Heid, Marcel Wever, Eyke Hüllermeier

Abstract

Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and subsequent automated natural language processing (NLP) tasks. This problem is commonly tackled using machine learning methods, i.e., by training a POS tagger on a sufficiently large corpus of labeled data. While the problem of POS tagging can essentially be considered as solved for modern languages, historical corpora turn out to be much more difficult, especially due to the lack of native speakers and sparsity of training data. Moreover, most texts have no sentences as we know them today, nor a common orthography. These irregularities render the task of automated POS tagging more difficult and error-prone. Under these circumstances, instead of forcing the POS tagger to predict and commit to a single tag, it should be enabled to express its uncertainty. In this paper, we consider POS tagging within the framework of set-valued prediction, which allows the POS tagger to express its uncertainty via predicting a set of candidate POS tags instead of guessing a single one. The goal is to guarantee a high confidence that the correct POS tag is included while keeping the number of candidates small. In our experimental study, we find that extending state-of-the-art POS taggers to set-valued prediction yields more precise and robust taggings, especially for unknown words, i.e., words not occurring in the training data.
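To make the idea of set-valued prediction concrete, the following is a minimal sketch of one common way to turn a probabilistic tagger's per-token distribution into a prediction set: candidate tags are added in order of decreasing probability until a chosen confidence level is covered. This is an illustrative construction, not necessarily the paper's exact method; the tag inventory, probabilities, and the `confidence` parameter below are assumptions for the example.

```python
# Minimal sketch: build a set-valued POS prediction from a tagger's
# per-token probability distribution. The tags, probabilities, and the
# default confidence level are illustrative, not taken from the paper.

def set_valued_tags(tag_probs, confidence=0.95):
    """Return the smallest set of candidate tags (by greedy inclusion in
    descending probability order) whose cumulative probability reaches
    the requested confidence level."""
    ranked = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)
    prediction_set, cumulative = [], 0.0
    for tag, prob in ranked:
        prediction_set.append(tag)
        cumulative += prob
        if cumulative >= confidence:
            break
    return prediction_set

# An uncertain token, e.g., an unknown historical spelling:
probs = {"NOUN": 0.55, "VERB": 0.30, "ADJ": 0.10, "ADV": 0.05}
print(set_valued_tags(probs))        # ['NOUN', 'VERB', 'ADJ']
print(set_valued_tags(probs, 0.5))   # ['NOUN'] -- a confident single tag
```

Under this scheme, a tagger that is certain about a token still commits to a single tag, while uncertain cases (such as unknown words) yield larger candidate sets, trading set size against coverage of the correct tag.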
