Measuring the Novelty of Natural Language Text Using the Conjunctive Clauses of a Tsetlin Machine Text Classifier

Paper title

Measuring the Novelty of Natural Language Text Using the Conjunctive Clauses of a Tsetlin Machine Text Classifier

Authors

Bimal Bhattarai, Ole-Christoffer Granmo, Lei Jiao

Abstract

Most supervised text classification approaches assume a closed world, counting on all classes being present in the data at training time. This assumption can lead to unpredictable behaviour during operation whenever novel, previously unseen classes appear. Although deep learning-based methods have recently been used for novelty detection, they are challenging to interpret due to their black-box nature. This paper addresses interpretable open-world text classification, where the trained classifier must deal with novel classes during operation. To this end, we extend the recently introduced Tsetlin machine (TM) with a novelty scoring mechanism. The mechanism uses the conjunctive clauses of the TM to measure to what degree a text matches the classes covered by the training data. We demonstrate that the clauses provide a succinct, interpretable description of known topics, and that our scoring mechanism makes it possible to discern novel topics from the known ones. Empirically, our TM-based approach outperforms seven other novelty detection schemes on three out of five datasets, and performs second and third best on the remaining two, with the added benefit of an interpretable propositional logic-based representation.
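
The abstract does not spell out the scoring mechanism, so the following is only a toy sketch of the general idea, not the paper's method. It assumes a bag-of-words input, represents each conjunctive clause as a set of words that must be present and a set that must be absent, and flags a text as novel when no known class collects enough clause votes. The clauses, the names `MODEL`, `class_scores` and `is_novel`, and the threshold are all hypothetical and hand-written for illustration; a real Tsetlin machine would learn its clauses from training data.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical clause encoding: (words that must be present,
# words that must be absent, polarity: +1 votes for the class, -1 against).
Clause = Tuple[FrozenSet[str], FrozenSet[str], int]

# Hand-written toy clauses; a trained Tsetlin machine would learn these.
MODEL: Dict[str, List[Clause]] = {
    "sports": [
        (frozenset({"match", "team"}), frozenset(), +1),
        (frozenset({"goal"}), frozenset({"election"}), +1),
        (frozenset({"parliament"}), frozenset(), -1),
    ],
    "politics": [
        (frozenset({"election"}), frozenset(), +1),
        (frozenset({"parliament", "vote"}), frozenset(), +1),
        (frozenset({"goal"}), frozenset({"vote"}), -1),
    ],
}


def clause_fires(required: FrozenSet[str], forbidden: FrozenSet[str],
                 words: FrozenSet[str]) -> bool:
    """A conjunctive clause is true when every required word is present
    and no forbidden word is present."""
    return required <= words and not (forbidden & words)


def class_scores(model: Dict[str, List[Clause]], text: str) -> Dict[str, int]:
    """Sum the signed votes of all firing clauses, separately per class."""
    words = frozenset(text.lower().split())
    return {
        label: sum(polarity
                   for required, forbidden, polarity in clauses
                   if clause_fires(required, forbidden, words))
        for label, clauses in model.items()
    }


def is_novel(model: Dict[str, List[Clause]], text: str,
             threshold: int = 1) -> bool:
    """Assumed decision rule: the text is novel when even the best-matching
    known class scores below the threshold."""
    return max(class_scores(model, text).values()) < threshold


if __name__ == "__main__":
    for sample in ("the team won the match with a late goal",
                   "new telescope images reveal a distant galaxy"):
        print(sample, class_scores(MODEL, sample), is_novel(MODEL, sample))
```

Because every clause is a plain conjunction of word conditions, the clauses that fire for a given text can be read directly as the reason for its score. This readability is the kind of interpretability the abstract refers to.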
