Paper Title
Unsupervised Label Refinement Improves Dataless Text Classification
Paper Authors
Paper Abstract
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description. While promising, it crucially relies on accurate descriptions of the label set for each downstream task. This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice. In this paper, we ask the following question: how can we improve dataless text classification using the inputs of the downstream task dataset? Our primary solution is a clustering-based approach. Given a dataless classifier, our approach refines its set of predictions using k-means clustering. We demonstrate the broad applicability of our approach by improving the performance of two widely used classifier architectures, one that encodes text-category pairs with two independent encoders and one with a single joint encoder. Experiments show that our approach consistently improves dataless classification across different datasets and makes the classifier more robust to the choice of label descriptions.
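The abstract states only that the predictions of a dataless classifier are refined with k-means clustering; it does not give the details. The sketch below is therefore one plausible reading, not the authors' implementation. It assumes documents and label descriptions are already embedded as vectors, uses cosine similarity as a stand-in for a dual-encoder dataless classifier, and seeds each k-means centroid from the documents initially predicted for that label so that cluster index and label index stay aligned. All function names and the toy 2-D vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def dataless_predict(doc_vecs, label_vecs):
    """Stand-in dataless classifier (assumption): pick the best-scoring label description."""
    k = len(label_vecs)
    return [max(range(k), key=lambda c: cosine(d, label_vecs[c])) for d in doc_vecs]

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def refine_with_kmeans(doc_vecs, label_vecs, init_preds, n_iter=20):
    """Refine initial predictions with k-means (Lloyd's algorithm).

    Assumption: centroid c starts as the mean of the documents the dataless
    classifier assigned to label c, so cluster c keeps meaning "label c".
    A label with no documents keeps its previous centroid, which is
    initially the label-description vector.
    """
    k = len(label_vecs)
    preds = list(init_preds)
    centroids = [list(v) for v in label_vecs]
    for _ in range(n_iter):
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [d for d, p in zip(doc_vecs, preds) if p == c]
            if members:
                centroids[c] = mean(members)
        # Assignment step: move each document to its nearest centroid.
        new_preds = [min(range(k), key=lambda c: sq_dist(d, centroids[c]))
                     for d in doc_vecs]
        if new_preds == preds:
            break
        preds = new_preds
    return preds

# Toy data (hypothetical): four documents of class 0, three of class 1.
docs = [
    [1.0, 0.0], [0.98, 0.17], [0.94, 0.34], [0.82, 0.57],
    [0.17, 0.98], [0.0, 1.0], [-0.17, 0.98],
]
# Label 0 has a deliberately poor description vector, label 1 a good one.
labels = [[0.87, -0.5], [0.0, 1.0]]

initial = dataless_predict(docs, labels)
refined = refine_with_kmeans(docs, labels, initial)
print(initial)  # [0, 0, 0, 1, 1, 1, 1]
print(refined)  # [0, 0, 0, 0, 1, 1, 1]
```

In this toy case the poorly chosen description for label 0 causes the fourth document to be scored closer to label 1, and the clustering step moves it back because it lies nearer to the other class-0 documents. This mirrors the abstract's claim of improved robustness to label descriptions only in spirit; it says nothing about the reported results.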