Paper Title
Long-tailed Extreme Multi-label Text Classification with Generated Pseudo Label Descriptions
Authors
Abstract
Extreme multi-label text classification (XMTC) is a difficult challenge in machine learning research and applications because of the sheer size of the label spaces and the severe data scarcity associated with the long tail of rare labels in highly skewed distributions. This paper addresses the challenge of tail label prediction by proposing a novel approach that combines the effectiveness of a trained bag-of-words (BoW) classifier in generating informative label descriptions under severely data-scarce conditions with the power of neural embedding-based retrieval models in mapping input documents (as queries) to relevant label descriptions. The proposed approach achieves state-of-the-art performance on XMTC benchmark datasets and significantly outperforms the best previous methods in tail label prediction. We also provide a theoretical analysis relating the BoW and neural models with respect to the performance lower bound.