PseudoReasoner: Leveraging Pseudo Labels for Commonsense Knowledge Base Population

Paper Title

PseudoReasoner: Leveraging Pseudo Labels for Commonsense Knowledge Base Population

Authors

Fang, Tianqing, Do, Quyet V., Zhang, Hongming, Song, Yangqiu, Wong, Ginny Y., See, Simon

Abstract

Commonsense Knowledge Base (CSKB) Population aims at reasoning over unseen entities and assertions on CSKBs, and is an important yet hard commonsense reasoning task. One challenge is that it requires out-of-domain generalization ability, as the source CSKB for training is of a relatively small scale (1M) while the whole candidate space for population is much larger (200M). We propose PseudoReasoner, a semi-supervised learning framework for CSKB population that uses a teacher model pre-trained on CSKBs to provide pseudo labels on the unlabeled candidate dataset for a student model to learn from. The teacher can be a generative model rather than being restricted to discriminative models as in previous work. In addition, we design a new filtering procedure for pseudo labels based on influence functions and the student model's predictions to further improve performance. The framework improves the backbone model KG-BERT (RoBERTa-large) by 3.3 points on overall performance and, in particular, by 5.3 points on out-of-domain performance, and achieves state-of-the-art results. Code and data are available at https://github.com/HKUST-KnowComp/PseudoReasoner.
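
The teacher-student loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pseudo_label`, the two thresholds, and the simple agreement check are all assumptions made for illustration. In particular, the agreement check is only a simplified stand-in for the paper's filtering procedure, and the influence-function part of that procedure is not modeled here at all. The scorers are assumed to return a plausibility value in [0, 1] for a (head, relation, tail) triple.

```python
from typing import Callable, List, Tuple

# A candidate assertion: (head, relation, tail). Illustrative type only.
Triple = Tuple[str, str, str]


def pseudo_label(
    candidates: List[Triple],
    teacher_score: Callable[[Triple], float],
    student_score: Callable[[Triple], float],
    pos_threshold: float = 0.9,  # hypothetical value, not from the paper
    neg_threshold: float = 0.1,  # hypothetical value, not from the paper
) -> List[Tuple[Triple, int]]:
    """Assign pseudo labels from teacher scores and keep only the ones
    the student's current prediction agrees with (simplified filter)."""
    labeled: List[Tuple[Triple, int]] = []
    for triple in candidates:
        t = teacher_score(triple)
        if t >= pos_threshold:
            label = 1
        elif t <= neg_threshold:
            label = 0
        else:
            # The teacher is not confident either way: skip this candidate.
            continue
        # Keep the pseudo label only when the student's prediction,
        # thresholded at 0.5, points to the same class.
        student_says_positive = student_score(triple) >= 0.5
        if student_says_positive == (label == 1):
            labeled.append((triple, label))
    return labeled
```

The returned pairs would then be mixed with the labeled CSKB triples to train the student further. Swapping the agreement check for a different filter only requires changing the final condition.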
