Identifying Experts in Question & Answer Portals: A Case Study on Data Science Competencies in Reddit

Paper title

Identifying Experts in Question & Answer Portals: A Case Study on Data Science Competencies in Reddit

Paper authors

Strukova, Sofia; Ruipérez-Valiente, José A.; Mármol, Félix Gómez

Abstract

The key to the success of Question & Answer (Q&A) platforms is their users, who provide high-quality answers to the challenging questions posted across various topics of interest. For more than a decade, the expert finding problem has attracted much attention in information retrieval research. Motivated by gaps in expert identification across several Q&A portals, we inspect the feasibility of identifying data science experts in Reddit. Our method is based on manual coding results in which two data science experts labelled not only expert and non-expert comments but also out-of-scope comments; this is a novel contribution to the literature that enables the identification of more groups of comments across web portals. We present a semi-supervised approach that combines 1,113 labelled comments with 100,226 unlabelled comments during training. The proposed model uses the activity behaviour of every user, drawing on Natural Language Processing (NLP), crowdsourced and user feature sets. We conclude that the NLP and user feature sets contribute the most to the identification of these three classes, which indicates that the method can generalise well within the domain. Finally, we make a novel contribution by presenting different types of users in Reddit, which opens many future research directions.
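
The abstract does not say which semi-supervised algorithm the authors use. Purely as an illustration of how a small labelled set can be combined with a large unlabelled pool, the sketch below uses self-training (pseudo-labelling) over a toy bag-of-words nearest-centroid classifier. The classifier, the `threshold` and `max_rounds` parameters, the function names and the example comments are all assumptions made for this sketch and are not taken from the paper; only the three class names come from the abstract.

```python
from collections import Counter, defaultdict
import math


def bag(text):
    """Lower-cased bag-of-words vector for one comment."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroids(examples):
    """Sum the word counts of every class (an unnormalised centroid)."""
    sums = defaultdict(Counter)
    for text, label in examples:
        sums[label].update(bag(text))
    return sums


def predict(cents, text):
    """Return (best label, margin between best and second-best score)."""
    vec = bag(text)
    scores = sorted(((cosine(vec, c), label) for label, c in cents.items()),
                    reverse=True)
    best_score, best_label = scores[0]
    second = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - second


def self_train(labelled, unlabelled, threshold=0.2, max_rounds=5):
    """Hypothetical self-training loop: pseudo-label confident comments,
    add them to the training set, refit, and repeat."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        cents = centroids(labelled)
        confident, rest = [], []
        for text in pool:
            label, margin = predict(cents, text)
            if margin >= threshold:
                confident.append((text, label))
            else:
                rest.append(text)
        if not confident:
            break  # nothing new to learn from
        labelled.extend(confident)
        pool = rest
    return centroids(labelled), labelled


# Toy data (invented for illustration; not from the paper's data set).
seed = [
    ("gradient boosting regularisation cross validation", "expert"),
    ("what is python help", "non-expert"),
    ("funny cat meme lol", "out-of-scope"),
]
pool = ["cross validation with gradient boosting", "lol cat", "help python"]

model, training_set = self_train(seed, pool)
```

In this sketch a comment is pseudo-labelled only when the gap between its best and second-best class similarity reaches `threshold`, which limits how much label noise is fed back into training.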
