论文标题

从选择数据中有效,准确的顶部 - $ K $恢复

Efficient and Accurate Top-$K$ Recovery from Choice Data

论文作者

Nguyen, Duc

论文摘要

学习与选择建模的交集是研究的活跃领域,并在电子商务,信息检索和社会科学中的应用。在某些应用程序(例如推荐系统)中,统计学家主要有兴趣使用被动收集的离散选择数据,即,用户从一组多个项目中选择一个项目,从而尽可能有效地从大量项目中恢复最高的项目集。在这种实际考虑的激励下,我们提出了基于选择的Borda Count算法,作为顶级$ k $ - 重新发现的快速准确的排名算法,即正确识别所有顶级$ K $项目。我们表明,在广泛的随机实用程序模型下,基于选择的Borda Count算法具有最佳的样本复杂性。我们证明,在极限上,基于选择的Borda计数算法与常用的最大似然估计方法产生相同的顶部 - $ K $估计值,但是前者的速度和简单性在实践中带来了很大的优势。合成数据集和真实数据集的实验表明,计数算法在准确性方面具有常用的排名算法竞争,同时更快地数量级。

The intersection of learning to rank and choice modeling is an active area of research with applications in e-commerce, information retrieval and the social sciences. In some applications such as recommendation systems, the statistician is primarily interested in recovering the set of the top ranked items from a large pool of items as efficiently as possible using passively collected discrete choice data, i.e., the user picks one item from a set of multiple items. Motivated by this practical consideration, we propose the choice-based Borda count algorithm as a fast and accurate ranking algorithm for top $K$-recovery i.e., correctly identifying all of the top $K$ items. We show that the choice-based Borda count algorithm has optimal sample complexity for top-$K$ recovery under a broad class of random utility models. We prove that in the limit, the choice-based Borda count algorithm produces the same top-$K$ estimate as the commonly used Maximum Likelihood Estimate method but the former's speed and simplicity brings considerable advantages in practice. Experiments on both synthetic and real datasets show that the counting algorithm is competitive with commonly used ranking algorithms in terms of accuracy while being several orders of magnitude faster.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源