Paper Title


Modeling Attrition in Recommender Systems with Departing Bandits

Paper Authors

Omer Ben-Porat, Lee Cohen, Liu Leqi, Zachary C. Lipton, Yishay Mansour

Paper Abstract


Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user types, and multiple arms with Bernoulli payoffs. Each (user type, arm) tuple corresponds to an (unknown) reward probability. Each user's type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then move forward to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of users.
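To make the setting concrete, below is a minimal simulation sketch in Python. It is an illustration of the departing-bandits setup described in the abstract, not the paper's algorithm: it assumes a single hidden user type, arms with Bernoulli payoffs, a generic UCB1-style index policy, and users who may leave after a zero-reward recommendation. The arm probabilities, departure probability, and horizon are all made-up values for the demo. The point it highlights is that the policy now controls not only the rewards but also how long each user stays.

```python
import numpy as np

# Illustrative sketch of the "departing bandits" setting (NOT the paper's code):
# one hidden user type, Bernoulli arms, a UCB1-style policy, and users who may
# depart after a dissatisfying (zero-reward) recommendation.

rng = np.random.default_rng(0)

K = 3                                  # number of arms
true_p = np.array([0.3, 0.6, 0.5])     # unknown Bernoulli reward probabilities (assumed)
depart_prob = 0.5                      # chance a dissatisfied user leaves (assumed)
T = 10_000                             # number of users (the T in the regret bound)

counts = np.zeros(K)                   # pulls per arm, pooled over all users
sums = np.zeros(K)                     # accumulated reward per arm
total_reward = 0.0
pulls = 0

def ucb_index(t, counts, sums):
    """UCB1-style index; arms never pulled get +inf so each is tried once."""
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = np.sqrt(2.0 * np.log(max(t, 2)) / counts)
    means = sums / np.maximum(counts, 1)
    return np.where(counts > 0, means + bonus, np.inf)

for user in range(T):
    departed = False
    while not departed:
        pulls += 1
        arm = int(np.argmax(ucb_index(pulls, counts, sums)))
        reward = rng.binomial(1, true_p[arm])      # Bernoulli payoff
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
        # Policy-dependent horizon: after a zero reward the user may leave
        # the system and never return, ending this user's interaction.
        if reward == 0 and rng.random() < depart_prob:
            departed = True

print(f"average reward per user: {total_reward / T:.3f}")
```

In this sketch, recommending a low-probability arm both costs reward and shortens the interaction, which is exactly the coupling between policy and horizon that distinguishes this setting from a standard multi-armed bandit. Handling multiple hidden user types, as the paper does, additionally requires inferring the type from responses before committing to an arm.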
