Paper Title

Minimax Regret for Cascading Bandits

Paper Authors

Daniel Vial, Sujay Sanghavi, Sanjay Shakkottai, R. Srikant

Paper Abstract

Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art.
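The abstract's central point, that Bernstein-type confidence radii shrink with the variance while Hoeffding radii do not, is easy to see numerically for Bernoulli means. The sketch below compares the standard Hoeffding radius with the empirical-Bernstein radius of Maurer and Pontil (2009) for simulated click data; these are generic textbook bounds rather than the paper's exact confidence sets, and the sample size, failure probability, and click-through rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hoeffding_radius(n, delta):
    """Hoeffding radius for a [0,1]-bounded mean: ignores the variance entirely."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def empirical_bernstein_radius(samples, delta):
    """Empirical-Bernstein radius (Maurer & Pontil, 2009): scales with the sample variance."""
    n = len(samples)
    var_hat = samples.var(ddof=1)
    return np.sqrt(2.0 * var_hat * np.log(3.0 / delta) / n) + 3.0 * np.log(3.0 / delta) / n

n, delta = 10_000, 0.05
for p in [0.5, 0.05, 0.005]:  # smaller p mimics the small click-through rates the abstract highlights
    clicks = rng.binomial(1, p, size=n).astype(float)
    print(f"p={p:5.3f}  hoeffding={hoeffding_radius(n, delta):.4f}  "
          f"bernstein={empirical_bernstein_radius(clicks, delta):.4f}")
```

For p = 0.5 the two radii are comparable, but as p shrinks the Bernstein radius contracts roughly like sqrt(p) (since a Bernoulli(p) has variance p(1-p)), while the Hoeffding radius stays fixed; this variance effect is what the paper's analysis exploits to separate variance-aware algorithms from Hoeffding-based ones.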
