Paper Title

Minimax Regret for Cascading Bandits

Paper Authors

Daniel Vial, Sujay Sanghavi, Sanjay Shakkottai, R. Srikant

Paper Abstract

Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where the variance-aware algorithms only improve constants. In light of this and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show its regret strictly improves the state-of-the-art.
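The abstract's central point, that Bernstein-type confidence radii shrink with the variance while Hoeffding radii do not, is easy to see numerically for Bernoulli means. The sketch below compares the standard Hoeffding radius with the empirical-Bernstein radius of Maurer and Pontil (2009) for simulated click data; these are generic textbook bounds rather than the paper's exact confidence sets, and the sample size, failure probability, and click-through rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hoeffding_radius(n, delta):
    """Hoeffding radius for a [0,1]-bounded mean: ignores the variance entirely."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def empirical_bernstein_radius(samples, delta):
    """Empirical-Bernstein radius (Maurer & Pontil, 2009): scales with the sample variance."""
    n = len(samples)
    var_hat = samples.var(ddof=1)
    return np.sqrt(2.0 * var_hat * np.log(3.0 / delta) / n) + 3.0 * np.log(3.0 / delta) / n

n, delta = 10_000, 0.05
for p in [0.5, 0.05, 0.005]:  # smaller p mimics the small click-through rates the abstract highlights
    clicks = rng.binomial(1, p, size=n).astype(float)
    print(f"p={p:5.3f}  hoeffding={hoeffding_radius(n, delta):.4f}  "
          f"bernstein={empirical_bernstein_radius(clicks, delta):.4f}")
```

For p = 0.5 the two radii are comparable, but as p shrinks the Bernstein radius contracts roughly like sqrt(p) (since a Bernoulli(p) has variance p(1-p)), while the Hoeffding radius stays fixed; this variance effect is what the paper's analysis exploits to separate variance-aware algorithms from Hoeffding-based ones.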
