Paper title
Contextual Bandits with Smooth Regret: Efficient Learning in Continuous Action Spaces
Paper authors
Paper abstract
Designing efficient general-purpose contextual bandit algorithms that work with large -- or even continuous -- action spaces would facilitate application to important scenarios such as information retrieval, recommendation systems, and continuous control. While obtaining standard regret guarantees can be hopeless, alternative regret notions have been proposed to tackle the large action setting. We propose a smooth regret notion for contextual bandits, which dominates previously proposed alternatives. We design a statistically and computationally efficient algorithm -- for the proposed smooth regret -- that works with general function approximation under standard supervised oracles. We also present an adaptive algorithm that automatically adapts to any smoothness level. Our algorithms can be used to recover the previous minimax/Pareto optimal guarantees under the standard regret, e.g., in bandit problems with multiple best arms and Lipschitz/Hölder bandits. We conduct large-scale empirical evaluations demonstrating the efficacy of our proposed algorithms.