Paper Title

An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits

Paper Authors

Andrea Tirinzoni, Matteo Pirotta, Marcello Restelli, Alessandro Lazaric

Paper Abstract

In the contextual linear bandit setting, algorithms built on the optimism principle fail to exploit the structure of the problem and have been shown to be asymptotically suboptimal. In this paper, we follow recent approaches that derive asymptotically optimal algorithms from problem-dependent regret lower bounds, and we introduce a novel algorithm that improves over the state of the art along multiple dimensions. We build on a reformulation of the lower bound in which the context distribution and the exploration policy are decoupled, and we obtain an algorithm robust to unbalanced context distributions. Then, using an incremental primal-dual approach to solve the Lagrangian relaxation of the lower bound, we obtain a scalable and computationally efficient algorithm. Finally, we remove forced exploration and instead build on confidence intervals of the optimization problem to encourage a minimum level of exploration that is better adapted to the problem structure. We prove the asymptotic optimality of our algorithm while providing both problem-dependent and worst-case finite-time regret guarantees. Our bounds scale with the logarithm of the number of arms, thus avoiding the linear dependence common to all related prior work. Notably, we establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits. Finally, we verify that our algorithm obtains better empirical performance than state-of-the-art baselines.
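
To make the primal-dual idea concrete, the sketch below is a minimal, illustrative solver (not the authors' actual algorithm, which handles contexts, estimated parameters, and confidence intervals) for the Lagrangian relaxation of the lower-bound optimization in the non-contextual special case. It assumes a known parameter vector theta and alternates an exponentiated-gradient descent step on the exploration allocation eta with a projected gradient-ascent step on the multipliers lam; all names, step sizes, and constants are assumptions made for the example.

```python
# Minimal primal-dual sketch (illustration only, assuming a *known* theta):
# minimize  sum_k eta[k] * gap[k]
# subject to ||x - x_star||^2_{A(eta)^{-1}} <= gap(x)^2 / 2 for all sub-optimal x,
# where A(eta) = sum_k eta[k] x_k x_k^T.
import numpy as np

rng = np.random.default_rng(0)
d, K = 3, 8
arms = rng.normal(size=(K, d))            # arm feature vectors x_1..x_K
theta = rng.normal(size=d)                # known parameter (illustration only)
rewards = arms @ theta
star = int(np.argmax(rewards))
gaps = rewards[star] - rewards            # sub-optimality gaps

eta = np.full(K, 1.0 / K)                 # primal variable: allocation on the simplex
lam = np.zeros(K)                         # dual variables: one multiplier per constraint
lr = 0.05

def grad_and_slack(eta, lam):
    """Gradient of the Lagrangian w.r.t. eta, and the constraint slacks."""
    A = np.einsum('k,ki,kj->ij', eta, arms, arms)      # design matrix A(eta)
    A_inv = np.linalg.inv(A + 1e-8 * np.eye(d))        # small ridge for stability
    diffs = arms - arms[star]                          # x - x_star
    slack = np.einsum('ki,ij,kj->k', diffs, A_inv, diffs) - gaps ** 2 / 2
    # d/d eta_k of ||x - x_star||^2_{A^{-1}} is -((x - x_star)^T A^{-1} x_k)^2
    M = diffs @ A_inv @ arms.T
    grad_eta = gaps - (lam[:, None] * M ** 2).sum(axis=0)
    return grad_eta, slack

for _ in range(5000):
    g, slack = grad_and_slack(eta, lam)
    g = np.clip(g, -50.0, 50.0)                 # numerical safety in this toy loop
    eta = eta * np.exp(-lr * g)                 # primal: exponentiated-gradient step...
    eta /= eta.sum()                            # ...projected back onto the simplex
    lam = np.maximum(0.0, lam + lr * slack)     # dual: projected ascent step

print("exploration allocation eta:", np.round(eta, 3))
```

In the paper, updates of this flavor are run incrementally alongside learning, with theta replaced by its estimate and the constraints relaxed via confidence intervals; the toy version above is only meant to convey the primal-dual mechanics.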
