随机匪徒的结构自适应算法

论文标题

随机匪徒的结构自适应算法

Structure Adaptive Algorithms for Stochastic Bandits

论文作者

Degenne, Rémy, Shao, Han, Koolen, Wouter M.

论文摘要

我们研究了一系列结构化的随机多臂匪徒问题中的奖励最大化，其中平均武器的奖励满足了一些给定的结构约束，例如线性，单峰，稀疏等。我们的目的是开发灵活的方法（因为它们很容易适应了不同的结构），功能强大（它们在经验上和/或证明符合实例依赖性下限）并有效，因为每轮计算负担很小。我们使用迭代的鞍点求解器从实例依赖性下限中开发了渐近的最佳算法。我们的方法概括了纯探索以奖励最大化的最新迭代方法，在这种挑战中，估计了次要差距及其倒数。我们仍然设法实现了上述所有Desiderata。值得注意的是，我们的技术避免了以前工作所采用的成熟鞍角甲骨文的计算成本，同时又可以实现有限的遗憾界限。我们的实验表明，我们的方法成功地利用了结构假设，而遗憾的是与Vanilla UCB相当的最坏情况。

We study reward maximisation in a wide class of structured stochastic multi-armed bandit problems, where the mean rewards of arms satisfy some given structural constraints, e.g. linear, unimodal, sparse, etc. Our aim is to develop methods that are flexible (in that they easily adapt to different structures), powerful (in that they perform well empirically and/or provably match instance-dependent lower bounds) and efficient in that the per-round computational burden is small. We develop asymptotically optimal algorithms from instance-dependent lower-bounds using iterative saddle-point solvers. Our approach generalises recent iterative methods for pure exploration to reward maximisation, where a major challenge arises from the estimation of the sub-optimality gaps and their reciprocals. Still we manage to achieve all the above desiderata. Notably, our technique avoids the computational cost of the full-blown saddle point oracle employed by previous work, while at the same time enabling finite-time regret bounds. Our experiments reveal that our method successfully leverages the structural assumptions, while its regret is at worst comparable to that of vanilla UCB.

下载PDF全文

下载文献需遵守相关版权规定

论文标题