Title
Ballooning Multi-Armed Bandits
Authors
Abstract
In this paper, we introduce Ballooning Multi-Armed Bandits (BL-MAB), a novel extension of the classical stochastic MAB model. In the BL-MAB model, the set of available arms grows (or balloons) over time. In contrast to the classical MAB setting where the regret is computed with respect to the best arm overall, the regret in a BL-MAB setting is computed with respect to the best available arm at each time. We first observe that the existing stochastic MAB algorithms result in linear regret for the BL-MAB model. We prove that, if the best arm is equally likely to arrive at any time instant, a sub-linear regret cannot be achieved. Next, we show that if the best arm is more likely to arrive in the early rounds, one can achieve sub-linear regret. Our proposed algorithm determines (1) the fraction of the time horizon for which the newly arriving arms should be explored and (2) the sequence of arm pulls in the exploitation phase from among the explored arms. Making reasonable assumptions on the arrival distribution of the best arm in terms of the thinness of the distribution's tail, we prove that the proposed algorithm achieves sub-linear instance-independent regret. We further quantify the explicit dependence of the regret on the arrival distribution parameters. We reinforce our theoretical findings with extensive simulation results. We conclude by showing that our algorithm would achieve sub-linear regret even if (a) the distributional parameters are not exactly known, but are obtained using a reasonable learning mechanism, or (b) the best arm is not more likely to arrive early, but a large fraction of arms is likely to arrive relatively early.
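The abstract's two-part scheme (explore newly arriving arms for a fraction of the horizon, then exploit among the explored arms) can be illustrated with a minimal simulation. The sketch below is an illustrative explore-then-commit variant under assumed Bernoulli rewards, not the paper's exact algorithm; the function name `simulate_bl_mab`, the `explore_frac` parameter, and the arrival/mean lists are all hypothetical choices for this example.

```python
import random

def simulate_bl_mab(T, arrival_times, true_means, explore_frac=0.3, seed=0):
    """Illustrative explore-then-commit sketch for a ballooning bandit.

    arrival_times[i]: round (1..T) at which arm i becomes available.
    true_means[i]: Bernoulli reward mean of arm i (assumed reward model).
    explore_frac: fraction of the horizon devoted to exploring new arrivals.
    """
    rng = random.Random(seed)
    tau = int(explore_frac * T)              # end of the exploration phase
    n = len(true_means)
    pulls = [0] * n
    reward_sum = [0.0] * n
    total_reward = 0.0
    for t in range(1, T + 1):
        available = [i for i in range(n) if arrival_times[i] <= t]
        if not available:
            continue                         # no arm has arrived yet
        if t <= tau:
            # Exploration: pull the least-pulled arm among those available.
            arm = min(available, key=lambda i: pulls[i])
        else:
            # Exploitation: restrict to arms explored before tau and pull
            # the one with the best empirical mean.
            explored = [i for i in available
                        if pulls[i] > 0 and arrival_times[i] <= tau]
            pool = explored or available     # fallback if nothing explored
            arm = max(pool, key=lambda i: reward_sum[i] / max(pulls[i], 1))
        r = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        reward_sum[arm] += r
        total_reward += r
    return total_reward, pulls
```

Note that, consistent with the abstract, arms arriving after the exploration window are never pulled in the exploitation phase; the quality of this cutoff is exactly what the arrival distribution of the best arm governs.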