Paper Title
Multi-Armed Bandits with Censored Consumption of Resources
Paper Authors
Paper Abstract
We consider a resource-aware variant of the classical multi-armed bandit problem: In each round, the learner selects an arm and determines a resource limit. It then observes a corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit. Otherwise, the observation is censored, i.e., no reward is obtained. For this problem setting, we introduce a measure of regret, which incorporates the actual amount of allocated resources of each learning round as well as the optimality of realizable rewards. Thus, to minimize regret, the learner needs to set a resource limit and choose an arm in such a way that the chance to realize a high reward within the predefined resource limit is high, while the resource limit itself should be kept as low as possible. We propose a UCB-inspired online learning algorithm, which we analyze theoretically in terms of its regret upper bound. In a simulation study, we show that our learning algorithm outperforms straightforward extensions of standard multi-armed bandit algorithms.
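To make the problem setting concrete, here is a minimal sketch in Python of the censored-feedback interaction loop, paired with a plain UCB1 learner over discretized (arm, limit) pairs. This is an illustration of the setting, not the authors' algorithm or regret definition: the `CensoredBanditEnv` class, the exponential consumption and Gaussian reward distributions, the fixed grid of candidate limits, and the resource-penalty weight of 0.5 are all assumptions made here for demonstration.

```python
import math
import random


class CensoredBanditEnv:
    """Toy environment: each arm has a random reward and a random resource
    consumption; a pull with limit tau is censored (no reward observed)
    whenever the consumed amount exceeds tau."""

    def __init__(self, reward_means, consumption_means):
        self.reward_means = reward_means
        self.consumption_means = consumption_means

    def pull(self, arm, tau):
        # Assumed exponential consumption with the given mean.
        consumed = random.expovariate(1.0 / self.consumption_means[arm])
        if consumed <= tau:
            # Assumed Gaussian reward; observed only within the limit.
            return random.gauss(self.reward_means[arm], 0.1), consumed
        return 0.0, tau  # censored round: full limit spent, no reward


def ucb_censored(env, n_arms, limits, horizon):
    """Treat every (arm, limit) pair as a meta-arm and run vanilla UCB1 on
    a resource-penalized payoff (reward minus a cost per unit of limit)."""
    pairs = [(a, tau) for a in range(n_arms) for tau in limits]
    counts = {p: 0 for p in pairs}
    totals = {p: 0.0 for p in pairs}

    for t in range(1, horizon + 1):
        untried = [p for p in pairs if counts[p] == 0]
        if untried:
            pair = untried[0]  # initialization: play each pair once
        else:
            pair = max(pairs, key=lambda p: totals[p] / counts[p]
                       + math.sqrt(2.0 * math.log(t) / counts[p]))
        arm, tau = pair
        reward, _ = env.pull(arm, tau)
        counts[pair] += 1
        totals[pair] += reward - 0.5 * tau  # assumed trade-off weight 0.5

    # Return the empirically best (arm, limit) pair found.
    return max(pairs, key=lambda p: totals[p] / counts[p])


env = CensoredBanditEnv(reward_means=[1.0, 0.8], consumption_means=[2.0, 0.4])
print(ucb_censored(env, n_arms=2, limits=[0.5, 1.0, 2.0], horizon=5000))
```

Treating each (arm, limit) combination as its own meta-arm is exactly the kind of straightforward extension of a standard bandit algorithm that the abstract uses as a baseline; the paper's proposed method is designed to improve on this.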