Generalizing distribution of partial rewards for multi-armed bandits with temporally-partitioned rewards

Paper title

Generalizing distribution of partial rewards for multi-armed bandits with temporally-partitioned rewards

Authors

Ronald C. van den Broek, Rik Litjens, Tobias Sagis, Luc Siecker, Nina Verbeeke, Pratik Gajane

Abstract

We investigate the Multi-Armed Bandit problem with Temporally-Partitioned Rewards (TP-MAB). In the TP-MAB setting, an agent receives subsets of an arm's reward over multiple rounds rather than the entire reward all at once. We introduce a general formulation of how an arm's cumulative reward is distributed across several rounds, called the Beta-spread property. Such a generalization is needed to handle partitioned rewards in which the maximum reward per round is not distributed uniformly across rounds. We derive a lower bound for the TP-MAB problem under the assumption that the Beta-spread property holds. Moreover, we provide an algorithm, TP-UCB-FR-G, which uses the Beta-spread property to improve the regret upper bound in some scenarios. By generalizing how the cumulative reward is distributed, this setting applies to a broader range of applications.
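To make the setting concrete, the following is a minimal sketch of temporally-partitioned reward delivery only. It does not implement the Beta-spread property or TP-UCB-FR-G, whose definitions are not given in the abstract. The function names, the `weights` vector that controls the non-uniform split, the Bernoulli rewards, and the round-robin policy are all illustrative assumptions.

```python
import random


def split_reward(total, weights):
    """Split one pull's cumulative reward into partial rewards.

    `weights` is a hypothetical, non-negative profile; a non-uniform
    profile means the partial rewards are not spread evenly over rounds.
    """
    norm = sum(weights)
    return [total * w / norm for w in weights]


def simulate(arm_means, weights, horizon, seed=0):
    """Pull arms for `horizon` rounds and record what arrives in each round.

    Assumptions: Bernoulli cumulative rewards and a round-robin policy,
    used only as placeholders for a real bandit algorithm.
    """
    rng = random.Random(seed)
    span = len(weights)
    # arrivals[t] is the sum of partial rewards observed in round t.
    arrivals = [0.0] * (horizon + span)
    generated = 0.0
    for t in range(horizon):
        arm = t % len(arm_means)
        total = 1.0 if rng.random() < arm_means[arm] else 0.0
        generated += total
        # The reward of the pull at round t arrives over rounds t .. t+span-1.
        for offset, part in enumerate(split_reward(total, weights)):
            arrivals[t + offset] += part
    return arrivals, generated


if __name__ == "__main__":
    arrivals, generated = simulate([0.2, 0.8], [3, 2, 1], horizon=50)
    observed = sum(arrivals[:50])
    print(f"generated={generated:.2f}, observed within horizon={observed:.2f}")
```

At the end of the horizon the agent has observed less than the total reward it generated, because the last pulls have not paid out in full. That gap is what a learner in this setting has to account for.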
