Paper Title

Generalizing distribution of partial rewards for multi-armed bandits with temporally-partitioned rewards

Authors

Ronald C. van den Broek, Rik Litjens, Tobias Sagis, Luc Siecker, Nina Verbeeke, Pratik Gajane

Abstract

We investigate the Multi-Armed Bandit problem with Temporally-Partitioned Rewards (TP-MAB). In the TP-MAB setting, an agent receives subsets of an arm's reward over multiple rounds rather than the entire reward all at once. We introduce a general formulation of how an arm's cumulative reward is distributed across several rounds, called the Beta-spread property. Such a generalization is needed to handle partitioned rewards in which the maximum reward per round is not distributed uniformly across rounds. We derive a lower bound for the TP-MAB problem under the assumption that the Beta-spread property holds. Moreover, we provide an algorithm, TP-UCB-FR-G, which uses the Beta-spread property to improve the regret upper bound in some scenarios. By generalizing how the cumulative reward is distributed, this setting is applicable to a broader range of applications.
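
To make the temporally-partitioned setting concrete, the following is a minimal sketch of how one pull's cumulative reward might arrive in pieces over several rounds. The function names `split_reward` and `observed_per_round` and the fixed `weights` vector are assumptions made for illustration only; they are not the paper's definition of the Beta-spread property and not part of TP-UCB-FR-G.

```python
from typing import List, Sequence


def split_reward(total: float, weights: Sequence[float]) -> List[float]:
    """Split one pull's cumulative reward into per-round partial rewards.

    `weights` is a hypothetical, illustrative spreading profile: entry k is
    the relative share of the reward that arrives k rounds after the pull.
    """
    if not weights or any(w < 0 for w in weights):
        raise ValueError("weights must be non-empty and non-negative")
    norm = sum(weights)
    if norm <= 0:
        raise ValueError("weights must not all be zero")
    return [total * w / norm for w in weights]


def observed_per_round(pulls: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Reward the agent observes in each round.

    pulls[t] is the cumulative reward of the arm pulled in round t; its
    partial rewards arrive in rounds t, t+1, ..., t+len(weights)-1.
    """
    horizon = len(pulls) + len(weights) - 1
    observed = [0.0] * horizon
    for t, total in enumerate(pulls):
        for delay, part in enumerate(split_reward(total, weights)):
            observed[t + delay] += part
    return observed


if __name__ == "__main__":
    # Uniform spreading versus a front-loaded (non-uniform) profile.
    print(observed_per_round([8.0, 8.0], [1, 1, 1, 1]))
    print(observed_per_round([8.0, 8.0], [4, 2, 1, 1]))
```

With the front-loaded profile, most of each pull's reward is observed shortly after the pull, which is the kind of non-uniform distribution across rounds that the abstract says the generalization is meant to cover.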
