具有应用程序的多臂强盗中的最大目标

论文标题

具有应用程序的多臂强盗中的最大目标

Maximal Objectives in the Multi-armed Bandit with Applications

论文作者

Ozbay, Eren, Kamble, Vijay

论文摘要

在随机多军强盗问题的几种应用中，最大化预期总奖励的传统目标是不合适的。在本文中，在在线平台上的某些操作问题的推动下，我们考虑了古典设置中的一个新目标。考虑到$ k $臂，而不是从$ t $ pulls（传统的“总和”目标）中最大化预期的总奖励，我们考虑了$ t $ pults结束时从$ k $ harm中获得的总奖励的矢量，并旨在最大化预期的总奖励总额的最高总奖励（“最大”目标）。对于这个目标，我们表明，任何策略都必须引起$ω（\ log t）$的实例依赖的渐近遗憾（与传统目标相比，依赖实例的常数更高），而最糟糕的遗憾则为$ω（k^{1/3} t^{t^{2/3}）$。然后，我们设计了一种自适应探索 - 然后是基于对平均奖励和自适应停止标准的置信界的探索的策略，以适应问题的难度并实现这些界限（达到对数因素）。然后，我们将算法见解概括为最大化最高$ M $臂的平均总奖励的预期价值，总奖励最高。我们的数值实验证明了与实际参数方案中几种天然替代方案相比，我们的政策有效性。我们讨论了这些新目标的应用程序，即在在线平台中培养足够的具有价值的市场参与者（工人/卖家/服务提供商）的问题。

In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, motivated by certain operational concerns in online platforms, we consider a new objective in the classical setup. Given $K$ arms, instead of maximizing the expected total reward from $T$ pulls (the traditional "sum" objective), we consider the vector of total rewards earned from each of the $K$ arms at the end of $T$ pulls and aim to maximize the expected highest total reward across arms (the "max" objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of $Ω(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and a worst-case regret of $Ω(K^{1/3}T^{2/3})$. We then design an adaptive explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). We then generalize our algorithmic insights to the problem of maximizing the expected value of the average total reward of the top $m$ arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared to several natural alternatives in practical parameter regimes. We discuss applications of these new objectives to the problem of grooming an adequate supply of value-providing market participants (workers/sellers/service providers) in online platforms.

下载PDF全文

下载文献需遵守相关版权规定

论文标题