Paper Title

A Farewell to Arms: Sequential Reward Maximization on a Budget with a Giving Up Option

Paper Authors

P. Sharoff, Nishant A. Mehta, Ravi Ganti

Paper Abstract

We consider a sequential decision-making problem where an agent can take one action at a time and each action has a stochastic temporal extent, i.e., a new action cannot be taken until the previous one is finished. Upon completion, the chosen action yields a stochastic reward. The agent seeks to maximize its cumulative reward over a finite time budget, with the option of "giving up" on the current action -- hence forfeiting any reward -- in order to choose another action. We cast this problem as a variant of the stochastic multi-armed bandit problem with stochastic resource consumption. For this problem, we first establish that the optimal arm is the one that maximizes the ratio of the arm's expected reward to the expected waiting time before the agent sees the reward from pulling that arm. Using a novel upper confidence bound on this ratio, we then introduce an upper confidence bound-based algorithm, WAIT-UCB, for which we establish a logarithmic, problem-dependent regret bound with an improved dependence on problem parameters compared to previous works. Simulations on various problem configurations comparing WAIT-UCB against state-of-the-art algorithms are also presented.
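To make the selection rule described in the abstract concrete, the sketch below scores each arm by an optimistic estimate of its reward-to-waiting-time ratio and pulls the arm with the largest score. This is only an illustrative sketch: the `RatioUCB` class, the Hoeffding-style exploration bonus, and all variable names are assumptions introduced here, not the actual confidence bound or pseudocode of WAIT-UCB from the paper.

```python
import numpy as np

class RatioUCB:
    """Illustrative ratio-based UCB index: score each arm by an optimistic
    estimate of (expected reward) / (expected waiting time).

    The exploration bonus below is a generic Hoeffding-style term chosen for
    illustration; the paper's WAIT-UCB confidence bound may differ.
    """

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.pulls = np.zeros(n_arms)       # completed pulls per arm
        self.reward_sum = np.zeros(n_arms)  # cumulative observed reward per arm
        self.wait_sum = np.zeros(n_arms)    # cumulative waiting time per arm

    def select_arm(self, t):
        # Pull each arm once before relying on the index.
        for k in range(self.n_arms):
            if self.pulls[k] == 0:
                return k
        mean_reward = self.reward_sum / self.pulls
        mean_wait = self.wait_sum / self.pulls
        bonus = np.sqrt(2.0 * np.log(t) / self.pulls)  # assumed Hoeffding-style bonus
        # Optimistic ratio: inflate the reward estimate and deflate the
        # waiting-time estimate (floored at a small positive value).
        index = (mean_reward + bonus) / np.maximum(mean_wait - bonus, 1e-6)
        return int(np.argmax(index))

    def update(self, arm, reward, waiting_time):
        # Record the observed reward and the time spent waiting for it.
        self.pulls[arm] += 1
        self.reward_sum[arm] += reward
        self.wait_sum[arm] += waiting_time
```

The point of inflating the reward estimate while deflating the waiting-time estimate is to be optimistic about the ratio itself, mirroring the abstract's statement that the optimal arm maximizes expected reward per unit of expected waiting time.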
