Paper Title
Fairness and Welfare Quantification for Regret in Multi-Armed Bandits
Paper Authors
Paper Abstract
We extend the notion of regret with a welfarist perspective. Focusing on the classic multi-armed bandit (MAB) framework, the current work quantifies the performance of bandit algorithms by applying a fundamental welfare function, namely the Nash social welfare (NSW) function. This corresponds to equating the algorithm's performance to the geometric mean of its expected rewards and leads us to the study of Nash regret, defined as the difference between the -- a priori unknown -- optimal mean (among the arms) and the algorithm's performance. Since NSW is known to satisfy fairness axioms, our approach complements the utilitarian considerations of average (cumulative) regret, wherein the algorithm is evaluated via the arithmetic mean of its expected rewards. This work develops an algorithm that, given the horizon of play $T$, achieves a Nash regret of $O \left( \sqrt{\frac{k \log T}{T}} \right)$, where $k$ denotes the number of arms in the MAB instance. Since, for any algorithm, the Nash regret is at least as much as its average regret (by the AM-GM inequality), the known lower bound on average regret holds for Nash regret as well. Therefore, our Nash regret guarantee is essentially tight. In addition, we develop an anytime algorithm with a Nash regret guarantee of $O \left( \sqrt{\frac{k\log T}{T}} \log T \right)$.
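To make the comparison in the abstract concrete, here is a sketch of the two regret notions under assumed notation (not spelled out in the abstract itself): $\mu_1, \ldots, \mu_k$ denote the unknown arm means, $\mu^* = \max_i \mu_i$ the optimal mean, and $I_t$ the arm pulled at round $t$ of the horizon $T$.

$$\text{Nash regret:}\quad \mathrm{NR}_T = \mu^* - \left( \prod_{t=1}^{T} \mathbb{E}\left[\mu_{I_t}\right] \right)^{1/T}, \qquad \text{average regret:}\quad \mathrm{AR}_T = \mu^* - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[\mu_{I_t}\right].$$

By the AM-GM inequality, $\left( \prod_{t=1}^{T} \mathbb{E}[\mu_{I_t}] \right)^{1/T} \le \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[\mu_{I_t}]$, so $\mathrm{NR}_T \ge \mathrm{AR}_T$ for every algorithm; this is the step that transfers the known lower bound on average regret to Nash regret and makes the $O \left( \sqrt{\frac{k \log T}{T}} \right)$ guarantee essentially tight.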