Paper Title
Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits
Paper Authors
Paper Abstract
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is drawn from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, and Exponential. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS that simultaneously yields a finite-time regret bound and the asymptotic regret bound. In particular, for a $K$-armed bandit with exponential family rewards over a horizon $T$, ExpTS is sub-UCB (a strong problem-dependent criterion for finite-time regret), minimax optimal up to a factor of $\sqrt{\log K}$, and asymptotically optimal. Moreover, we propose ExpTS$^+$, which adds a greedy exploitation step on top of the sampling distribution used in ExpTS to avoid over-estimation of sub-optimal arms. ExpTS$^+$ is an anytime bandit algorithm that achieves minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple, and can be easily applied to analyze standard Thompson sampling with specific reward distributions.
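Since the abstract does not specify the paper's novel sampling distribution, the following is a minimal Python sketch of the generic Thompson-sampling loop with an optional greedy exploitation step in the spirit of ExpTS$^+$. The Gaussian posterior-style samples (the standard Gaussian-TS choice), the helper names (`ts_sketch`, `pull`), and the exploitation probability `p_greedy` are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def ts_sketch(K, T, pull, p_greedy=0.5, rng=None):
    """Thompson-sampling-style loop over K arms and horizon T.

    `pull(arm)` returns a real-valued reward. With probability `p_greedy`
    the algorithm exploits the empirically best arm (ExpTS^+-style greedy
    step, an illustrative assumption); otherwise it samples an index for
    each arm from a Gaussian centered at the empirical mean with variance
    shrinking as 1/n, a stand-in for the paper's novel sampling distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(K, dtype=int)   # number of pulls per arm
    means = np.zeros(K)               # empirical mean reward per arm
    for t in range(T):
        if t < K:
            arm = t                          # pull each arm once to initialize
        elif rng.random() < p_greedy:
            arm = int(np.argmax(means))      # greedy exploitation step
        else:
            # posterior-style sample per arm; wider for rarely pulled arms
            theta = rng.normal(means, 1.0 / np.sqrt(counts))
            arm = int(np.argmax(theta))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running mean update
    return means, counts

# Example usage on a toy 3-armed Gaussian bandit:
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])
means, counts = ts_sketch(K=3, T=2000, pull=lambda a: rng.normal(true_means[a], 1.0))
```

With the `1/sqrt(n)` sampling width, under-sampled arms receive wide, optimistic index distributions, which is the mechanism TS uses to keep exploring; the abstract's point is that the specific shape of this distribution (and the added greedy step in ExpTS$^+$) is what determines whether minimax and asymptotic optimality hold simultaneously.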