Paper title
A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits
Paper authors
Paper abstract
We study the stochastic multi-armed bandit problem and design new policies that enjoy both worst-case optimality for the expected regret and light-tailed risk for the regret distribution. Specifically, our policy design (i) attains worst-case optimality for the expected regret, of order $O(\sqrt{KT\ln T})$, and (ii) has a worst-case tail probability of incurring a regret larger than any $x>0$ that is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$, a rate that we prove to be the best achievable with respect to $T$ among all worst-case optimal policies. Compared to standard confidence-bound-based policies, our proposed policy strikes a delicate balance between exploring more at the beginning of the time horizon and exploiting more as the end approaches. We also extend the policy design to the "any-time" setting, where $T$ is unknown a priori, and prove policy performance equivalent to that in the "fixed-time" setting with known $T$. Numerical experiments illustrate the theoretical findings. We find that, from a managerial perspective, our new policy design yields better tail distributions and is preferable to celebrated policies, especially when (i) there is a risk of underestimating the volatility profile, or (ii) tuning the policy hyper-parameters is challenging. We conclude by extending the proposed policy design to the stochastic linear bandit setting, where it achieves both worst-case optimality in terms of expected regret and light-tailed risk for the regret distribution.
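To make the notion of a regret tail probability concrete, the sketch below estimates $P(\text{regret} > x)$ by simulation for thresholds $x$ on the $\sqrt{KT}$ scale used in the abstract. It is only an illustration under stated assumptions: the policy is a textbook UCB1-style index rule, not the policy proposed in the paper, and the arm means, noise level, horizon, and number of runs are hypothetical values chosen for the example.

```python
import math
import random


def run_ucb(means, horizon, rng, sigma=1.0):
    """Run a textbook UCB1-style policy once and return its pseudo-regret.

    This is a standard confidence-bound baseline, NOT the paper's policy.
    """
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialise
        else:
            # empirical mean plus a confidence bonus
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + sigma * math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret


def empirical_tail(regrets, x):
    """Fraction of simulated runs whose regret exceeds x."""
    return sum(r > x for r in regrets) / len(regrets)


rng = random.Random(0)          # fixed seed for reproducibility
means = [0.5, 0.3]              # hypothetical arm means
T, runs = 500, 200              # hypothetical horizon and number of runs
regrets = [run_ucb(means, T, rng) for _ in range(runs)]
scale = math.sqrt(len(means) * T)   # the sqrt(KT) scale from the abstract
for c in (0.5, 1.0, 2.0):
    print(c, empirical_tail(regrets, c * scale))
```

A light-tailed policy in the sense of the abstract would show these empirical frequencies decaying roughly exponentially in $x/\sqrt{KT}$; the simulation here only shows how such a tail could be measured, and says nothing about the rates proved in the paper.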
