Paper Title
Regime Switching Bandits
Paper Authors
Paper Abstract
We study a multi-armed bandit problem where the rewards exhibit regime switching. Specifically, the distributions of the random rewards generated from all arms are modulated by a common underlying state modeled as a finite-state Markov chain. The agent does not observe the underlying state and has to learn the transition matrix and the reward distributions. We propose a learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models, belief error control in partially observable Markov decision processes, and upper-confidence-bound methods for online learning. We also establish an upper bound of $O(T^{2/3}\sqrt{\log T})$ on the regret of the proposed learning algorithm, where $T$ is the learning horizon. Finally, we conduct proof-of-concept experiments to illustrate the performance of the learning algorithm.
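
For intuition, the following is a minimal Python sketch of the environment described in the abstract: a hidden finite-state Markov chain modulates the Bernoulli reward means of all arms, and the learner tracks a belief over the hidden state. The transition matrix, reward means, and the greedy-in-the-belief policy are illustrative assumptions only; the paper's algorithm instead estimates the unknown quantities via spectral method-of-moments and adds upper-confidence bounds, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem parameters (illustrative only).
n_states = 2   # hidden regimes
n_arms = 3
T = 10_000     # learning horizon

# Transition matrix P[i, j] = Pr(next state = j | current state = i).
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])

# Bernoulli reward means: row = hidden state, column = arm.
mu = np.array([[0.9, 0.5, 0.2],
               [0.1, 0.4, 0.8]])

state = 0
belief = np.full(n_states, 1.0 / n_states)   # belief over the hidden state
total_reward = 0.0

for t in range(T):
    # Greedy-in-the-belief arm choice (a stand-in for the paper's
    # spectral-estimation plus upper-confidence-bound policy).
    arm = int(np.argmax(belief @ mu))

    # Environment: draw the reward, then let the hidden chain move.
    reward = float(rng.random() < mu[state, arm])
    total_reward += reward
    state = int(rng.choice(n_states, p=P[state]))

    # Belief update, assuming P and mu are known; the paper instead
    # learns them from data.
    likelihood = mu[:, arm] if reward == 1.0 else 1.0 - mu[:, arm]
    posterior = belief * likelihood
    belief = (posterior / posterior.sum()) @ P

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```

The belief recursion (Bayes update on the observed reward, then a one-step prediction through the transition matrix) is the standard POMDP filter whose estimation error the paper's analysis controls.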