Paper Title
Convex Regularization in Monte-Carlo Tree Search
Paper Authors
Paper Abstract
Monte-Carlo planning and Reinforcement Learning (RL) are essential to sequential decision making. The recent AlphaGo and AlphaZero algorithms have shown how to successfully combine these two paradigms to solve large-scale sequential decision problems. These methodologies exploit a variant of the well-known UCT algorithm to trade off exploitation of good actions and exploration of unvisited states, but their empirical success comes at the cost of poor sample efficiency and high computation time. In this paper, we overcome these limitations by considering convex regularization in Monte-Carlo Tree Search (MCTS), which has been successfully used in RL to efficiently drive exploration. First, we introduce a unifying theory on the use of generic convex regularizers in MCTS, deriving the regret analysis and providing guarantees of an exponential convergence rate. Second, we exploit our theoretical framework to introduce novel regularized backup operators for MCTS, based on the relative entropy of the policy update and on the Tsallis entropy of the policy. Finally, we empirically evaluate the proposed operators in AlphaGo and AlphaZero on problems of increasing dimensionality and branching factor, from a toy problem to several Atari games, showing their superiority with respect to representative baselines.
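To make the idea of a convex-regularized backup concrete, the following is a minimal sketch, not the paper's exact operators, of one common instance in which the regularizer is the negative Shannon entropy: the backed-up value becomes a temperature-scaled log-sum-exp of the action-value estimates and the induced policy is a softmax. The relative-entropy and Tsallis-entropy operators proposed in the paper follow the same pattern with different regularizers. The function name regularized_backup and the temperature parameter tau are illustrative assumptions.

import numpy as np

def regularized_backup(q_values, tau=0.1):
    # Illustrative maximum-entropy instance of a convex-regularized backup
    # (assumption: Shannon-entropy regularizer with temperature tau).
    q = np.asarray(q_values, dtype=float)
    # Numerically stable log-sum-exp: V = tau * log(sum_a exp(Q(a) / tau))
    q_max = q.max()
    value = q_max + tau * np.log(np.sum(np.exp((q - q_max) / tau)))
    # Induced policy: softmax over Q / tau, equivalently exp((Q - V) / tau)
    policy = np.exp((q - value) / tau)
    return value, policy

# Example: backup at a node with three estimated action values
value, policy = regularized_backup([1.0, 0.5, -0.2], tau=0.1)

In this sketch, lowering tau makes the backup approach the standard max backup and the policy approach greedy action selection, while a higher tau spreads probability mass across actions and thus encourages exploration.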