Paper Title
Continuous Control for Searching and Planning with a Learned Model
Paper Authors
Abstract
Decision-making agents with planning capabilities have achieved great success in challenging domains such as Chess, Shogi, and Go. In an effort to generalize this planning ability to more general tasks, where the environment dynamics are not available to the agent, researchers proposed the MuZero algorithm, which learns a dynamics model through interaction with the environment. In this paper, we present a method, together with the necessary theoretical results, for extending the MuZero algorithm to the broader class of environments with continuous action spaces. Through numerical results on two relatively low-dimensional MuJoCo environments, we show that the proposed algorithm outperforms soft actor-critic (SAC), a state-of-the-art model-free deep reinforcement learning algorithm.
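A central obstacle the abstract alludes to is that MuZero's tree search expands one child per discrete action, which does not directly apply when actions are continuous. A common way to make the search tractable is to sample a finite set of candidate actions from a learned stochastic policy and expand only those. The sketch below illustrates this general idea only; the diagonal Gaussian parameterization, the `tanh` squashing, and the function name are illustrative assumptions, not necessarily the construction used in the paper.

```python
import numpy as np

def sample_candidate_actions(mean, log_std, num_samples, rng):
    """Illustrative sketch: draw a finite candidate-action set from a
    diagonal Gaussian policy head so that tree search over a continuous
    action space reduces to a discrete branching factor.

    mean, log_std : arrays of shape (action_dim,) from the policy network
    num_samples   : number of children to expand at this search node
    rng           : numpy random Generator
    """
    std = np.exp(log_std)
    # Broadcast (action_dim,) mean/std over num_samples independent draws.
    raw = rng.normal(mean, std, size=(num_samples, mean.shape[0]))
    # Squash samples into the bounded action range [-1, 1], as is common
    # for MuJoCo-style control tasks (an assumed convention here).
    return np.tanh(raw)

rng = np.random.default_rng(0)
candidates = sample_candidate_actions(
    mean=np.zeros(2), log_std=np.log(0.5) * np.ones(2),
    num_samples=8, rng=rng)
```

Each sampled action would then be scored and selected among by the usual MCTS statistics, so the rest of the MuZero machinery can remain unchanged.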