Paper Title
Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning
Paper Authors
Paper Abstract
This paper aims to establish an entropy-regularized value-based reinforcement learning method that ensures the monotonic improvement of the policy at each policy update. Unlike previously proposed lower bounds on policy improvement in general infinite-horizon MDPs, we derive an entropy-regularization-aware lower bound. Since our bound only requires the expected policy advantage function to be estimated, it is scalable to large-scale (continuous) state-space problems. We propose a novel reinforcement learning algorithm that exploits this lower bound as a criterion for adjusting the degree of each policy update, thereby alleviating policy oscillation. We demonstrate the effectiveness of our approach on both a discrete-state maze task and a continuous-state inverted pendulum task, using a linear function approximator for value estimation.
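To make the mechanism concrete, the sketch below illustrates (under assumptions, not as the paper's actual algorithm) how an estimated expected policy advantage can drive the degree of a policy update for an entropy-regularized (Boltzmann) policy over Q-values. The function names, the conservative-policy-iteration-style placeholder used in `improvement_lower_bound`, and the grid search over the mixing coefficient are our own illustrative choices; the paper's entropy-regularization-aware bound is not reproduced here.

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Boltzmann (entropy-regularized) policy over action values Q(s, a)."""
    z = beta * (q_values - q_values.max(axis=-1, keepdims=True))
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def expected_policy_advantage(q, pi_new, pi_old, state_dist):
    """Estimate E_{s ~ d}[ sum_a (pi_new(a|s) - pi_old(a|s)) * Q^{pi_old}(s, a) ]."""
    per_state_adv = ((pi_new - pi_old) * q).sum(axis=-1)
    return float((state_dist * per_state_adv).sum())

def improvement_lower_bound(exp_adv, alpha, gamma, eps):
    """Placeholder lower bound on the improvement of the mixture policy
    (1 - alpha) * pi_old + alpha * pi_new; a CPI-style form is used here
    purely for illustration, not the paper's entropy-aware bound."""
    return alpha / (1.0 - gamma) * exp_adv \
        - 2.0 * alpha ** 2 * gamma * eps / (1.0 - gamma) ** 2

def choose_update_degree(exp_adv, gamma, eps, grid=np.linspace(0.0, 1.0, 101)):
    """Pick the mixing coefficient alpha that maximizes the lower bound;
    return 0 (skip the update) if no positive improvement is guaranteed."""
    bounds = [improvement_lower_bound(exp_adv, a, gamma, eps) for a in grid]
    best = int(np.argmax(bounds))
    return grid[best] if bounds[best] > 0.0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(5, 3))           # Q-values for 5 states, 3 actions (toy data)
    d = np.full(5, 0.2)                    # uniform state distribution
    pi_old = softmax_policy(q, beta=1.0)   # current entropy-regularized policy
    pi_new = softmax_policy(q, beta=5.0)   # greedier candidate policy
    exp_adv = expected_policy_advantage(q, pi_new, pi_old, d)
    alpha = choose_update_degree(exp_adv, gamma=0.95, eps=0.1)
    print(f"expected advantage = {exp_adv:.4f}, chosen update degree = {alpha:.2f}")
```

The point of the sketch is the control flow: because the bound depends only on the expected policy advantage (a quantity estimable from samples), the update degree can be tuned so that the guaranteed improvement stays non-negative, which is how policy oscillation is suppressed.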