Paper Title

Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning

Paper Authors

Lingwei Zhu, Takamitsu Matsubara

Paper Abstract

This paper aims to establish an entropy-regularized value-based reinforcement learning method that ensures monotonic policy improvement at each policy update. Unlike previously proposed lower bounds on policy improvement in general infinite-horizon MDPs, we derive an entropy-regularization-aware lower bound. Since our bound only requires the expected policy advantage function to be estimated, it is scalable to large-scale (continuous) state-space problems. We propose a novel reinforcement learning algorithm that exploits this lower bound as a criterion for adjusting the degree of a policy update, thereby alleviating policy oscillation. We demonstrate the effectiveness of our approach on both a discrete-state maze and a continuous-state inverted-pendulum task, using a linear function approximator for value estimation.
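
To make the abstract's idea concrete, below is a minimal, hedged sketch of a conservative policy update driven by a lower bound on improvement. It assumes a tabular setting with a stationary state distribution `d_pi`, an estimated advantage table `adv`, and a quadratic-in-alpha bound in the style of conservative policy iteration; the function names, the penalty constant, and the bound itself are illustrative placeholders, not the entropy-regularization-aware bound derived in the paper.

```python
import numpy as np

# Sketch of a "safe" policy update: mix the target (e.g. softmax-greedy) policy
# with the current one, choosing the mixing coefficient from a lower bound on
# improvement that depends on the estimated expected policy advantage.
# The penalty constant follows a classic conservative-policy-iteration-style
# bound and is an assumption, not the paper's derived constant.

def expected_policy_advantage(d_pi, pi_target, adv):
    """E_{s~d_pi, a~pi_target}[A_pi(s, a)] for tabular d_pi (S,), pi_target (S, A), adv (S, A)."""
    return float(np.sum(d_pi[:, None] * pi_target * adv))

def safe_mixing_coefficient(policy_adv, penalty):
    """Maximize the quadratic lower bound  alpha * policy_adv - 0.5 * alpha**2 * penalty
    over alpha in [0, 1]; the maximizer is policy_adv / penalty, clipped to [0, 1]."""
    if policy_adv <= 0.0:          # no estimated improvement -> keep the current policy
        return 0.0
    return float(np.clip(policy_adv / penalty, 0.0, 1.0))

def conservative_update(pi_old, pi_target, d_pi, adv, gamma=0.99):
    """Return the mixed policy (1 - alpha) * pi_old + alpha * pi_target and alpha."""
    a_bound = expected_policy_advantage(d_pi, pi_target, adv)
    # crude penalty scale; a placeholder for whatever constant the chosen bound prescribes
    penalty = 2.0 * gamma * np.max(np.abs(adv)) / (1.0 - gamma) ** 2
    alpha = safe_mixing_coefficient(a_bound, penalty)
    return (1.0 - alpha) * pi_old + alpha * pi_target, alpha
```

The mixing coefficient alpha plays the role of the "degree of a policy update" mentioned in the abstract: when the estimated expected policy advantage is small or negative, the update shrinks toward the current policy, which is the mechanism that mitigates policy oscillation.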
