Paper Title
Ensuring Monotonic Policy Improvement in Entropy-regularized Value-based Reinforcement Learning
Paper Authors
Paper Abstract
This paper aims to establish an entropy-regularized value-based reinforcement learning method that ensures the monotonic improvement of the policy at each policy update. Unlike previously proposed lower bounds on policy improvement in general infinite-horizon MDPs, we derive an entropy-regularization-aware lower bound. Since our bound only requires the expected policy advantage function to be estimated, it is scalable to large-scale (continuous) state-space problems. We propose a novel reinforcement learning algorithm that exploits this lower bound as a criterion for adjusting the degree of each policy update, thereby alleviating policy oscillation. We demonstrate the effectiveness of our approach on both a discrete-state maze task and a continuous-state inverted pendulum task, using a linear function approximator for value estimation.
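To make the mechanism concrete, the sketch below illustrates (under assumptions, not as the paper's actual algorithm) how an estimated expected policy advantage can drive the degree of a policy update for an entropy-regularized (Boltzmann) policy over Q-values. The function names, the conservative-policy-iteration-style placeholder used in `improvement_lower_bound`, and the grid search over the mixing coefficient are our own illustrative choices; the paper's entropy-regularization-aware bound is not reproduced here.

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Boltzmann (entropy-regularized) policy over action values Q(s, a)."""
    z = beta * (q_values - q_values.max(axis=-1, keepdims=True))
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def expected_policy_advantage(q, pi_new, pi_old, state_dist):
    """Estimate E_{s ~ d}[ sum_a (pi_new(a|s) - pi_old(a|s)) * Q^{pi_old}(s, a) ]."""
    per_state_adv = ((pi_new - pi_old) * q).sum(axis=-1)
    return float((state_dist * per_state_adv).sum())

def improvement_lower_bound(exp_adv, alpha, gamma, eps):
    """Placeholder lower bound on the improvement of the mixture policy
    (1 - alpha) * pi_old + alpha * pi_new; a CPI-style form is used here
    purely for illustration, not the paper's entropy-aware bound."""
    return alpha / (1.0 - gamma) * exp_adv \
        - 2.0 * alpha ** 2 * gamma * eps / (1.0 - gamma) ** 2

def choose_update_degree(exp_adv, gamma, eps, grid=np.linspace(0.0, 1.0, 101)):
    """Pick the mixing coefficient alpha that maximizes the lower bound;
    return 0 (skip the update) if no positive improvement is guaranteed."""
    bounds = [improvement_lower_bound(exp_adv, a, gamma, eps) for a in grid]
    best = int(np.argmax(bounds))
    return grid[best] if bounds[best] > 0.0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(5, 3))           # Q-values for 5 states, 3 actions (toy data)
    d = np.full(5, 0.2)                    # uniform state distribution
    pi_old = softmax_policy(q, beta=1.0)   # current entropy-regularized policy
    pi_new = softmax_policy(q, beta=5.0)   # greedier candidate policy
    exp_adv = expected_policy_advantage(q, pi_new, pi_old, d)
    alpha = choose_update_degree(exp_adv, gamma=0.95, eps=0.1)
    print(f"expected advantage = {exp_adv:.4f}, chosen update degree = {alpha:.2f}")
```

The point of the sketch is the control flow: because the bound depends only on the expected policy advantage (a quantity estimable from samples), the update degree can be tuned so that the guaranteed improvement stays non-negative, which is how policy oscillation is suppressed.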