Paper Title
Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms
Paper Authors
Paper Abstract
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
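To make the contrast in the abstract concrete, below is a minimal single-state sketch, assuming a tabular softmax policy: a standard softmax policy-gradient step scales each logit's change by $\pi(a)$, so a near-deterministic policy adjusts its unlearnt actions very slowly, whereas a cross-entropy step toward the action maximizing $q$ does not carry that factor. The function names, the one-state setting, and the learning rate are illustrative assumptions, and the paper's modified update that avoids the value-decrease flaw is not reproduced here.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

# Illustrative tabular setting: `theta` holds the softmax logits for one state,
# `q` holds the critic's action-value estimates for that state.
def policy_gradient_step(theta, q, lr=0.1):
    """Softmax policy-gradient step: each logit moves by pi(a) * (q(a) - v),
    so updates vanish for actions the policy has already (wrongly) suppressed."""
    pi = softmax(theta)
    v = pi @ q  # state value under the current policy
    return theta + lr * pi * (q - v)

def cross_entropy_step(theta, q, lr=0.1):
    """Cross-entropy step toward the greedy action argmax_a q(a):
    descending -log pi(a*) moves the logits by (one_hot(a*) - pi),
    which stays non-negligible until pi concentrates on a*."""
    pi = softmax(theta)
    target = np.zeros_like(pi)
    target[np.argmax(q)] = 1.0
    return theta + lr * (target - pi)

# Example usage: a policy that is nearly deterministic on a suboptimal action.
theta = np.array([5.0, 0.0, 0.0])   # strongly prefers action 0
q = np.array([0.0, 1.0, 0.5])       # critic now says action 1 is best
print(softmax(policy_gradient_step(theta, q)))  # barely moves
print(softmax(cross_entropy_step(theta, q)))    # shifts mass toward action 1
```

Running the two steps from the same starting point illustrates the unlearning-speed gap the abstract describes; the paper's analysis, convergence rate, and the safeguard against value decrease are developed in the full text.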