A Temporal-Difference Approach to Policy Gradient Estimation

Paper title

A Temporal-Difference Approach to Policy Gradient Estimation

Authors

Samuele Tosatto, Andrew Patterson, Martha White, A. Rupam Mahmood

Abstract

The policy gradient theorem (Sutton et al., 2000) prescribes the use of a cumulative discounted state distribution under the target policy to approximate the gradient. In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach that reconstructs the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be estimated recursively thanks to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in the presence of off-policy samples.
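
The recursion mentioned in the abstract can be illustrated with a short derivation. This is a sketch under assumed notation, not necessarily the paper's exact formulation: $Q^{\pi_\theta}$ is the action-value function, $\Gamma$ is our label for its gradient with respect to the policy parameters $\theta$ (the "gradient critic"), $\mu_0$ is the start-state distribution, $p$ is the transition kernel, and $\gamma$ is the discount factor. The reward and transitions are assumed not to depend on $\theta$.

```latex
\begin{align}
% Standard Bellman equation for the action-value function
Q^{\pi_\theta}(s,a) &= r(s,a) + \gamma\,
  \mathbb{E}_{s' \sim p(\cdot \mid s,a),\; a' \sim \pi_\theta(\cdot \mid s')}
  \big[ Q^{\pi_\theta}(s',a') \big] \\
% Differentiating both sides in \theta gives a Bellman-style recursion for the gradient
\Gamma(s,a) := \nabla_\theta Q^{\pi_\theta}(s,a) &= \gamma\,
  \mathbb{E}_{s',\,a'}
  \big[ Q^{\pi_\theta}(s',a')\,\nabla_\theta \log \pi_\theta(a' \mid s')
        + \Gamma(s',a') \big] \\
% The policy gradient then only needs an expectation over the start state
\nabla_\theta J(\theta) &=
  \mathbb{E}_{s_0 \sim \mu_0,\; a_0 \sim \pi_\theta(\cdot \mid s_0)}
  \big[ Q^{\pi_\theta}(s_0,a_0)\,\nabla_\theta \log \pi_\theta(a_0 \mid s_0)
        + \Gamma(s_0,a_0) \big]
\end{align}
```

Because the second line has the same fixed-point form as an ordinary Bellman equation, $\Gamma$ can in principle be learned with temporal-difference updates from transitions collected by any behavior policy, and the last line needs samples only from the start-state distribution rather than from the discounted state distribution of the target policy.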
