Paper Title
Zeroth-order Deterministic Policy Gradient
Paper Authors
Paper Abstract
Deterministic Policy Gradient (DPG) removes a level of randomness from standard randomized-action Policy Gradient (PG), and demonstrates substantial empirical success for tackling complex dynamic problems involving Markov decision processes. At the same time, though, DPG loses its ability to learn in a model-free (i.e., actor-only) fashion, frequently necessitating the use of critics in order to obtain consistent estimates of the associated policy-reward gradient. In this work, we introduce Zeroth-order Deterministic Policy Gradient (ZDPG), which approximates policy-reward gradients via two-point stochastic evaluations of the $Q$-function, constructed by properly designed low-dimensional action-space perturbations. Exploiting the idea of random horizon rollouts for obtaining unbiased estimates of the $Q$-function, ZDPG lifts the dependence on critics and restores true model-free policy learning, while enjoying built-in and provable algorithmic stability. Additionally, we present new finite sample complexity bounds for ZDPG, which improve upon existing results by up to two orders of magnitude. Our findings are supported by several numerical experiments, which showcase the effectiveness of ZDPG in a practical setting, and its advantages over both PG and Baseline PG.
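To make the two ingredients named in the abstract concrete, the following toy Python sketch illustrates (i) an unbiased $Q$-function estimate obtained from a random-horizon rollout and (ii) a two-point zeroth-order gradient estimate built from an action-space perturbation. This is a minimal sketch under illustrative assumptions, not the authors' implementation: the gym-style `reset`/`step` interface, the linear policy, the `make_env_at` helper (which is assumed to clone the environment at a given state), and all hyperparameter names are placeholders.

```python
# Sketch of the two ZDPG ingredients described in the abstract (illustrative only).
import numpy as np


def rollout_q_estimate(env, policy, first_action, gamma, rng):
    """Unbiased estimate of Q(s, a): roll out for a Geometric(1 - gamma) random
    horizon starting from the current environment state, applying `first_action`
    first and the deterministic policy afterwards, and sum the rewards."""
    horizon = rng.geometric(1.0 - gamma)          # random horizon ~ Geometric(1 - gamma)
    total, action = 0.0, first_action
    for _ in range(horizon):
        state, reward, done, _ = env.step(action) # assumed gym-style transition
        total += reward
        if done:
            break
        action = policy(state)                    # follow the deterministic policy
    return total


def zdpg_gradient(make_env_at, theta, state, gamma, mu, rng):
    """Two-point zeroth-order estimate of the policy-reward gradient at `state`
    for a linear deterministic policy a = theta @ s (an illustrative choice)."""
    policy = lambda s: theta @ s
    u = rng.standard_normal(theta.shape[0])       # perturbation in the action space
    u /= np.linalg.norm(u)                        # unit perturbation direction
    a_plus = policy(state) + mu * u
    a_minus = policy(state) - mu * u
    # Two independent random-horizon rollouts from copies of the same state
    # (make_env_at is a hypothetical helper that resets an env copy to `state`).
    q_plus = rollout_q_estimate(make_env_at(state), policy, a_plus, gamma, rng)
    q_minus = rollout_q_estimate(make_env_at(state), policy, a_minus, gamma, rng)
    # Finite-difference estimate of the directional derivative of Q in direction u,
    # pushed back through the policy Jacobian (for a linear policy, the Jacobian
    # contracts with the state, giving an outer product).
    scale = (q_plus - q_minus) / (2.0 * mu)
    return scale * np.outer(u, state)             # same shape as theta
```

In a training loop, one would repeatedly sample a state, call `zdpg_gradient`, and take a (possibly averaged) ascent step on `theta`; no critic network is needed, since the $Q$-values are estimated directly from rollouts.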