Paper Title

Beyond variance reduction: Understanding the true impact of baselines on policy optimization

Authors

Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux

Abstract

Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that this is not the case for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs.
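To make the role of the baseline concrete, below is a minimal illustrative sketch (not the authors' code): REINFORCE with a scalar baseline on a softmax policy over a 3-armed bandit. The arm means, baseline value, and learning rate are hypothetical. Subtracting the baseline leaves the expected gradient unchanged, yet it changes every individual update, which is the kind of seemingly innocuous choice the paper shows can decide whether the algorithm converges.

```python
import numpy as np

# Minimal sketch, assuming a 3-armed Gaussian bandit with a softmax policy.
# The baseline b is subtracted from each sampled reward before the update.
rng = np.random.default_rng(0)
true_means = np.array([1.0, 0.7, 0.2])  # hypothetical arm reward means
theta = np.zeros(3)                     # softmax logits (policy parameters)
baseline = 0.5                          # try e.g. 0.0 or 0.9 and compare runs
lr = 0.1

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_means[a] + rng.normal(scale=0.5)  # noisy reward sample

    # Score-function (REINFORCE) gradient estimate:
    # for a softmax policy, grad_theta log pi(a) = e_a - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * (r - baseline) * grad_log_pi

print("final policy:", softmax(theta))
```

Running this with different baseline values changes the trajectory of the logits even though E[(r - b) * grad log pi] is the same for every b, mirroring the paper's point that variance reduction alone does not capture the baseline's effect on the learning dynamics.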
