Paper Title

The Phenomenon of Policy Churn

Paper Authors

Tom Schaul, André Barreto, John Quan, Georg Ostrovski

Paper Abstract

We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts $ε$-greedy exploration in a fresh light, namely that $ε$-noise plays a much smaller role than expected.
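As a rough illustration of the quantity the abstract refers to: given Q-values evaluated on a fixed batch of states before and after a few learning updates, policy churn can be read as the fraction of those states whose greedy (argmax) action changed. The sketch below is a minimal, hypothetical set-up; the function name `policy_churn`, the state count, and the 18-action set (roughly the full Atari action set) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def policy_churn(q_before: np.ndarray, q_after: np.ndarray) -> float:
    """Fraction of states whose greedy action changed between two
    snapshots of Q-values, each of shape [num_states, num_actions]."""
    greedy_before = q_before.argmax(axis=1)
    greedy_after = q_after.argmax(axis=1)
    return float(np.mean(greedy_before != greedy_after))

# Illustrative example: random Q-values for 1000 held-out states and
# 18 actions, perturbed slightly to stand in for "a few gradient updates".
rng = np.random.default_rng(0)
q_t = rng.normal(size=(1000, 18))
q_t_plus_k = q_t + 0.1 * rng.normal(size=(1000, 18))
print(f"churn: {policy_churn(q_t, q_t_plus_k):.3f}")
```

In an actual DQN-style experiment, the two snapshots would come from evaluating the Q-network on the same held-out states before and after a small number of learning updates, rather than from synthetic perturbations as here.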
