Paper Title

Do You Need the Entropy Reward (in Practice)?

Authors

Haonan Yu, Haichao Zhang, Wei Xu

Abstract

Maximum entropy (MaxEnt) RL maximizes a combination of the original task reward and an entropy reward. It is believed that the regularization imposed by entropy, on both policy improvement and policy evaluation, together contributes to good exploration, training convergence, and robustness of learned policies. This paper takes a closer look at entropy as an intrinsic reward, by conducting various ablation studies on soft actor-critic (SAC), a popular representative of MaxEnt RL. Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation. On one hand, the entropy reward, like any other intrinsic reward, could obscure the main task reward if it is not properly managed. We identify some failure cases of the entropy reward, especially in episodic Markov decision processes (MDPs), where it could cause the policy to be overly optimistic or pessimistic. On the other hand, our large-scale empirical study shows that using entropy regularization alone in policy improvement leads to comparable or even better performance and robustness than using it in both policy improvement and policy evaluation. Based on these observations, we recommend either normalizing the entropy reward to a zero mean (SACZero), or simply removing it from policy evaluation (SACLite) for better practical results.
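For readers who want a concrete picture of the two recommendations, below is a minimal NumPy sketch of the critic (policy-evaluation) targets suggested by the abstract: standard SAC (entropy reward in the bootstrap target), SACLite (entropy reward removed from the target, kept only in the actor loss), and SACZero (entropy reward normalized to zero mean, here per batch). The function names, the `done` handling, and the per-batch normalization are illustrative assumptions, not code from the paper.

```python
# Sketch of the critic-target variants discussed in the abstract (assumptions,
# not the paper's implementation). Names like `critic_target_saclite` are
# illustrative placeholders.
import numpy as np

def critic_target_sac(r, q_next_min, logp_next, gamma=0.99, alpha=0.2, done=None):
    """Standard SAC: bootstrap with the entropy reward -alpha * log pi(a'|s')."""
    done = np.zeros_like(r) if done is None else done
    return r + gamma * (1.0 - done) * (q_next_min - alpha * logp_next)

def critic_target_saclite(r, q_next_min, gamma=0.99, done=None):
    """SACLite: drop the entropy reward from policy evaluation entirely;
    entropy regularization remains only in policy improvement (the actor loss)."""
    done = np.zeros_like(r) if done is None else done
    return r + gamma * (1.0 - done) * q_next_min

def critic_target_saczero(r, q_next_min, logp_next, gamma=0.99, alpha=0.2, done=None):
    """SACZero: keep the entropy reward but shift it to zero mean (here over the
    batch), so it cannot systematically inflate or deflate value estimates."""
    done = np.zeros_like(r) if done is None else done
    ent_reward = -alpha * logp_next
    ent_reward = ent_reward - ent_reward.mean()  # zero-mean normalization (per batch)
    return r + gamma * (1.0 - done) * (q_next_min + ent_reward)

# Tiny usage example with random batch data.
rng = np.random.default_rng(0)
r = rng.normal(size=256)           # task rewards
q_next_min = rng.normal(size=256)  # min over twin target critics at (s', a')
logp_next = rng.normal(size=256)   # log pi(a'|s') for a' sampled from the policy
print(critic_target_sac(r, q_next_min, logp_next)[:3])
print(critic_target_saclite(r, q_next_min)[:3])
print(critic_target_saczero(r, q_next_min, logp_next)[:3])
```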
