Paper Title
Your Policy Regularizer is Secretly an Adversary
Paper Authors
Paper Abstract
Policy regularization methods such as maximum entropy regularization are widely used in reinforcement learning to improve the robustness of a learned policy. In this paper, we show how this robustness arises from hedging against worst-case perturbations of the reward function, which are chosen from a limited set by an imagined adversary. Using convex duality, we characterize this robust set of adversarial reward perturbations under KL and alpha-divergence regularization, which includes Shannon and Tsallis entropy regularization as special cases. Importantly, generalization guarantees can be given within this robust set. We provide a detailed discussion of the worst-case reward perturbations, and present intuitive empirical examples to illustrate this robustness and its relationship with generalization. Finally, we discuss how our analysis complements and extends previous results on adversarial reward robustness and path consistency optimality conditions.
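To make the duality referenced in the abstract concrete, here is a minimal single-step (bandit) sketch based on standard convex conjugacy of the KL divergence; the notation ($\beta$ for the inverse regularization strength, $\pi_0$ for the reference policy, $\Delta r$ for the reward perturbation, $\mathcal{R}_\beta$ for the adversary's feasible set) is chosen here for illustration and need not match the paper's own derivation or multi-step setting:

% Sketch only: single-step KL-regularized objective as a worst case over
% reward perturbations; beta, pi_0, Delta r, R_beta are illustrative notation.
\begin{align*}
\mathbb{E}_{a \sim \pi}\!\big[r(a)\big] - \tfrac{1}{\beta}\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_0\big)
  \;=\; \min_{\Delta r \in \mathcal{R}_\beta} \; \mathbb{E}_{a \sim \pi}\!\big[r(a) - \Delta r(a)\big],
\qquad
\mathcal{R}_\beta \;=\; \Big\{ \Delta r \;:\; \mathbb{E}_{a \sim \pi_0}\!\big[e^{\beta\, \Delta r(a)}\big] \le 1 \Big\},
\end{align*}
% The inner minimum is attained at the worst-case perturbation
% Delta r*(a) = (1/beta) log( pi(a) / pi_0(a) ), and maximizing both sides over pi
% recovers the soft (log-partition) value (1/beta) log sum_a pi_0(a) exp(beta r(a)).

In this sketch the adversary may lower the reward of actions the agent favors, but only up to an exponential budget measured under the reference policy $\pi_0$; with a uniform $\pi_0$, the KL term reduces (up to a constant) to negative Shannon entropy, recovering the maximum-entropy special case mentioned above.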