Paper Title

Continuous MDP Homomorphisms and Homomorphic Policy Gradient

Paper Authors

Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, Doina Precup

Paper Abstract

Abstraction has been widely studied as a way to improve the efficiency and generalization of reinforcement learning algorithms. In this paper, we study abstraction in the continuous-control setting. We extend the definition of MDP homomorphisms to encompass continuous actions in continuous state spaces. We derive a policy gradient theorem on the abstract MDP, which allows us to leverage approximate symmetries of the environment for policy optimization. Based on this theorem, we propose an actor-critic algorithm that is able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. We demonstrate the effectiveness of our method on benchmark tasks in the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance when learning from pixel observations.
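
For context, a minimal sketch of the MDP homomorphism conditions that the paper generalizes to continuous state and action spaces. The map h = (f, g_s), with f acting on states and g_s acting on actions, follows the standard discrete-case formulation and is not quoted from the paper:

\bar{R}\big(f(s),\, g_s(a)\big) = R(s, a) \qquad \text{(reward invariance)}

\bar{P}\big(f(s') \mid f(s),\, g_s(a)\big) = \sum_{s'' \in f^{-1}(f(s'))} P(s'' \mid s, a) \qquad \text{(transition equivariance)}

In the continuous setting the sum over the preimage becomes a pushforward of the transition measure under f; roughly, the policy gradient theorem described in the abstract is stated on the resulting abstract MDP, so that gradients computed there can drive optimization of the policy in the original environment.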
