Paper Title


Plannable Approximations to MDP Homomorphisms: Equivariance under Actions

Paper Authors

Elise van der Pol, Thomas Kipf, Frans A. Oliehoek, Max Welling

Paper Abstract


This work exploits action equivariance for representation learning in reinforcement learning. Equivariance under actions states that transitions in the input space are mirrored by equivalent transitions in latent space, while the map and transition functions should also commute. We introduce a contrastive loss function that enforces action equivariance on the learned representations. We prove that when our loss is zero, we have a homomorphism of a deterministic Markov Decision Process (MDP). Learning equivariant maps leads to structured latent spaces, allowing us to build a model on which we plan through value iteration. We show experimentally that for deterministic MDPs, the optimal policy in the abstract MDP can be successfully lifted to the original MDP. Moreover, the approach easily adapts to changes in the goal states. Empirically, we show that in such MDPs, we obtain better representations in fewer epochs compared to representation learning approaches using reconstructions, while generalizing better to new goals than model-free approaches.
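As a rough illustrative sketch (not taken verbatim from the paper), the action-equivariance condition and a contrastive loss of the kind described in the abstract can be written as follows. Here $Z$ denotes the learned encoder, $T$ and $\bar{T}$ the transition functions in input and latent space, $d$ a distance in latent space, $\epsilon$ a margin, and $s^-$ a negative sample; these symbols and the exact form of the loss are assumptions for illustration and may differ from the authors' formulation.

```latex
% Equivariance under actions: the encoder Z commutes with the
% transition functions T (input space) and \bar{T} (latent space).
\[
  Z\big(T(s, a)\big) \;=\; \bar{T}\big(Z(s), a\big)
  \qquad \text{for all } s, a .
\]

% A schematic contrastive loss enforcing this condition: pull the
% embedding of the observed next state s' towards the predicted latent
% next state, and push the embedding of a randomly sampled negative
% state s^- at least a margin \epsilon away from that prediction.
\[
  \mathcal{L} \;=\;
  d\!\big(Z(s'),\, \bar{T}(Z(s), a)\big)
  \;+\;
  \max\!\Big(0,\; \epsilon - d\big(Z(s^-),\, \bar{T}(Z(s), a)\big)\Big).
\]
```

When such a loss reaches zero on a deterministic MDP, the latent transition model is consistent with the true dynamics, which is what allows value iteration to be run directly in the learned latent space and the resulting policy to be lifted back to the original MDP.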
