Paper Title

Behavior Prior Representation learning for Offline Reinforcement Learning

Paper Authors

Hongyu Zang, Xin Li, Jie Yu, Chen Liu, Riashat Islam, Remi Tachet des Combes, Romain Laroche

Paper Abstract

Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on the pre-training of state representations, followed by policy training. In this work, we introduce a simple, yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR carries out performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds of the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks. The code is available at \url{https://github.com/bit1029public/offline_bpr}.
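The abstract describes a two-stage recipe: first pretrain a state encoder with a behavior-cloning objective on the offline dataset, then freeze the encoder and train a policy on the fixed representation with any off-the-shelf offline RL algorithm. The following is a minimal PyTorch-style sketch of that recipe; the network sizes, the `Encoder`/`BCHead` modules, and the training loop are illustrative assumptions rather than the paper's actual implementation (see the linked repository for that).

```python
# Minimal sketch of the two-stage BPR recipe described in the abstract.
# Hypothetical architectures and hyperparameters; the official code at
# https://github.com/bit1029public/offline_bpr may differ in details.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """State encoder: maps raw observations to a compact representation."""
    def __init__(self, obs_dim, repr_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, repr_dim), nn.ReLU(),
        )

    def forward(self, obs):
        return self.net(obs)


class BCHead(nn.Module):
    """Action predictor used only during the behavior-cloning pretraining stage."""
    def __init__(self, repr_dim, act_dim):
        super().__init__()
        self.net = nn.Linear(repr_dim, act_dim)

    def forward(self, z):
        return self.net(z)


def pretrain_representation(encoder, bc_head, dataset, epochs=10, lr=3e-4):
    """Stage 1: learn the representation by mimicking dataset actions (BC loss)."""
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(bc_head.parameters()), lr=lr
    )
    for _ in range(epochs):
        for obs, act in dataset:  # batches of (observations, actions)
            loss = ((bc_head(encoder(obs)) - act) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Freeze the encoder so the downstream offline RL stage trains on a fixed representation.
    for p in encoder.parameters():
        p.requires_grad_(False)
    return encoder


if __name__ == "__main__":
    # Toy offline dataset: 8 batches of 32 transitions (17-dim obs, 6-dim actions).
    obs_dim, act_dim = 17, 6
    data = [(torch.randn(32, obs_dim), torch.randn(32, act_dim)) for _ in range(8)]
    encoder = pretrain_representation(Encoder(obs_dim), BCHead(64, act_dim), data)
    # Stage 2 (not shown): feed encoder(obs) instead of raw obs to any off-the-shelf
    # offline RL algorithm (e.g., TD3+BC or CQL) and train the policy on top.
    print(encoder(torch.randn(1, obs_dim)).shape)  # torch.Size([1, 64])
```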
