Paper Title

MOReL: Model-Based Offline Reinforcement Learning

Paper Authors

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, Thorsten Joachims

Paper Abstract

In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline can greatly expand the applicability of RL, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; and (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g. generative modeling, uncertainty estimation, planning etc.) to directly translate into advances for offline RL.
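
To make the two-step framework concrete, the sketch below illustrates step (a) as described in the abstract: a pessimistic MDP (P-MDP) built from an ensemble of dynamics models learned on the offline dataset, which redirects out-of-support state-action pairs to an absorbing HALT state with a large negative reward. The ensemble-disagreement test, the names `disagreement_threshold` and `kappa`, and the `m.predict` interface are assumptions of this sketch, not details taken from the authors' implementation.

```python
import numpy as np

# Minimal sketch of MOReL's step (a): a pessimistic MDP built from an
# ensemble of dynamics models fit to the offline dataset. Names and the
# pairwise-disagreement test are illustrative assumptions for this sketch.

HALT = "HALT"  # absorbing state entered whenever the ensemble disagrees

class PessimisticMDP:
    def __init__(self, models, reward_fn, disagreement_threshold, kappa):
        self.models = models            # ensemble of learned dynamics models
        self.reward_fn = reward_fn      # learned (or known) reward function
        self.threshold = disagreement_threshold
        self.kappa = kappa              # penalty for leaving the data support

    def step(self, state, action):
        # Once in HALT, stay in HALT and keep paying the pessimism penalty.
        if state is HALT:
            return HALT, -self.kappa
        # Query every model in the ensemble for the predicted next state.
        preds = [m.predict(state, action) for m in self.models]
        # Maximum pairwise disagreement acts as an uncertainty estimate.
        disagreement = max(
            np.linalg.norm(p - q) for p in preds for q in preds
        )
        # Unknown state-action pair: transition to HALT with a negative reward.
        if disagreement > self.threshold:
            return HALT, -self.kappa
        # Known region: behave like an ordinary learned dynamics model.
        return preds[0], self.reward_fn(state, action)
```

Step (b) then amounts to running any standard planner or policy optimizer entirely inside this surrogate; because real-environment performance is approximately lower-bounded by P-MDP performance, policy improvement in the surrogate cannot come from model exploitation.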
