Paper Title
Model-Based Offline Planning
Paper Authors
Paper Abstract
Offline learning is a key part of making reinforcement learning (RL) usable in real systems. Offline RL looks at scenarios where there is data from a system's operation, but no direct access to the system when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data, and with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and less easy to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This allows us to have easily controllable policies directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and create zero-shot goal-conditioned policies on a series of environments. An accompanying video can be found here: https://youtu.be/nxGGHdZOFts
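To make the "control the system directly through planning" idea concrete, below is a minimal sketch of planning with a learned dynamics model, written as a simple random-shooting model-predictive-control loop. This is not the paper's actual MBOP planner (which additionally uses a behaviour-cloning prior and a value function to guide and truncate rollouts); the names `DynamicsModel`, `reward_fn`, and `plan_action` are illustrative assumptions.

```python
import numpy as np

def plan_action(dynamics_model, reward_fn, state, action_dim,
                horizon=10, n_candidates=256,
                action_low=-1.0, action_high=1.0):
    """Return the first action of the candidate action sequence whose
    rollout through the learned model accumulates the highest predicted
    reward (receding-horizon control: execute one action, then re-plan)."""
    # Sample candidate action sequences uniformly at random.
    candidates = np.random.uniform(
        action_low, action_high,
        size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            # The learned model (trained purely from offline data)
            # predicts the next state from the current state and action.
            s_next = dynamics_model.predict(s, a)
            returns[i] += reward_fn(s, a, s_next)
            s = s_next
    best = np.argmax(returns)
    return candidates[best, 0]
```

Because the planner only queries the learned model, objectives and constraints can be changed at control time (e.g. swapping `reward_fn` for a goal-conditioned or constrained variant) without retraining, which is the controllability advantage the abstract contrasts against model-free policies.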