Paper Title

A Policy-Guided Imitation Approach for Offline Reinforcement Learning

Paper Authors

Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan

Paper Abstract

Offline reinforcement learning (RL) methods can generally be categorized into two types: RL-based and Imitation-based. RL-based methods could in principle enjoy out-of-distribution generalization but suffer from erroneous off-policy evaluation. Imitation-based methods avoid off-policy evaluation but are too conservative to surpass the dataset. In this study, we propose an alternative approach, inheriting the training stability of imitation-style methods while still allowing logical out-of-distribution generalization. We decompose the conventional reward-maximizing policy in offline RL into a guide-policy and an execute-policy. During training, the guide-policy and execute-policy are learned using only data from the dataset, in a supervised and decoupled manner. During evaluation, the guide-policy guides the execute-policy by telling it where it should go so that the reward can be maximized, serving as the \textit{Prophet}. By doing so, our algorithm allows \textit{state-compositionality} from the dataset, rather than the \textit{action-compositionality} conducted in prior imitation-style methods. We dub this new approach Policy-guided Offline RL (\texttt{POR}). \texttt{POR} demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline RL. We also highlight the benefits of \texttt{POR} in terms of improving with supplementary suboptimal data and easily adapting to new tasks by only changing the guide-policy.
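
To make the decomposition in the abstract concrete, below is a minimal PyTorch sketch of a guide-policy that proposes a target next state and an execute-policy that outputs the action to reach it, with both trained in a supervised, decoupled way on dataset transitions. The network sizes, the squared-error losses, and the per-transition `weight` (e.g., an exponentiated advantage estimate) are illustrative assumptions of this sketch, not the authors' exact training objectives from the paper.

```python
# Minimal sketch of the guide-policy / execute-policy decomposition described
# in the abstract. Architectures and losses are assumptions for illustration.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class GuidePolicy(nn.Module):
    """Predicts a promising next state s' given the current state s."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = mlp(state_dim, state_dim)

    def forward(self, state):
        return self.net(state)


class ExecutePolicy(nn.Module):
    """Predicts the action that moves the agent from s toward a target state s'."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = mlp(2 * state_dim, action_dim)

    def forward(self, state, target_state):
        return self.net(torch.cat([state, target_state], dim=-1))


def train_step(guide, execute, batch, weight, g_opt, e_opt):
    """One supervised, decoupled update on a batch of (s, a, s') transitions.

    `weight` is a per-transition importance weight (e.g., an exponentiated
    advantage) that biases the guide toward high-value next states; how it is
    computed is an assumption of this sketch.
    """
    s, a, s_next = batch["state"], batch["action"], batch["next_state"]

    # Guide-policy: weighted regression toward observed next states.
    g_loss = (weight * ((guide(s) - s_next) ** 2).sum(-1)).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # Execute-policy: behavior cloning conditioned on the reached next state.
    e_loss = ((execute(s, s_next) - a) ** 2).sum(-1).mean()
    e_opt.zero_grad()
    e_loss.backward()
    e_opt.step()


@torch.no_grad()
def act(guide, execute, state):
    """At evaluation time, the guide proposes where to go ("the Prophet");
    the execute-policy outputs the action to get there."""
    target = guide(state)
    return execute(state, target)
```

Because the two networks are trained separately, adapting to a new task can, in principle, amount to retraining only the guide-policy while reusing the execute-policy, which matches the adaptation benefit highlighted in the abstract.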
