Discriminator-Guided Model-Based Offline Imitation Learning

Paper Title

Discriminator-Guided Model-Based Offline Imitation Learning

Authors

Wenjia Zhang, Haoran Xu, Haoyi Niu, Peng Cheng, Ming Li, Heming Zhang, Guyue Zhou, Xianyuan Zhan

Abstract

Offline imitation learning (IL) is a powerful method to solve decision-making problems from expert demonstrations without reward labels. Existing offline IL methods suffer from severe performance degeneration under limited expert data. Including a learned dynamics model can potentially improve the state-action space coverage of expert data; however, it also faces challenging issues such as model approximation/generalization errors and suboptimality of rollout data. In this paper, we propose the Discriminator-guided Model-based offline Imitation Learning (DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations. DMIL adopts a novel cooperative-yet-adversarial learning strategy, which uses the discriminator to guide and couple the learning process of the policy and dynamics model, resulting in improved model performance and robustness. Our framework can also be extended to the case where demonstrations contain a large proportion of suboptimal data. Experimental results show that DMIL and its extension achieve superior performance and robustness compared to state-of-the-art offline IL methods on small datasets.
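
To make the discriminator idea in the abstract concrete, the following is a toy sketch only, not the DMIL architecture or training objective. It assumes a one-dimensional system with true dynamics s' = s + a, a hypothetical expert policy a = -0.5 s, and "model rollouts" that carry both policy noise (suboptimal actions) and dynamics noise (incorrect next states). A logistic-regression discriminator over hand-picked quadratic features of (s, a, s') is trained with binary cross-entropy to separate expert transitions from rollout transitions. All function names, features, and noise levels are illustrative assumptions, and the cooperative-yet-adversarial coupling of policy and dynamics model described above is not implemented here.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def features(s, a, s_next):
    # Assumed feature map: bias plus all linear and quadratic terms of (s, a, s').
    v = (s, a, s_next)
    quad = [v[i] * v[j] for i in range(3) for j in range(i, 3)]
    return [1.0, *v, *quad]

def score(w, transition):
    # Discriminator output: estimated probability that a transition is expert data.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(*transition))))

def train_discriminator(expert, rollouts, steps=300, lr=0.5):
    # Full-batch gradient descent on binary cross-entropy (expert = 1, rollout = 0).
    data = [(features(*t), 1.0) for t in expert]
    data += [(features(*t), 0.0) for t in rollouts]
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

rng = random.Random(0)  # fixed seed so the toy run is reproducible

def expert_transition():
    # Hypothetical expert: a = -0.5 * s under the true dynamics s' = s + a.
    s = rng.uniform(-1.0, 1.0)
    a = -0.5 * s
    return (s, a, s + a)

def rollout_transition():
    # Hypothetical model rollout: noisy action (suboptimality) and
    # noisy next state (dynamics error).
    s = rng.uniform(-1.0, 1.0)
    a = -0.5 * s + rng.gauss(0.0, 0.5)
    return (s, a, s + a + rng.gauss(0.0, 0.5))

expert = [expert_transition() for _ in range(100)]
rollouts = [rollout_transition() for _ in range(100)]
w = train_discriminator(expert, rollouts)

mean_expert = sum(score(w, t) for t in expert) / len(expert)
mean_rollout = sum(score(w, t) for t in rollouts) / len(rollouts)
print(f"mean expert score:  {mean_expert:.3f}")
print(f"mean rollout score: {mean_rollout:.3f}")
```

In this sketch a single score reacts to both kinds of error at once, which mirrors, in a very simplified way, the abstract's description of a discriminator that judges dynamics correctness and suboptimality together.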
