Title
Learning Sparse Rewarded Tasks from Sub-Optimal Demonstrations
Authors
Abstract
Model-free deep reinforcement learning (RL) has demonstrated its superiority on many complex sequential decision-making problems. However, heavy dependence on dense rewards and high sample complexity impede the wide adoption of these methods in real-world scenarios. On the other hand, imitation learning (IL) learns effectively in sparse-rewarded tasks by leveraging existing expert demonstrations. In practice, collecting a sufficient number of expert demonstrations can be prohibitively expensive, and the quality of the demonstrations typically limits the performance of the learned policy. In this work, we propose Self-Adaptive Imitation Learning (SAIL), which can achieve (near-)optimal performance on highly challenging sparse-reward tasks given only a limited number of sub-optimal demonstrations. SAIL combines the advantages of IL and RL to substantially reduce sample complexity, by effectively exploiting sub-optimal demonstrations and efficiently exploring the environment to surpass the demonstrated performance. Extensive empirical results show that SAIL not only significantly improves sample efficiency but also achieves much better final performance across different continuous control tasks, compared to the state-of-the-art.