Paper Title

Learn to Exceed: Stereo Inverse Reinforcement Learning with Concurrent Policy Optimization

Paper Authors

Feng Tao, Yongcan Cao

Paper Abstract

In this paper, we study the problem of obtaining a control policy that can mimic and then outperform expert demonstrations in Markov decision processes where the reward function is unknown to the learning agent. One closely related approach is inverse reinforcement learning (IRL), which focuses mainly on inferring a reward function from expert demonstrations. However, the control policies obtained by IRL and its associated algorithms can hardly outperform the expert demonstrations. To overcome this limitation, we propose a novel method that enables the learning agent to outperform the demonstrator via a new concurrent reward and action policy learning approach. In particular, we first propose a new stereo utility definition that aims to address the bias in the interpretation of expert demonstrations. We then propose a loss function for the learning agent to learn the reward and action policies concurrently, such that the learning agent can outperform the expert demonstrations. The performance of the proposed method is first demonstrated in OpenAI Gym environments. The method is then further validated experimentally in an indoor drone flight scenario.
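The abstract describes a loop that updates a learned reward and an action policy concurrently, so the policy can keep improving under the inferred reward instead of stopping at imitation. Below is a minimal sketch of what such a loop might look like. The toy random-walk environment, the linear reward model, and the margin-style reward loss are all placeholder assumptions for illustration; they are not the paper's stereo utility definition or loss function.

```python
# Illustrative sketch of concurrent reward and policy learning.
# NOTE: the environment, feature dimensions, and losses below are
# placeholder assumptions, not the algorithm from the paper.

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, HORIZON = 4, 2, 20

reward_net = nn.Linear(STATE_DIM, 1)                  # learned reward r_theta(s)
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.Tanh(),
                           nn.Linear(32, N_ACTIONS))  # action logits
r_opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
p_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def rollout(policy):
    """Roll out the policy in a stand-in random-walk environment."""
    s = torch.zeros(STATE_DIM)
    states, logps = [], []
    for _ in range(HORIZON):
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()
        logps.append(dist.log_prob(a))
        # Toy dynamics: drift plus a small action-dependent push.
        s = s + 0.1 * torch.randn(STATE_DIM) + 0.05 * (a.float() - 0.5)
        states.append(s)
    return torch.stack(states), torch.stack(logps)

# A fixed batch of "expert" demonstration states (random placeholder data).
demo_states = torch.randn(HORIZON, STATE_DIM)

for step in range(200):
    agent_states, logps = rollout(policy_net)

    # Reward update (IRL-style step): fit a reward under which the
    # demonstrations score well relative to the current policy.
    u_demo = reward_net(demo_states).sum()
    u_agent = reward_net(agent_states.detach()).sum()
    r_loss = -(u_demo - u_agent)   # simple margin; the paper's loss differs
    r_opt.zero_grad(); r_loss.backward(); r_opt.step()

    # Policy update: REINFORCE on the (frozen) learned reward, which is
    # what lets the policy keep improving past pure imitation.
    with torch.no_grad():
        returns = reward_net(agent_states).squeeze(-1)
    p_loss = -(logps * returns).sum()
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()
```

The sketch only illustrates the concurrent structure of the two updates; in the paper, the reward update is driven by the proposed stereo utility and loss function rather than the simple margin above.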
