Title

Useful Policy Invariant Shaping from Arbitrary Advice

Authors

Paniz Behboudian, Yash Satsangi, Matthew E. Taylor, Anna Harutyunyan, Michael Bowling

Abstract


Reinforcement learning (RL) is a powerful learning paradigm in which agents can learn to maximize sparse and delayed reward signals. Although RL has had many impressive successes in complex domains, learning can take hours, days, or even years of training data. A major challenge of contemporary RL research is to discover how to learn with less data. Previous work has shown that domain information can be successfully used to shape the reward; by adding additional reward information, the agent can learn with much less data. Furthermore, if the reward is constructed from a potential function, the optimal policy is guaranteed to be unaltered. While such potential-based reward shaping (PBRS) holds promise, it is limited by the need for a well-defined potential function. Ideally, we would like to be able to take arbitrary advice from a human or other agent and improve performance without affecting the optimal policy. The recently introduced dynamic potential-based advice (DPBA) method tackles this challenge by admitting arbitrary advice from a human or other agent, aiming to improve performance without affecting the optimal policy. The main contribution of this paper is to expose, theoretically and empirically, a flaw in DPBA. Alternatively, to achieve the ideal goals, we present a simple method called policy invariant explicit shaping (PIES) and show theoretically and empirically that PIES succeeds where DPBA fails.
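To make the PBRS idea concrete: the shaping bonus added to each transition is F(s, s') = γΦ(s') − Φ(s) for a potential function Φ over states, and adding this bonus to the environment reward provably leaves the optimal policy unchanged. The following is a minimal sketch, not code from the paper: it uses a hypothetical 5-state chain MDP, tabular Q-learning, and a made-up distance-based potential, purely to illustrate the shaping term.

```python
# Minimal sketch of potential-based reward shaping (PBRS), assuming a
# hypothetical 5-state chain MDP (states 0..4, reward 1 at state 4)
# and tabular epsilon-greedy Q-learning. Not the paper's code.
import random

N_STATES = 5
GAMMA = 0.99
ALPHA = 0.1
ACTIONS = (-1, +1)  # move left / move right

def potential(s):
    # Hypothetical potential: normalized progress toward the goal state.
    return s / (N_STATES - 1)

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def train(shaped, episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < 0.1:
                ai = rng.randrange(2)
            else:
                ai = max(range(2), key=lambda i: Q[s][i])
            s2, r, done = step(s, ACTIONS[ai])
            if shaped:
                # PBRS bonus: F(s, s') = gamma * phi(s') - phi(s).
                # Adding F leaves the optimal policy invariant.
                r += GAMMA * potential(s2) - potential(s)
            target = r + (0.0 if done else GAMMA * max(Q[s2]))
            Q[s][ai] += ALPHA * (target - Q[s][ai])
            s = s2
    return Q

Q_shaped = train(shaped=True)
# Greedy policy under shaping still moves right (action index 1) everywhere:
policy = [max(range(2), key=lambda i: Q_shaped[s][i]) for s in range(N_STATES - 1)]
print(policy)
```

The shaped agent receives dense feedback on every step yet ends up with the same greedy policy the unshaped reward defines, which is the invariance guarantee the abstract refers to; DPBA's flaw, per the paper, is that its dynamic variant of this construction does not actually preserve that guarantee.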
