Paper Title
Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples
Paper Authors
Paper Abstract
Although deep reinforcement learning (RL) has surpassed human-level performance in various tasks, it still faces several fundamental challenges. First, most RL methods require intensive data from exploration of the environment to achieve satisfactory performance. Second, the use of neural networks in RL makes it hard to interpret the internals of the system in a way that humans can understand. To address these two challenges, we propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge to effectively guide its future exploration. Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm. We prove that in episodic RL, a finite reward automaton can express any non-Markovian bounded reward function with finitely many reward values and can approximate any non-Markovian bounded reward function (with infinitely many reward values) with arbitrary precision. We also provide a lower bound on the episode length such that the proposed RL approach almost surely converges to an optimal policy in the limit. We test this approach on two RL environments with non-Markovian reward functions, choosing a variety of tasks with increasing complexity for each environment. We compare our algorithm with state-of-the-art RL algorithms for non-Markovian reward functions, such as Joint Inference of Reward Machines and Policies for RL (JIRP), Learning Reward Machines (LRM), and Proximal Policy Optimization (PPO2). Our results show that our algorithm converges to an optimal policy faster than the other baseline methods.
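To make the notion of a finite reward automaton concrete, the following is a minimal illustrative Python sketch of such an automaton for a hypothetical two-step task (pick up coffee, then deliver it to the office). The class, state names, and labels are assumptions introduced here for illustration only; they are not taken from the paper's implementation or from the L* inference procedure itself.

```python
# Illustrative sketch of a finite reward automaton (FRA): a finite-state
# machine whose transitions are driven by high-level propositions (labels)
# observed in the environment and which emits rewards on those transitions.
# Task, state names, and labels below are hypothetical.

class FiniteRewardAutomaton:
    def __init__(self, states, initial_state, transitions, rewards):
        self.states = states                # finite set of automaton states
        self.initial_state = initial_state  # starting state
        self.transitions = transitions      # (state, label) -> next state
        self.rewards = rewards              # (state, label) -> reward value
        self.current = initial_state

    def reset(self):
        """Return to the initial state at the start of an episode."""
        self.current = self.initial_state

    def step(self, label):
        """Advance on an observed label and return the emitted reward."""
        key = (self.current, label)
        next_state = self.transitions.get(key, self.current)  # stay put if undefined
        reward = self.rewards.get(key, 0.0)
        self.current = next_state
        return reward


# Hypothetical task: observe "c" (coffee picked up), then "o" (office reached).
fra = FiniteRewardAutomaton(
    states={"u0", "u1", "u2"},
    initial_state="u0",
    transitions={("u0", "c"): "u1", ("u1", "o"): "u2"},
    rewards={("u1", "o"): 1.0},  # reward only on completing the full sequence
)

fra.reset()
print(fra.step("o"))  # 0.0: reaching the office before the coffee earns nothing
print(fra.step("c"))  # 0.0: coffee picked up, automaton moves to u1
print(fra.step("o"))  # 1.0: reward depends on the history of labels
```

The automaton state encodes the relevant label history, so a reward that is non-Markovian over raw environment states becomes Markovian once the agent conditions on the pair (environment state, automaton state); this is the high-level knowledge the abstract describes the agent distilling to guide exploration.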