Paper Title
Inapplicable Actions Learning for Knowledge Transfer in Reinforcement Learning
Paper Authors
Paper Abstract
Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every possible state implies that the agent must, while learning to maximize its reward, also learn to ignore irrelevant actions such as $\textit{inapplicable actions}$ (i.e., actions that have no effect on the environment when performed in a given state). Knowing this information can reduce the sample complexity of RL algorithms by masking inapplicable actions out of the policy distribution, so that exploration focuses only on actions relevant to finding an optimal policy. While this technique has long been formalized within the Automated Planning community through the concept of preconditions in the STRIPS language, RL algorithms have never formally taken advantage of this information to prune the search space to be explored; it is typically handled in an ad hoc manner, with hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introducing this knowledge into the algorithm. We (i) standardize the way such knowledge can be manually specified to the agent; and (ii) present a new framework to autonomously learn a partial action model, encapsulating the preconditions of actions, jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal to mask out irrelevant actions. Moreover, we demonstrate that, thanks to its transferability, the acquired knowledge can be reused in other tasks and domains to make the learning process more efficient.
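The central mechanism the abstract describes, removing inapplicable actions from the policy distribution before sampling, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a PyTorch categorical policy and a hypothetical boolean `applicable_mask` (e.g., produced by a hand-specified precondition or by the learned partial action model).

```python
# Illustrative sketch only: masking inapplicable actions out of a
# categorical policy distribution. Names such as `applicable_mask` and
# `masked_policy` are hypothetical, not taken from the paper.
import torch
from torch.distributions import Categorical


def masked_policy(policy_logits: torch.Tensor,
                  applicable_mask: torch.Tensor) -> Categorical:
    """Return a categorical distribution restricted to applicable actions.

    Inapplicable actions receive a logit of the dtype's minimum value, so
    their probability is effectively zero and they are never sampled
    during exploration.
    """
    neg_inf = torch.finfo(policy_logits.dtype).min
    masked_logits = torch.where(
        applicable_mask,
        policy_logits,
        torch.full_like(policy_logits, neg_inf),
    )
    return Categorical(logits=masked_logits)


# Usage: a state with 5 actions, where actions 1 and 3 are inapplicable.
logits = torch.randn(5)
mask = torch.tensor([True, False, True, False, True])
dist = masked_policy(logits, mask)
action = dist.sample()            # only actions 0, 2, or 4 can be drawn
log_prob = dist.log_prob(action)  # used as usual in the policy-gradient loss
```

The point of masking logits (rather than zeroing probabilities after the softmax) is that the remaining probability mass is renormalized automatically and the log-probabilities stay well defined for the policy-gradient update.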