Diversity-Enriched Option-Critic

Paper title

Diversity-Enriched Option-Critic

Authors

Anand Kamat, Doina Precup

Abstract

Temporal abstraction allows reinforcement learning agents to represent knowledge and develop strategies over different temporal scales. The option-critic framework has been shown to learn temporally extended actions, represented as options, end-to-end in a model-free setting. However, the feasibility of option-critic remains limited by two major challenges: multiple options adopting very similar behavior, or a shrinking set of task-relevant options. These occurrences not only void the need for temporal abstraction but also affect performance. In this paper, we tackle these problems by learning a diverse set of options. We introduce an information-theoretic intrinsic reward, which augments the task reward, as well as a novel termination objective, in order to encourage behavioral diversity in the option set. We show empirically that our proposed method is capable of learning options end-to-end on several discrete and continuous control tasks and outperforms option-critic by a wide margin. Furthermore, we show that our approach sustainably generates robust, reusable, reliable and interpretable options, in contrast to option-critic.
