Paper Title
Noisy Symbolic Abstractions for Deep RL: A case study with Reward Machines
Paper Authors
Paper Abstract
Natural and formal languages provide an effective mechanism for humans to specify instructions and reward functions. We investigate how to generate policies via RL when reward functions are specified in a symbolic language captured by Reward Machines, an increasingly popular automaton-inspired structure. We are interested in the case where the mapping of environment state to a symbolic (here, Reward Machine) vocabulary -- commonly known as the labelling function -- is uncertain from the perspective of the agent. We formulate the problem of policy learning in Reward Machines with noisy symbolic abstractions as a special class of POMDP optimization problem, and investigate several methods to address the problem, building on existing and new techniques, the latter focused on predicting Reward Machine state, rather than on grounding of individual symbols. We analyze these methods and evaluate them experimentally under varying degrees of uncertainty in the correct interpretation of the symbolic vocabulary. We verify the strength of our approach and the limitations of existing methods via an empirical investigation on both illustrative, toy domains and partially observable, deep RL domains.
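To make the problem setting concrete, below is a minimal Python sketch (not the authors' implementation) of a Reward Machine driven by a noisy labelling function. The class and function names (`RewardMachine`, `noisy_labelling_fn`), the "coffee"/"office" symbols, and the 0.9 detector accuracy are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a Reward Machine whose transitions are triggered by symbols
# emitted by a (possibly noisy) labelling function. All names and numbers here
# are illustrative assumptions.
import random


class RewardMachine:
    """Automaton over a symbolic vocabulary; emits reward on transitions."""

    def __init__(self, initial_state, transitions, rewards):
        # transitions: {(rm_state, symbol): next_rm_state}
        # rewards:     {(rm_state, symbol): float}
        self.state = initial_state
        self.transitions = transitions
        self.rewards = rewards

    def step(self, symbol):
        """Advance the machine on an observed symbol and return the emitted reward."""
        reward = self.rewards.get((self.state, symbol), 0.0)
        self.state = self.transitions.get((self.state, symbol), self.state)
        return reward


def noisy_labelling_fn(env_state, accuracy=0.9, vocab=("coffee", "office", "none")):
    """Map an environment state to a symbol: the true symbol with probability
    `accuracy`, otherwise a uniformly random other symbol."""
    true_symbol = env_state["true_symbol"]  # ground truth, hidden from the agent
    if random.random() < accuracy:
        return true_symbol
    return random.choice([s for s in vocab if s != true_symbol])


# "Deliver coffee to the office": u0 --coffee--> u1 --office--> u2 (reward 1).
rm = RewardMachine(
    initial_state="u0",
    transitions={("u0", "coffee"): "u1", ("u1", "office"): "u2"},
    rewards={("u1", "office"): 1.0},
)

# With a noisy labelling function the agent's belief over the Reward Machine
# state becomes uncertain, which is what turns the problem into a POMDP.
for env_state in [{"true_symbol": "coffee"}, {"true_symbol": "office"}]:
    symbol = noisy_labelling_fn(env_state)
    reward = rm.step(symbol)
    print(symbol, rm.state, reward)
```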