Paper Title
Counterfactual Credit Assignment in Model-Free Reinforcement Learning
Paper Authors
Paper Abstract
Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e. disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We formulate a family of policy gradient algorithms that use these future-conditional value functions as baselines or critics, and show that they are provably low variance. To avoid the potential bias from conditioning on future information, we constrain the hindsight information to not contain information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative and challenging problems.
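As a rough sketch of the estimator the abstract describes (an illustration under assumed notation, not necessarily the authors' exact formulation), a future-conditional baseline enters the policy gradient as

\nabla_\theta J(\pi_\theta) \;=\; \mathbb{E}\Big[\sum_t \nabla_\theta \log \pi_\theta(A_t \mid S_t)\,\big(G_t - b(S_t, \Phi_t)\big)\Big],

where G_t is the return from time t and \Phi_t is a learned summary of the future trajectory (the hindsight information). Such an estimator remains unbiased as long as \Phi_t carries no information about the action A_t beyond what is already in S_t, i.e. \pi_\theta(A_t \mid S_t, \Phi_t) = \pi_\theta(A_t \mid S_t); this is the constraint on the hindsight information referred to in the abstract.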