Paper Title

Reinforcement Learning in Presence of Discrete Markovian Context Evolution

Authors

Hang Ren, Aivar Sootla, Taher Jafferjee, Junxiao Shen, Jun Wang, Haitham Bou-Ammar

Abstract

We consider a context-dependent Reinforcement Learning (RL) setting, which is characterized by: a) an unknown finite number of not directly observable contexts; b) abrupt (discontinuous) context changes occurring during an episode; and c) Markovian context evolution. We argue that this challenging case is often met in applications, and we tackle it using a Bayesian approach and variational inference. We adapt a sticky Hierarchical Dirichlet Process (HDP) prior for model learning, which is arguably best-suited for Markov process modeling. We then derive a context distillation procedure, which identifies and removes spurious contexts in an unsupervised fashion. We argue that the combination of these two components allows us to infer the number of contexts from data, thus dealing with the context cardinality assumption. We then find a representation of the optimal policy that enables efficient policy learning using off-the-shelf RL algorithms. Finally, we demonstrate empirically (using the gym environments cart-pole swing-up, drone, and intersection) that our approach succeeds where state-of-the-art methods from other frameworks fail, and we elaborate on the reasons for such failures.
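
The setting described in the abstract (a hidden, finite set of contexts that switches abruptly during an episode and evolves as a Markov chain) can be illustrated with a small toy simulation. The sketch below is not the authors' code: the number of contexts, the stickiness parameter `kappa`, and the per-context drift dynamics are hypothetical choices made only to show what sticky Markovian context evolution looks like from the agent's perspective, where the state is observed but the context index never is.

```python
import numpy as np

# Minimal sketch (assumptions, not the paper's implementation): a hidden discrete
# context evolves as a "sticky" Markov chain and abruptly switches the dynamics
# mid-episode. The agent only ever observes the state, never the context index.

rng = np.random.default_rng(0)

n_contexts = 3
kappa = 0.95  # self-transition mass: contexts tend to persist, then jump abruptly

# Markovian context transition matrix: mostly stay, occasionally switch.
P = np.full((n_contexts, n_contexts), (1.0 - kappa) / (n_contexts - 1))
np.fill_diagonal(P, kappa)

# Hypothetical per-context dynamics for a toy 1-D state.
drifts = np.array([-0.1, 0.0, 0.1])

def step(state, action, context):
    """One environment step; the hidden context selects which dynamics apply."""
    next_state = state + drifts[context] + 0.1 * action + 0.01 * rng.normal()
    reward = -abs(next_state)
    next_context = rng.choice(n_contexts, p=P[context])  # hidden Markov switch
    return next_state, reward, next_context

state, context = 0.0, 0
for t in range(10):
    action = rng.uniform(-1.0, 1.0)  # placeholder policy
    state, reward, context = step(state, action, context)
    print(f"t={t} state={state:+.3f} reward={reward:+.3f}")  # context stays hidden
```
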
