Paper Title
Soft Action Priors: Towards Robust Policy Transfer
Paper Authors
Paper Abstract
Despite success in many challenging problems, reinforcement learning (RL) is still confronted with sample inefficiency, which can be mitigated by introducing prior knowledge to agents. However, many transfer techniques in reinforcement learning make the limiting assumption that the teacher is an expert. In this paper, we use the action prior from the Reinforcement Learning as Inference framework - that is, a distribution over actions at each state which resembles a teacher policy, rather than a Bayesian prior - to recover state-of-the-art policy distillation techniques. Then, we propose a class of adaptive methods that can robustly exploit action priors by combining reward shaping and auxiliary regularization losses. In contrast to prior work, we develop algorithms for leveraging suboptimal action priors that may nevertheless impart valuable knowledge - which we call soft action priors. The proposed algorithms adapt by adjusting the strength of teacher feedback according to an estimate of the teacher's usefulness in each state. We perform tabular experiments, which show that the proposed methods achieve state-of-the-art performance, surpassing it when learning from suboptimal priors. Finally, we demonstrate the robustness of the adaptive algorithms in continuous action deep RL problems, in which adaptive algorithms considerably improved stability when compared to existing policy distillation methods.
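To make the framing above concrete, the following is a minimal sketch of the KL-regularized (soft) RL objective that underlies the Reinforcement Learning as Inference view, written in standard soft-RL notation rather than the paper's own formulation; the discount factor \(\gamma\), temperature \(\alpha\), and prior symbol \(\pi_0\) are notational assumptions here, with \(\pi_0(\cdot \mid s)\) playing the role of the teacher-like action prior described in the abstract.

\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, a_t) \;-\; \alpha\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s_t)\,\big\|\,\pi_0(\cdot \mid s_t)\big) \Big)\right]
\]

When \(\pi_0\) is uniform this reduces, up to a constant, to maximum-entropy RL; when \(\pi_0\) resembles a teacher policy, the KL term acts as a distillation-style regularizer, which is the sense in which the action-prior view can recover policy distillation techniques.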