Paper Title
Understanding Learned Reward Functions
Paper Authors
Paper Abstract
In many real-world tasks, it is not possible to procedurally specify an RL agent's reward function. In such cases, a reward function must instead be learned from interacting with and observing humans. However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences. Absent significant advances in reward learning, it is thus important to be able to audit learned reward functions to verify whether they truly capture user preferences. In this paper, we investigate techniques for interpreting learned reward functions. In particular, we apply saliency methods to identify failure modes and predict the robustness of reward functions. We find that learned reward functions often implement surprising algorithms that rely on contingent aspects of the environment. We also discover that existing interpretability techniques often attend to irrelevant changes in reward output, suggesting that reward interpretability may need significantly different methods from policy interpretability.
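The abstract mentions applying saliency methods to learned reward functions. As a rough illustration of the general idea (not the paper's specific implementation), the sketch below computes a simple gradient-based saliency map for a toy reward model: the magnitude of the reward's gradient with respect to each observation feature indicates which features the learned reward is most sensitive to. The `RewardNet` architecture and observation dimensions are hypothetical.

```python
# Minimal sketch of gradient-based saliency for a learned reward model.
# Assumes a PyTorch reward network mapping observations to scalar rewards;
# the architecture and observation shape here are illustrative only.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Toy learned reward model: observation -> scalar reward."""

    def __init__(self, obs_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def gradient_saliency(reward_net: nn.Module, obs: torch.Tensor) -> torch.Tensor:
    """Return |d reward / d obs|: how sensitive the learned reward is
    to each observation feature."""
    obs = obs.clone().detach().requires_grad_(True)
    # Sum over the batch so a single backward pass yields per-example gradients.
    reward_net(obs).sum().backward()
    return obs.grad.abs()


if __name__ == "__main__":
    net = RewardNet(obs_dim=16)
    batch = torch.randn(4, 16)        # hypothetical batch of observations
    saliency = gradient_saliency(net, batch)
    print(saliency.shape)             # (4, 16): one saliency score per input feature
```

High-saliency features that correspond to contingent, task-irrelevant aspects of the environment would be one signal of the failure modes the paper describes.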