Paper Title
DERAIL: Diagnostic Environments for Reward And Imitation Learning
Paper Authors
Paper Abstract
The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable, and cannot isolate failures. As a complementary approach, we develop a suite of simple diagnostic tasks that test individual facets of algorithm performance in isolation. We evaluate a range of common reward and imitation learning algorithms on our tasks. Our results confirm that algorithm performance is highly sensitive to implementation details. Moreover, in a case study of a popular preference-based reward learning implementation, we illustrate how the suite can pinpoint design flaws and rapidly evaluate candidate solutions. The environments are available at https://github.com/HumanCompatibleAI/seals.