Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

Paper title

Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

Authors

Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, Sergey Levine

Abstract

A wide range of reinforcement learning (RL) problems, including robustness, transfer learning, unsupervised RL, and emergent complexity, require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error-prone and takes a significant amount of developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent's learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments that maximize regret, defined as the difference between the protagonist and antagonist agents' returns. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.
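
The regret objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `estimate_regret`, the use of mean episode returns for both agents, the sign convention (antagonist return minus protagonist return), and all of the toy return values are assumptions made for illustration only.

```python
from statistics import mean


def estimate_regret(protagonist_returns, antagonist_returns):
    """Hypothetical regret estimate for one environment.

    Assumed sign convention: antagonist's mean return minus the
    protagonist's mean return, so regret is high when the antagonist
    succeeds where the protagonist does not.
    """
    return mean(antagonist_returns) - mean(protagonist_returns)


# Made-up episode returns per candidate environment:
# (protagonist episodes, antagonist episodes).
candidates = {
    "trivial": ([1.0, 1.0], [1.0, 1.0]),      # both agents solve it
    "unsolvable": ([0.0, 0.0], [0.0, 0.0]),   # neither agent can solve it
    "challenging": ([0.0, 0.5], [1.0, 0.5]),  # only the antagonist does well
}

regrets = {
    name: estimate_regret(p_returns, a_returns)
    for name, (p_returns, a_returns) in candidates.items()
}

# A regret-maximizing adversary would prefer the environment with the largest gap.
best = max(regrets, key=regrets.get)
print(regrets, best)
```

In this toy setting both the trivial and the unsolvable environment yield zero regret, so a regret-maximizing adversary has no incentive to propose either. This is consistent with the abstract's contrast with minimax training, which tends to produce unsolvable worst-case environments.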
