Paper Title
Grounding Aleatoric Uncertainty for Unsupervised Environment Design
Paper Authors
Paper Abstract
Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.
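As an illustrative sketch of the covariate-shift issue the abstract describes (the notation below is assumed for exposition and is not taken verbatim from the paper): let $\theta$ denote the aleatoric parameters of the environment, with ground-truth distribution $\overline{P}(\theta)$ in the deployment setting, and let a curriculum induce a training distribution $P_{\mathrm{train}}(\theta) \neq \overline{P}(\theta)$. The ground-truth utility of a policy $\pi$ is

$$
U(\pi) \;=\; \mathbb{E}_{\theta \sim \overline{P}(\theta)}\!\left[ V^{\pi}(\theta) \right],
$$

whereas naive curriculum training instead optimizes $\mathbb{E}_{\theta \sim P_{\mathrm{train}}(\theta)}\!\left[ V^{\pi}(\theta) \right]$. Curriculum-induced covariate shift (CICS) refers to the gap between these two objectives; when their maximizers differ, the curriculum-trained policy can be suboptimal under $\overline{P}(\theta)$, which is the failure mode SAMPLR is designed to correct while retaining curriculum-driven robustness.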