Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality

Paper title

Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality

Authors

Tom Zahavy, Yannick Schroecker, Feryal Behbahani, Kate Baumli, Sebastian Flennerhag, Shaobo Hou, Satinder Singh

Abstract

Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.
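
The constrained problem described in the abstract can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear in the abstract: $\Pi^n$ is a set of $n$ policies, $d_{\pi^i}$ is the state occupancy of policy $\pi^i$, $r_e$ is the extrinsic reward, $v_e^{*}$ is the optimal extrinsic value (assumed non-negative), and $\alpha \in [0, 1]$ is a hyperparameter that sets how close to optimal each policy must stay.

```latex
\begin{aligned}
\max_{\Pi^n = \{\pi^1, \dots, \pi^n\}} \quad
  & \mathrm{Diversity}\left(d_{\pi^1}, \dots, d_{\pi^n}\right) \\
\text{subject to} \quad
  & d_{\pi^i} \cdot r_e \;\ge\; \alpha \, v_e^{*},
  \qquad i = 1, \dots, n
\end{aligned}
```

Under this reading, $\alpha$ close to 1 forces every policy to be nearly optimal and leaves little room for diversity, while a smaller $\alpha$ trades extrinsic reward for a more diverse set. The diversity term is left abstract here; the abstract only states that it is based on distances between state occupancies.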

Paper title