Paper Title

Proximal Learning With Opponent-Learning Awareness

Paper Authors

Stephen Zhao, Chris Lu, Roger Baker Grosse, Jakob Nicolaus Foerster

Paper Abstract

Learning With Opponent-Learning Awareness (LOLA) (Foerster et al. [2018a]) is a multi-agent reinforcement learning algorithm that typically learns reciprocity-based cooperation in partially competitive environments. However, LOLA often fails to learn such behaviour on more complex policy spaces parameterized by neural networks, partly because the update rule is sensitive to the policy parameterization. This problem is especially pronounced in the opponent modeling setting, where the opponent's policy is unknown and must be inferred from observations; in such settings, LOLA is ill-specified because behaviorally equivalent opponent policies can result in non-equivalent updates. To address this shortcoming, we reinterpret LOLA as approximating a proximal operator, and then derive a new algorithm, proximal LOLA (POLA), which uses the proximal formulation directly. Unlike LOLA, the POLA updates are parameterization invariant, in the sense that when the proximal objective has a unique optimum, behaviorally equivalent policies result in behaviorally equivalent updates. We then present practical approximations to the ideal POLA update, which we evaluate in several partially competitive environments with function approximation and opponent modeling. This empirically demonstrates that POLA achieves reciprocity-based cooperation more reliably than LOLA.
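
To make the contrast described in the abstract concrete, here is a schematic sketch; the symbols $J_i$, $\alpha$, $\beta$, $\lambda$, and $D$ and the exact forms are illustrative assumptions, not the paper's notation. LOLA takes a gradient step through an assumed naive learning step of the opponent, so the result depends on how the opponent's policy is parameterized:

\[
\theta_1 \;\leftarrow\; \theta_1 + \alpha\, \nabla_{\theta_1} J_1\big(\theta_1,\; \theta_2 + \beta\, \nabla_{\theta_2} J_2(\theta_1, \theta_2)\big).
\]

A proximal-style update of the kind POLA builds on instead chooses the new parameters by maximizing the objective minus a penalty measured between the induced policies rather than between parameter vectors:

\[
\theta_1' \;\in\; \arg\max_{\tilde{\theta}_1}\; J_1\big(\tilde{\theta}_1,\; \theta_2'\big) \;-\; \lambda\, D\big(\pi_{\tilde{\theta}_1},\, \pi_{\theta_1}\big),
\]

with the opponent's $\theta_2'$ defined analogously. Because the penalty $D$ compares behaviours (policies) rather than parameters, behaviourally equivalent parameter settings receive behaviourally equivalent updates whenever the optimum is unique, which is the parameterization-invariance property stated in the abstract.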
