Paper Title

Model-Free Opponent Shaping

Paper Authors

Chris Lu, Timon Willi, Christian Schroeder de Witt, Jakob Foerster

Paper Abstract

In general-sum games, the interaction of self-interested learning agents commonly leads to collectively worst-case outcomes, such as defect-defect in the iterated prisoner's dilemma (IPD). To overcome this, some methods, such as Learning with Opponent-Learning Awareness (LOLA), shape their opponents' learning process. However, these methods are myopic since only a small number of steps can be anticipated, are asymmetric since they treat other agents as naive learners, and require the use of higher-order derivatives, which are calculated through white-box access to an opponent's differentiable learning algorithm. To address these issues, we propose Model-Free Opponent Shaping (M-FOS). M-FOS learns in a meta-game in which each meta-step is an episode of the underlying inner game. The meta-state consists of the inner policies, and the meta-policy produces a new inner policy to be used in the next episode. M-FOS then uses generic model-free optimisation methods to learn meta-policies that accomplish long-horizon opponent shaping. Empirically, M-FOS near-optimally exploits naive learners and other, more sophisticated algorithms from the literature. For example, to the best of our knowledge, it is the first method to learn the well-known Zero-Determinant (ZD) extortion strategy in the IPD. In the same settings, M-FOS leads to socially optimal outcomes under meta-self-play. Finally, we show that M-FOS can be scaled to high-dimensional settings.
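
The abstract's meta-game framing (meta-state = both players' inner policies, meta-policy = a map from the meta-state to M-FOS's next inner policy, meta-step = one episode of the inner game) can be made concrete with a toy sketch. The following is an illustrative simplification, not the authors' implementation: it assumes a one-shot prisoner's dilemma as the inner game, a naive gradient-learning opponent, and a crude evolution-strategies update standing in for the paper's "generic model-free optimisation methods". The paper itself uses the iterated game and richer policies, which is where behaviours such as ZD extortion arise; all function names below are hypothetical.

```python
# Minimal sketch of the M-FOS meta-game loop described in the abstract.
# Simplifying assumptions (not the paper's setup): one-shot prisoner's dilemma
# as the inner game, mixed strategies given by a cooperation probability,
# a naive gradient-learning opponent, and a toy evolution-strategies update
# as the generic model-free meta-optimiser.
import numpy as np

rng = np.random.default_rng(0)

# Prisoner's dilemma payoffs for the row player: reward, sucker, temptation, punishment.
R, S, T, P = 3.0, 0.0, 5.0, 1.0

def expected_payoff(p_self, p_other):
    """Row player's expected payoff when each side cooperates with the given probability."""
    return (p_self * p_other * R + p_self * (1 - p_other) * S
            + (1 - p_self) * p_other * T + (1 - p_self) * (1 - p_other) * P)

def naive_learner_step(p_opp, p_mfos, lr=0.2):
    """Naive opponent: one gradient-ascent step on its own expected payoff."""
    grad = p_mfos * (R - T) + (1 - p_mfos) * (S - P)  # d/dp_opp of its payoff
    return np.clip(p_opp + lr * grad, 0.0, 1.0)

def meta_policy(theta, meta_state):
    """Meta-policy: maps the meta-state (both inner policies) to M-FOS's next inner policy."""
    logits = theta[0] * meta_state[0] + theta[1] * meta_state[1] + theta[2]
    return 1.0 / (1.0 + np.exp(-logits))  # cooperation probability in [0, 1]

def meta_episode(theta, n_meta_steps=50):
    """One meta-episode: each meta-step is an episode of the underlying inner game."""
    p_mfos, p_opp = 0.5, 0.5                     # initial inner policies
    meta_return = 0.0
    for _ in range(n_meta_steps):
        meta_state = np.array([p_mfos, p_opp])   # meta-state = the inner policies
        p_mfos = meta_policy(theta, meta_state)  # produce the next inner policy
        meta_return += expected_payoff(p_mfos, p_opp)
        p_opp = naive_learner_step(p_opp, p_mfos)  # opponent learns between episodes
    return meta_return

# Generic model-free meta-optimisation (here: a crude evolution-strategies update).
theta = np.zeros(3)
for _ in range(200):
    noise = rng.normal(size=(16, 3))
    returns = np.array([meta_episode(theta + 0.1 * n) for n in noise])
    theta += 0.02 * (returns - returns.mean()) @ noise / (16 * 0.1)

print("trained meta-return over 50 inner episodes:", meta_episode(theta))
```

The point of the sketch is only the loop structure: the opponent updates its inner policy between episodes, while the meta-policy conditions on both inner policies and is optimised purely on the long-horizon meta-return, with no higher-order derivatives or white-box access to the opponent's learning algorithm.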
