Paper Title
A Game-Theoretic Perspective of Generalization in Reinforcement Learning
Paper Authors
Paper Abstract
Generalization in reinforcement learning (RL) is important for the real-world deployment of RL algorithms. Various schemes have been proposed to address the generalization issue, including transfer learning, multi-task learning, and meta-learning, as well as robust and adversarial reinforcement learning. However, there is neither a unified formulation of these schemes nor a comprehensive comparison of methods across them. In this work, we propose a game-theoretic framework for generalization in reinforcement learning, named GiRL, in which an RL agent is trained against an adversary over a set of tasks, and the adversary can manipulate the distribution over tasks within a given threshold. With different configurations, GiRL reduces to the various schemes mentioned above. To solve GiRL, we adapt a widely used method in game theory, the policy space response oracle (PSRO), with three important modifications: i) we use model-agnostic meta-learning (MAML) as the best-response oracle; ii) we propose a modified projected replicator dynamics, R-PRD, which ensures that the computed meta-strategy of the adversary falls within the threshold; and iii) we propose a protocol for few-shot learning of the multiple strategies during testing. Extensive experiments on MuJoCo environments demonstrate that our proposed methods outperform existing baselines, e.g., MAML.
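The second modification described in the abstract, R-PRD, constrains the adversary's meta-strategy (a distribution over training tasks) to stay within a threshold of a reference distribution. The sketch below is a minimal, hypothetical illustration of that kind of constraint: it projects a replicator-style update back onto the probability simplex intersected with an epsilon-box around a reference distribution p0. The box-shaped threshold, the bisection-based projection, the name project_to_constrained_simplex, and all the numbers are assumptions made for illustration, not the paper's actual R-PRD procedure.

import numpy as np

def project_to_constrained_simplex(q, p0, eps, tol=1e-10):
    # Euclidean projection of q onto {p : sum(p) = 1, max(p0 - eps, 0) <= p <= min(p0 + eps, 1)},
    # i.e., the task distributions allowed to the adversary under a box-shaped threshold (an assumption).
    lo = np.maximum(p0 - eps, 0.0)
    hi = np.minimum(p0 + eps, 1.0)
    # Bisection on the multiplier of the sum-to-one constraint: p(lam) = clip(q + lam, lo, hi)
    # is non-decreasing in lam, so the root of sum(p(lam)) = 1 can be bracketed and bisected.
    lam_lo, lam_hi = np.min(lo - q), np.max(hi - q)
    while lam_hi - lam_lo > tol:
        lam = 0.5 * (lam_lo + lam_hi)
        if np.clip(q + lam, lo, hi).sum() > 1.0:
            lam_hi = lam
        else:
            lam_lo = lam
    return np.clip(q + 0.5 * (lam_lo + lam_hi), lo, hi)

# Hypothetical usage: take a replicator-style step toward the adversary's better-paying
# tasks, then project the result back into the thresholded simplex.
p0 = np.array([0.25, 0.25, 0.25, 0.25])   # reference distribution over four tasks
payoff = np.array([1.0, 0.2, 0.5, 0.8])   # adversary's payoff per task (made-up numbers)
q = p0 * np.exp(0.5 * payoff)
q /= q.sum()
print(project_to_constrained_simplex(q, p0, eps=0.1))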