Paper Title
A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
Paper Authors
Paper Abstract
Model-based algorithms -- algorithms that explore the environment by building and utilizing an estimated model -- are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) -- for two-player zero-sum Markov games that outputs an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the numbers of actions of the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$, and is the first to match the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ up to a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm whenever $\min\{A,B\}=o(H^3)$, and it outputs a single Markov policy, whereas existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to design a provably efficient task-agnostic algorithm for zero-sum Markov games, as well as the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
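To make the complexity statements concrete, below is a minimal, purely illustrative sketch (not part of the paper) that evaluates the three bounds quoted in the abstract for hypothetical problem sizes, ignoring logarithmic and constant factors; the function names and parameter values are assumptions for illustration only.

```python
# Illustrative comparison of the episode-complexity bounds quoted in the abstract,
# with logarithmic and constant factors dropped. All names/values are hypothetical.

def nash_vi_episodes(H, S, A, B, eps):
    """New model-based bound for Nash-VI: O(H^3 * S * A * B / eps^2)."""
    return H**3 * S * A * B / eps**2

def prior_model_based_episodes(H, S, A, B, eps):
    """Previously best known model-based bound: O(H^4 * S^2 * A * B / eps^2)."""
    return H**4 * S**2 * A * B / eps**2

def lower_bound_episodes(H, S, A, B, eps):
    """Information-theoretic lower bound: Omega(H^3 * S * (A + B) / eps^2)."""
    return H**3 * S * (A + B) / eps**2

if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to illustrate the scaling.
    H, S, A, B, eps = 10, 100, 5, 5, 0.1
    print("Nash-VI upper bound:     %.2e" % nash_vi_episodes(H, S, A, B, eps))
    print("Prior model-based bound: %.2e" % prior_model_based_episodes(H, S, A, B, eps))
    print("Lower bound:             %.2e" % lower_bound_episodes(H, S, A, B, eps))
    # The remaining gap between the Nash-VI upper bound and the lower bound
    # is exactly the min{A, B} factor mentioned in the abstract.
    print("Gap factor (min{A,B}):   %d" % min(A, B))
```

Running the sketch with these hypothetical values shows the new bound sitting a factor of $\min\{A,B\}$ above the lower bound and well below the prior model-based guarantee, matching the comparison made in the abstract.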