Paper Title

Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets

Paper Authors

Han Zhong, Wei Xiong, Jiyuan Tan, Liwei Wang, Tong Zhang, Zhaoran Wang, Zhuoran Yang

Paper Abstract

We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state spaces, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without assuming uniform coverage of the dataset. We also prove an information-theoretic lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
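For concreteness, the "suboptimality" of an output pair (π̂, ν̂) in zero-sum MGs is standardly measured by the duality gap at the initial state; the formula below is a reconstruction using the conventional notation for this setting (V is the max-player's value, * denotes a best response), not a quote from the paper:

% Standard duality-gap suboptimality for a zero-sum MG (assumed notation):
% V_1^{*,\hat\nu} is the value of the best response against \hat\nu, and
% V_1^{\hat\pi,*} the value of the best response against \hat\pi.
\mathrm{SubOpt}(\hat{\pi}, \hat{\nu})
  \;=\; V_{1}^{*,\,\hat{\nu}}(s_{1}) \;-\; V_{1}^{\hat{\pi},\,*}(s_{1})
  \;\ge\; 0,

with the gap vanishing exactly when (π̂, ν̂) forms an NE from s₁.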
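To make the algorithmic recipe concrete, here is a minimal tabular sketch of the PMVI idea: backward induction over the horizon, count-based pessimism subtracted from the max-player's Q estimates, and a matrix-game NE solved at each state. The tabular setup, the bonus form β/√n, and all names below are illustrative assumptions; the paper itself works with general function approximation.

# Minimal tabular sketch of pessimistic minimax value iteration (PMVI).
# Hypothetical setup: finite episodic zero-sum MG with S states, A/B actions,
# horizon H, and an offline dataset summarized by transition counts.
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q):
    """Max-player NE strategy and value of the matrix game Q (shape A x B)."""
    A, B = Q.shape
    # Variables (v, x_1..x_A): maximize v s.t. (Q^T x)_b >= v, x in simplex.
    c = np.zeros(A + 1); c[0] = -1.0                  # linprog minimizes -v
    A_ub = np.hstack([np.ones((B, 1)), -Q.T])         # v - (Q^T x)_b <= 0
    A_eq = np.hstack([[[0.0]], np.ones((1, A))])      # sum_a x_a = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(B), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] + [(0.0, 1.0)] * A)
    return res.x[1:], -res.fun                        # (strategy, game value)

def pmvi(counts, rewards, H, beta):
    """counts[h][s, a, b, s']: visit counts from the offline dataset;
    rewards[h][s, a, b]: empirical mean rewards. Returns the max-player's
    pessimistic (lower) value estimates and per-state NE strategies."""
    S, A, B, _ = counts[0].shape
    V_lo = np.zeros((H + 1, S))
    policy = [[None] * S for _ in range(H)]
    for h in range(H - 1, -1, -1):                    # backward induction
        for s in range(S):
            n = counts[h][s].sum(axis=-1)             # N_h(s, a, b)
            P_hat = counts[h][s] / np.maximum(n, 1)[..., None]
            bonus = beta / np.sqrt(np.maximum(n, 1))  # count-based penalty
            Q_lo = np.clip(rewards[h][s] + P_hat @ V_lo[h + 1] - bonus,
                           0.0, H - h)                # pessimism for max-player
            policy[h][s], V_lo[h][s] = solve_matrix_game(Q_lo)
    return V_lo, policy

Per the abstract, PMVI builds pessimistic estimates for both players: a symmetric backward pass that adds (rather than subtracts) the bonus yields an upper estimate from which the min-player's strategy is extracted, and the output is the resulting policy pair.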
