Paper Title


Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path

Paper Authors

Zaman, Muhammad Aneeq uz, Koppel, Alec, Bhatt, Sujay, Başar, Tamer

Paper Abstract


We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we alleviate the need for a mean-field oracle by developing an algorithm that approximates the Mean-Field Equilibrium (MFE) using a single sample path of the generic agent. We call this {\it Sandbox Learning}, as it can be used as a warm start for any agent learning in a multi-agent non-cooperative setting. We adopt a two time-scale approach in which an online fixed-point recursion for the mean-field operates on the slower time-scale, in tandem with a control policy update on the faster time-scale for the generic agent. Given that the underlying Markov Decision Process (MDP) of the agent is communicating, we provide finite-sample convergence guarantees in terms of convergence of the mean-field and control policy to the mean-field equilibrium. The sample complexity of the Sandbox Learning algorithm is $\tilde{\mathcal{O}}(\epsilon^{-4})$, where $\epsilon$ is the MFE approximation error. This is comparable to works which assume access to an oracle. Finally, we empirically demonstrate the effectiveness of the Sandbox Learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class.
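
The abstract outlines a two time-scale scheme: the generic agent's control policy is updated on the faster time-scale while the mean-field estimate is updated by an online fixed-point recursion on the slower time-scale, all along one sample path. Below is a minimal sketch of that structure only, assuming a finite state/action space and a hypothetical environment interface `env.reset()` / `env.step(state, action, mean_field)`; the tabular Q-learning rule, epsilon-greedy exploration, and step-size exponents are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def sandbox_learning(env, n_states, n_actions, T, gamma=0.99,
                     alpha0=0.5, beta0=0.05, explore=0.1, seed=0):
    """Schematic two time-scale loop along a single sample path (illustrative only)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))       # action-value estimate (faster time-scale)
    mu = np.ones(n_states) / n_states         # mean-field estimate (slower time-scale)
    s = env.reset()                           # hypothetical environment interface
    for t in range(1, T + 1):
        alpha = alpha0 / t ** 0.6             # faster time-scale step size (policy)
        beta = beta0 / t ** 0.9               # slower time-scale step size (mean-field)

        # epsilon-greedy action from the current Q estimate
        a = rng.integers(n_actions) if rng.random() < explore else int(Q[s].argmax())
        s_next, r = env.step(s, a, mu)        # reward/transition may depend on the mean-field

        # faster time-scale: Q-learning update from the single observed transition
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

        # slower time-scale: online fixed-point recursion nudging the mean-field
        # estimate toward the empirical state visitation of the generic agent
        visit = np.zeros(n_states)
        visit[s_next] = 1.0
        mu = (1.0 - beta) * mu + beta * visit

        s = s_next
    return Q, mu
```

Because the mean-field step size decays faster than the policy step size, the policy effectively tracks a quasi-stationary mean-field estimate, which is the standard intuition behind two time-scale stochastic approximation.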
