Paper Title

Reinforcement Learning in Decentralized Stochastic Control Systems with Partial History Sharing

Paper Authors

Jalal Arabneydi, Aditya Mahajan

Abstract

In this paper, we are interested in systems with multiple agents that wish to collaborate in order to accomplish a common task while a) the agents have different information (decentralized information) and b) the agents do not know the model of the system completely, i.e., they may know the model partially or not at all. The agents must learn optimal strategies by interacting with their environment, i.e., by decentralized reinforcement learning (RL). The presence of multiple agents with different information makes decentralized reinforcement learning conceptually more difficult than centralized reinforcement learning. In this paper, we develop a decentralized reinforcement learning algorithm that learns an $\epsilon$-team-optimal solution for the partial history sharing information structure, which encompasses a large class of decentralized control systems including delayed sharing, control sharing, mean-field sharing, etc. Our approach consists of two main steps. In the first step, we convert the decentralized control system to an equivalent centralized POMDP (Partially Observable Markov Decision Process) using an existing method called the common information approach. However, the resulting POMDP requires complete knowledge of the system model. To circumvent this requirement, in the second step we introduce a new concept called "Incrementally Expanding Representation", using which we construct a finite-state RL algorithm whose approximation error converges to zero exponentially fast. We illustrate the proposed approach and verify it numerically by obtaining a decentralized Q-learning algorithm for the two-user Multi Access Broadcast Channel (MABC), which is a benchmark example for decentralized control systems.
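To make the two-step recipe concrete, the sketch below shows tabular Q-learning on a toy two-user MABC in Python. It is illustrative only and is not the paper's algorithm: the coordinator's state is simplified to the last channel feedback (idle/success/collision), whereas the paper's incrementally expanding representation builds a richer finite-state approximation of the common-information belief; the arrival probabilities, reward, and learning hyperparameters are all hypothetical choices. What it does capture is the common-information idea: a fictitious coordinator observes only shared information and picks a pair of prescriptions (one per user) mapping each user's local buffer state to a transmit/wait action.

```python
import random

# Toy two-user MABC simulator (hypothetical parameters, not from the paper).
P_ARRIVAL = (0.3, 0.6)        # assumed Bernoulli packet-arrival rates per user
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1
STEPS = 200_000

# Simplified common information: the last channel feedback.
# 0 = idle, 1 = successful transmission, 2 = collision.
FEEDBACK = (0, 1, 2)

# Coordinator prescriptions: for each user, 0 = stay silent,
# 1 = transmit whenever the local buffer holds a packet.
ACTIONS = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1)]

Q = {(s, a): 0.0 for s in FEEDBACK for a in ACTIONS}

def step(buffers, prescriptions):
    """Apply prescriptions to local buffers; return (buffers, feedback, reward)."""
    tx = [prescriptions[i] == 1 and buffers[i] == 1 for i in range(2)]
    if tx[0] and tx[1]:                      # collision: packets stay in buffers
        feedback, reward = 2, 0.0
    elif tx[0] or tx[1]:                     # exactly one transmission succeeds
        feedback, reward = 1, 1.0
        buffers = [0 if tx[i] else buffers[i] for i in range(2)]
    else:                                    # idle slot
        feedback, reward = 0, 0.0
    # New arrivals fill empty size-1 buffers; arrivals to full buffers are lost.
    buffers = [1 if (buffers[i] == 1 or random.random() < P_ARRIVAL[i]) else 0
               for i in range(2)]
    return buffers, feedback, reward

buffers, s = [0, 0], 0
for _ in range(STEPS):
    # epsilon-greedy choice of a joint prescription from common information only
    if random.random() < EPS:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda act: Q[(s, act)])
    buffers, s2, r = step(buffers, a)
    best_next = max(Q[(s2, act)] for act in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
    s = s2

# Greedy prescription pair learned for each feedback state
for s in FEEDBACK:
    print(s, max(ACTIONS, key=lambda act: Q[(s, act)]))
```

Because the coordinator acts only on common information, each user can run this computation locally and recover its own prescription, which is what makes the learned strategy implementable in a decentralized way.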
