Paper Title

Efficient Reinforcement Learning in Factored MDPs with Application to Constrained RL

Authors

Xiaoyu Chen, Jiachen Hu, Lihong Li, Liwei Wang

Abstract

We study reinforcement learning (RL) in episodic, factored Markov decision processes (FMDPs). We propose an algorithm called FMDP-BF, which leverages the factored structure of the FMDP. The regret of FMDP-BF is shown to be exponentially smaller than that of optimal algorithms designed for non-factored MDPs, and improves on the best previous result for FMDPs~\citep{osband2014near} by a factor of $\sqrt{H|\mathcal{S}_i|}$, where $|\mathcal{S}_i|$ is the cardinality of the factored state subspace and $H$ is the planning horizon. To show the optimality of our bounds, we also provide a lower bound for FMDPs, which indicates that our algorithm is near-optimal w.r.t. the number of timesteps $T$, the horizon $H$, and the factored state-action subspace cardinality. Finally, as an application, we study a new formulation of constrained RL, known as RL with knapsack constraints (RLwK), and provide the first sample-efficient algorithm based on FMDP-BF.
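As a hedged reading of the claimed improvement (the symbol $R^{\text{prev}}(T)$ for the regret bound of \citep{osband2014near} is our shorthand, not the paper's notation), the abstract's statement can be summarized as
\[
R^{\text{FMDP-BF}}(T) \;\lesssim\; \frac{R^{\text{prev}}(T)}{\sqrt{H\,|\mathcal{S}_i|}},
\]
where, as above, $|\mathcal{S}_i|$ is the cardinality of the $i$-th factored state subspace and $H$ is the planning horizon.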
