Parameterized MDPs and Reinforcement Learning Problems -- A Maximum Entropy Principle Based Framework

Paper title

Parameterized MDPs and Reinforcement Learning Problems -- A Maximum Entropy Principle Based Framework

Authors

Srivastava, Amber; Salapaka, Srinivasa M.

Abstract

We present a framework to address a class of sequential decision making problems. Our framework features learning the optimal control policy with robustness to noisy data, determining the unknown state and action parameters, and performing sensitivity analysis with respect to problem parameters. We consider two broad categories of sequential decision making problems modelled as infinite horizon Markov Decision Processes (MDPs) with (and without) an absorbing state. The central idea underlying our framework is to quantify exploration in terms of the Shannon Entropy of the trajectories under the MDP and determine the stochastic policy that maximizes it while guaranteeing a low value of the expected cost along a trajectory. This resulting policy enhances the quality of exploration early on in the learning process, and consequently allows faster convergence rates and robust solutions even in the presence of noisy data as demonstrated in our comparisons to popular algorithms such as Q-learning, Double Q-learning and entropy regularized Soft Q-learning. The framework extends to the class of parameterized MDP and RL problems, where states and actions are parameter dependent, and the objective is to determine the optimal parameters along with the corresponding optimal policy. Here, the associated cost function can possibly be non-convex with multiple poor local minima. Simulation results applied to a 5G small cell network problem demonstrate successful determination of communication routes and the small cell locations. We also obtain sensitivity measures to problem parameters and robustness to noisy environment data.
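
To make the entropy-versus-cost trade-off in the abstract concrete, the following is a minimal sketch and not the authors' algorithm. The paper maximizes the Shannon entropy of whole trajectories; this sketch instead uses the simpler per-step, action-entropy regularized (soft) value iteration, which is the idea behind the Soft Q-learning baseline the abstract mentions. The function name, the arrays `P` and `c`, the inverse temperature `beta`, the discount `gamma`, and the three-state toy MDP are all assumptions made for illustration.

```python
import numpy as np

def soft_value_iteration(P, c, beta, gamma=0.95, absorbing=None, iters=500):
    """Entropy-regularized value iteration for a cost-minimizing MDP.

    P: (S, A, S) transition probabilities, c: (S, A) expected step costs.
    beta: inverse temperature; larger beta means less exploration.
    absorbing: optional index of an absorbing state whose value is held at 0.
    Returns the soft value function and a Boltzmann (Gibbs) policy.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = c + gamma * (P @ V)                    # shape (S, A)
        m = Q.min(axis=1, keepdims=True)           # shift for numerical stability
        # soft-min over actions: -(1/beta) * log sum_a exp(-beta * Q(s, a))
        V = m[:, 0] - np.log(np.exp(-beta * (Q - m)).sum(axis=1)) / beta
        if absorbing is not None:
            V[absorbing] = 0.0
    Q = c + gamma * (P @ V)
    weights = np.exp(-beta * (Q - Q.min(axis=1, keepdims=True)))
    policy = weights / weights.sum(axis=1, keepdims=True)
    return V, policy

# Hypothetical toy chain: states 0 -> 1 -> 2, where state 2 is absorbing.
# Action 0 moves back to (or stays in) state 0; action 1 moves forward.
P = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 0] = 1.0
P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0
c = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

V_explore, pi_explore = soft_value_iteration(P, c, beta=1.0, absorbing=2)
V_greedy, pi_greedy = soft_value_iteration(P, c, beta=50.0, absorbing=2)
print("beta = 1  policy:\n", pi_explore)
print("beta = 50 policy:\n", pi_greedy)
```

With a small `beta` the policy stays stochastic and spreads probability over both actions; with a large `beta` it concentrates on the forward action, which has the lowest cost-to-go in this toy chain.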
