Paper Title
A Study of Policy Gradient on a Class of Exactly Solvable Models
Paper Authors
Paper Abstract
Policy gradient methods are widely used in reinforcement learning to optimize expected return. In this paper, for a special class of exactly solvable POMDPs, we study the evolution of the policy parameters as a continuous-state Markov chain whose transition probabilities are determined by the gradient of the distribution of the policy's value. Our approach relies heavily on random walk theory, specifically on affine Weyl groups. We construct a class of novel partially observable environments with controllable exploration difficulty, in which the value distribution, and hence the policy parameter evolution, can be derived analytically. Using these environments, we analyze the probabilistic convergence of policy gradient to different local maxima of the value function. To our knowledge, this is the first approach that analytically characterizes the policy gradient landscape for a class of POMDPs, leading to interesting insights into the difficulty of this problem.
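To make the abstract's central object concrete, here is a minimal Python sketch of how a stochastic policy-gradient update turns the policy parameter into a continuous-state Markov chain: the distribution of the next parameter value depends only on the current one. The environment (a two-armed bandit with Gaussian rewards), the sigmoid policy, and the learning rate alpha are illustrative assumptions, not the exactly solvable POMDPs constructed in the paper.

    import math
    import random

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def pg_step(theta, alpha=0.1):
        """One REINFORCE update: a Markov transition on the parameter theta."""
        p = sigmoid(theta)                      # probability of pulling arm 1
        a = 1 if random.random() < p else 0     # sample an action from the policy
        # Hypothetical rewards: arm 1 pays N(1, 1), arm 0 pays N(0, 1).
        r = random.gauss(1.0, 1.0) if a == 1 else random.gauss(0.0, 1.0)
        grad_log_pi = a - p                     # d/dtheta of log pi(a | theta) for a sigmoid policy
        return theta + alpha * r * grad_log_pi  # next state depends only on current theta

    theta = 0.0
    for _ in range(10000):
        theta = pg_step(theta)
    print("final theta = %.3f, P(arm 1) = %.3f" % (theta, sigmoid(theta)))

Because the reward is sampled, each update is random, and the sequence of theta values forms exactly the kind of continuous-state Markov chain the paper analyzes; in the paper's environments this chain's transition law, and hence the convergence to different local maxima of the value, can be derived in closed form rather than simulated.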