Paper Title
Escaping High-order Saddles in Policy Optimization for Linear Quadratic Gaussian (LQG) Control
Paper Authors
Paper Abstract
First-order policy optimization has been widely used in reinforcement learning, and it is guaranteed to find the optimal policy for the state-feedback linear quadratic regulator (LQR). However, the performance of policy optimization remains unclear for linear quadratic Gaussian (LQG) control, where the LQG cost admits spurious suboptimal stationary points. In this paper, we introduce a novel perturbed policy gradient descent (PGD) method that escapes a large class of bad stationary points, including high-order saddles. In particular, exploiting the specific structure of LQG, we introduce a novel reparameterization procedure that converts an iterate at a high-order saddle into one at a strict saddle, which the standard random perturbations in PGD can then escape efficiently. We further characterize the class of high-order saddles that our algorithm can escape.
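To make the perturb-and-descend mechanism concrete, below is a minimal Python sketch of a generic perturbed gradient descent loop: plain gradient steps, plus a small random perturbation whenever the gradient nearly vanishes. All names, step sizes, and the toy objective are illustrative assumptions, not the paper's actual algorithm; in particular, the paper's LQG-specific reparameterization (which turns high-order saddles into strict ones before perturbing) is omitted here.

```python
import numpy as np

def perturbed_gradient_descent(grad, theta0, step=1e-3, radius=1e-2,
                               grad_tol=1e-6, n_iters=10_000, seed=0):
    """Generic perturbed-gradient loop (an illustrative sketch, not the
    paper's exact method): take plain gradient steps, and inject a small
    random perturbation whenever the gradient is nearly zero, i.e., when
    the iterate may be stuck near a stationary point."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iters):
        g = grad(theta)
        if np.linalg.norm(g) < grad_tol:
            # Near a stationary point: perturb randomly so that gradient
            # descent can escape if the point is a strict saddle.
            theta = theta + radius * rng.standard_normal(theta.shape)
        else:
            theta = theta - step * g
    return theta

# Toy objective f(x, y) = x**2 + y**4/4 - y**2/2: strict saddle at the
# origin, minima at (0, +/-1). Starting exactly at the saddle, the random
# perturbation lets the iterate slide down to a minimum.
grad_f = lambda t: np.array([2.0 * t[0], t[1] ** 3 - t[1]])
print(perturbed_gradient_descent(grad_f, np.zeros(2)))  # approx (0, +/-1)
```

In the paper's setting, the analogous loop would run on a dynamic output-feedback controller, with the reparameterization applied at near-stationary iterates so that the resulting saddle is strict and the random perturbation provably escapes it.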