Paper Title
A Neural Network Approach for Stochastic Optimal Control
Paper Authors
Paper Abstract
We present a neural network approach for approximating the value function of high-dimensional stochastic control problems. Our training process simultaneously updates our value function estimate and identifies the part of the state space likely to be visited by optimal trajectories. Our approach leverages insights from optimal control theory and the fundamental relation between semi-linear parabolic partial differential equations and forward-backward stochastic differential equations. To focus the sampling on relevant states during neural network training, we use the stochastic Pontryagin maximum principle (PMP) to obtain the optimal controls for the current value function estimate. By design, our approach coincides with the method of characteristics for the non-viscous Hamilton-Jacobi-Bellman equation arising in deterministic control problems. Our training loss consists of a weighted sum of the objective functional of the control problem and penalty terms that enforce the HJB equations along the sampled trajectories. Importantly, training is unsupervised in that it does not require solutions of the control problem. Our numerical experiments highlight our scheme's ability to identify the relevant parts of the state space and produce meaningful value estimates. Using a two-dimensional model problem, we demonstrate the importance of the stochastic PMP to inform the sampling and compare to a finite element approach. With a nonlinear control affine quadcopter example, we illustrate that our approach can handle complicated dynamics. For a 100-dimensional benchmark problem, we demonstrate that our approach improves accuracy and time-to-solution and, via a modification, we show the wider applicability of our scheme.
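The loss structure described in the abstract (an objective functional plus a weighted HJB penalty, both evaluated along trajectories driven by the feedback control implied by the current value function estimate) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not the paper's setup: the dynamics dx = u dt + sigma dW, the running cost 0.5|u|^2, the terminal cost 0.5|x|^2, the one-parameter quadratic family `phi_derivatives` standing in for a neural network, and the names `training_loss`, `theta`, and `beta`. For this toy problem the control minimizing the Hamiltonian is u = -grad Phi, and theta = 1 recovers the exact value function, so the HJB penalty should vanish there.

```python
import numpy as np

def phi_derivatives(theta, t, x, T, sigma):
    """Hypothetical stand-in for a neural network value function:
    Phi(t, x) = 0.5 * p(t) * |x|^2 + 0.5 * sigma^2 * d * log(1 + T - t),
    with p(t) = 1 / (1 + theta * (T - t)). theta = 1 is the exact solution
    of the toy linear-quadratic problem assumed in this sketch."""
    d = x.shape[1]
    s = 1.0 + theta * (T - t)
    p = 1.0 / s
    grad = p * x                                        # gradient in x, shape (n, d)
    dphi_dt = (0.5 * (theta / s**2) * np.sum(x**2, axis=1)
               - 0.5 * sigma**2 * d / (1.0 + T - t))    # time derivative, shape (n,)
    lap = d * p * np.ones(x.shape[0])                   # Laplacian in x, shape (n,)
    return grad, dphi_dt, lap

def training_loss(theta, x0, T=1.0, sigma=0.5, n_steps=50, beta=1.0, seed=0):
    """Objective functional plus beta-weighted HJB penalty along sampled trajectories."""
    rng = np.random.default_rng(seed)
    h = T / n_steps
    x = x0.copy()
    running_cost = np.zeros(x.shape[0])
    hjb_penalty = np.zeros(x.shape[0])
    for k in range(n_steps):
        t = k * h
        grad, dphi_dt, lap = phi_derivatives(theta, t, x, T, sigma)
        # Feedback control from the current value estimate:
        # minimizer of 0.5|u|^2 + grad . u for the assumed toy Hamiltonian.
        u = -grad
        # HJB residual for the toy problem:
        # dPhi/dt - 0.5|grad Phi|^2 + 0.5 sigma^2 Laplacian(Phi) = 0.
        residual = dphi_dt - 0.5 * np.sum(grad**2, axis=1) + 0.5 * sigma**2 * lap
        running_cost += h * 0.5 * np.sum(u**2, axis=1)
        hjb_penalty += h * np.abs(residual)
        # Euler-Maruyama step of the controlled forward SDE.
        x = x + h * u + sigma * np.sqrt(h) * rng.standard_normal(x.shape)
    terminal_cost = 0.5 * np.sum(x**2, axis=1)
    objective = np.mean(running_cost + terminal_cost)
    penalty = np.mean(hjb_penalty)
    return objective + beta * penalty, objective, penalty

if __name__ == "__main__":
    x0 = np.ones((64, 2))
    for theta in (1.0, 3.0):
        loss, objective, penalty = training_loss(theta, x0)
        print(f"theta={theta}: loss={loss:.4f}, objective={objective:.4f}, penalty={penalty:.2e}")
```

No reference solution of the control problem enters the loss, which is the sense in which such training is unsupervised. In an actual implementation the closed-form derivatives would presumably be replaced by automatic differentiation of a network, and `theta` would be updated by a gradient-based optimizer.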