Paper title
Do Differentiable Simulators Give Better Policy Gradients?
Paper authors
Paper abstract
Differentiable simulators promise faster computation time for reinforcement learning by replacing zeroth-order gradient estimates of a stochastic objective with an estimate based on first-order gradients. However, it remains unclear what factors determine the performance of the two estimators on complex landscapes that involve long-horizon planning and control of physical systems, despite the crucial relevance of this question to the utility of differentiable simulators. We show that characteristics of certain physical systems, such as stiffness or discontinuities, may compromise the efficacy of the first-order estimator, and we analyze this phenomenon through the lens of bias and variance. We additionally propose an $\alpha$-order gradient estimator, with $\alpha \in [0,1]$, which correctly utilizes exact gradients to combine the efficiency of first-order estimates with the robustness of zeroth-order methods. We demonstrate the pitfalls of traditional estimators and the advantages of the $\alpha$-order estimator on some numerical examples.
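The contrast between the two estimators can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the abstract: the objective is a Gaussian-smoothed scalar function $F(\theta) = \mathbb{E}_{w \sim \mathcal{N}(0, \sigma^2)}[f(\theta + w)]$, the test function is a Heaviside step (a stand-in for a discontinuity), the function names are hypothetical, and the $\alpha$-order estimator is written as a plain convex combination of the first-order and zeroth-order estimates, which is one simple reading of "combining" the two.

```python
import numpy as np


def heaviside(x):
    """Step function: a toy stand-in for a discontinuous objective (assumption)."""
    return np.where(x >= 0.0, 1.0, 0.0)


def heaviside_grad(x):
    """Pointwise derivative of the step: zero almost everywhere."""
    return np.zeros_like(x, dtype=float)


def zeroth_order_grad(f, theta, sigma, n, rng):
    """Score-function (zeroth-order) estimate of dF/dtheta with a baseline f(theta)."""
    w = rng.normal(0.0, sigma, size=n)
    baseline = f(np.asarray(theta, dtype=float))
    return np.mean((f(theta + w) - baseline) * w) / sigma**2


def first_order_grad(grad_f, theta, sigma, n, rng):
    """Pathwise (first-order) estimate: average of exact gradients at perturbed points."""
    w = rng.normal(0.0, sigma, size=n)
    return np.mean(grad_f(theta + w))


def alpha_order_grad(f, grad_f, theta, sigma, n, alpha, rng):
    """Hypothetical alpha-order estimate: convex combination of the two estimates.

    alpha = 1 recovers the first-order estimate, alpha = 0 the zeroth-order one.
    """
    g1 = first_order_grad(grad_f, theta, sigma, n, rng)
    g0 = zeroth_order_grad(f, theta, sigma, n, rng)
    return alpha * g1 + (1.0 - alpha) * g0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta, sigma, n = 0.0, 1.0, 200_000
    # Analytic gradient of the smoothed step at theta = 0 is 1 / (sigma * sqrt(2 * pi)).
    print("analytic    :", 1.0 / (sigma * np.sqrt(2.0 * np.pi)))
    print("zeroth-order:", zeroth_order_grad(heaviside, theta, sigma, n, rng))
    print("first-order :", first_order_grad(heaviside_grad, theta, sigma, n, rng))
    print("alpha = 0.5 :", alpha_order_grad(heaviside, heaviside_grad, theta, sigma, n, 0.5, rng))
```

In this toy setting the first-order estimate is exactly zero for any sample size, because the pointwise derivative of the step vanishes almost everywhere, while the zeroth-order estimate converges to the true gradient of the smoothed objective. On a smooth function such as $f(x) = x^2$ both estimates are unbiased and the first-order one typically has much lower variance, which is the trade-off an interpolating estimator is meant to balance.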