Paper Title
Estimating Q(s,s') with Deep Deterministic Dynamics Gradients
Paper Authors
Paper Abstract
In this paper, we introduce a novel form of value function, $Q(s, s')$, that expresses the utility of transitioning from a state $s$ to a neighboring state $s'$ and then acting optimally thereafter. In order to derive an optimal policy, we develop a forward dynamics model that learns to make next-state predictions that maximize this value. This formulation decouples actions from values while still learning off-policy. We highlight the benefits of this approach in terms of value function transfer, learning within redundant action spaces, and learning off-policy from state observations generated by sub-optimal or completely random policies. Code and videos are available at http://sites.google.com/view/qss-paper.
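The following is a minimal sketch of the decoupling described above, not the authors' implementation: a value network $Q(s, s')$ over state pairs, a forward dynamics model that proposes the value-maximizing next state (the analogue of a deterministic actor), and an inverse dynamics model that recovers the action realizing the proposed transition. All network shapes, dimensions, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
STATE_DIM, ACTION_DIM = 8, 2

# Q(s, s'): utility of transitioning from s to s', then acting optimally.
q_net = nn.Sequential(nn.Linear(2 * STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

# Forward dynamics model tau(s) -> s': trained to propose the next state
# that maximizes Q(s, s'), playing the role the actor plays in DDPG-style
# methods. Note it outputs a state, not an action.
tau = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, STATE_DIM))

# Inverse dynamics model I(s, s') -> a: recovers the action that realizes a
# proposed transition, so actions never enter the value function itself.
inv = nn.Sequential(nn.Linear(2 * STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM))

q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
tau_opt = torch.optim.Adam(tau.parameters(), lr=1e-3)

def q_value(s, s_next):
    return q_net(torch.cat([s, s_next], dim=-1))

def update(s, s_next, r, gamma=0.99):
    """One gradient step on a batch of observed (s, s', r) transitions.

    Because the update only consumes state pairs and rewards, it can learn
    off-policy from trajectories produced by sub-optimal or random policies.
    """
    # Bellman-style target: r + gamma * Q(s', tau(s')), with tau(s')
    # standing in for the max over reachable successor states.
    with torch.no_grad():
        target = r + gamma * q_value(s_next, tau(s_next))
    td_loss = ((q_value(s, s_next) - target) ** 2).mean()
    q_opt.zero_grad(); td_loss.backward(); q_opt.step()

    # Dynamics gradient: nudge tau(s) toward states with higher Q(s, s'),
    # mirroring the deterministic policy gradient on an actor.
    model_loss = -q_value(s, tau(s)).mean()
    tau_opt.zero_grad(); model_loss.backward(); tau_opt.step()

def act(s):
    # At decision time: propose s' = tau(s), then recover the action via
    # the inverse dynamics model (trained separately on observed triples).
    return inv(torch.cat([s, tau(s)], dim=-1))
```

Because values attach to state transitions rather than actions, the learned $Q(s, s')$ could in principle be transferred across tasks whose action spaces differ, with only the inverse dynamics model retrained, which is one of the benefits the abstract highlights.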