Paper Title

Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy

Paper Authors

Xiyao Wang, Wichayaporn Wongkamjan, Furong Huang

Paper Abstract

Model-based reinforcement learning (RL) often achieves higher sample efficiency in practice than model-free RL by learning a dynamics model to generate samples for policy learning. Previous works learn a dynamics model that fits under the empirical state-action visitation distribution for all historical policies, i.e., the sample replay buffer. However, in this paper, we observe that fitting the dynamics model under the distribution for \emph{all historical policies} does not necessarily benefit model prediction for the \emph{current policy} since the policy in use is constantly evolving over time. The evolving policy during training will cause state-action visitation distribution shifts. We theoretically analyze how this distribution shift over historical policies affects the model learning and model rollouts. We then propose a novel dynamics model learning method, named \textit{Policy-adapted Dynamics Model Learning (PDML)}. PDML dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance combined with the state-of-the-art model-based RL methods.
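The abstract describes PDML only at a high level. As a rough illustration of the underlying idea, re-weighting replay-buffer transitions toward the current policy before fitting the dynamics model, here is a minimal PyTorch sketch. It is not the authors' implementation: every name in it (DynamicsModel, policy_adapted_weights, weighted_model_loss) is a hypothetical placeholder, and the importance-ratio weighting is a simple stand-in for the paper's actual policy-adapted mixture weighting.

```python
# Illustrative sketch only (not the PDML authors' code): re-weight replay-buffer
# transitions toward the current policy before fitting the dynamics model.
# All class/function names below are hypothetical placeholders.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Small MLP predicting the next-state change from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def policy_adapted_weights(logp_current, logp_behavior, temperature=1.0):
    """Up-weight transitions whose actions are likely under the CURRENT policy
    relative to the historical behavior policy that collected them.
    This importance-style ratio is one simple proxy for adjusting the historical
    policy mixture; the exact PDML weighting in the paper differs."""
    log_ratio = (logp_current - logp_behavior) / temperature
    w = torch.softmax(log_ratio, dim=0) * log_ratio.numel()  # mean weight ~ 1
    return w.detach()

def weighted_model_loss(model, s, a, s_next, weights):
    """Per-sample weighted MSE on the predicted state change."""
    pred_delta = model(s, a)
    per_sample = ((pred_delta - (s_next - s)) ** 2).mean(dim=-1)
    return (weights * per_sample).mean()

if __name__ == "__main__":
    # Toy batch standing in for a replay-buffer sample.
    B, S, A = 64, 11, 3
    s, a, s_next = torch.randn(B, S), torch.randn(B, A), torch.randn(B, S)
    logp_current, logp_behavior = torch.randn(B), torch.randn(B)

    model = DynamicsModel(S, A)
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)

    w = policy_adapted_weights(logp_current, logp_behavior)
    loss = weighted_model_loss(model, s, a, s_next, w)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"weighted model loss: {loss.item():.4f}")
```

In a full training loop, the weights would be recomputed as the policy evolves, so that the distribution used for model learning tracks the current policy's state-action visitation distribution rather than the uniform replay-buffer distribution over all historical policies.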
