Paper Title

Planning with Exploration: Addressing Dynamics Bottleneck in Model-based Reinforcement Learning

Authors

Xiyao Wang, Junge Zhang, Wenzhen Huang, Qiyue Yin

Abstract

Model-based reinforcement learning (MBRL) is believed to have higher sample efficiency than model-free reinforcement learning (MFRL). However, MBRL is plagued by the dynamics bottleneck dilemma: the phenomenon in which an algorithm's performance gets stuck in a local optimum instead of improving as the number of interaction steps with the environment increases, meaning that more data does not bring better performance. In this paper, through theoretical analysis we find that trajectory reward estimation error is the main cause of the dynamics bottleneck dilemma. We derive an upper bound on the trajectory reward estimation error and point out that increasing the agent's exploration ability is the key to reducing this error, thereby alleviating the dynamics bottleneck dilemma. Motivated by this, we propose a model-based control method combined with exploration, named MOdel-based Progressive Entropy-based Exploration (MOPE2). We conduct experiments on several complex continuous control benchmark tasks. The results verify that MOPE2 effectively alleviates the dynamics bottleneck dilemma and achieves higher sample efficiency than previous MBRL and MFRL algorithms.
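The abstract does not spell out MOPE2's algorithm, so the sketch below is only a generic illustration of the idea it describes: during model-based planning, candidate trajectories are scored by their estimated reward plus an exploration bonus. All names (`plan_with_exploration`, `dynamics_ensemble`, `reward_fn`) and the use of ensemble disagreement as a stand-in for an entropy-based bonus are assumptions, not details from the paper; `dynamics_ensemble` is assumed to be a list of vectorized callables mapping `(states, actions)` to predicted next states, and `reward_fn` a vectorized reward estimator.

```python
# Illustrative sketch only, NOT the paper's MOPE2 algorithm: a random-shooting
# planner that scores candidate action sequences with a learned dynamics model
# ensemble and adds an exploration bonus (ensemble disagreement, assumed here
# as a proxy for an entropy-style bonus) to the trajectory reward estimate.
import numpy as np

def plan_with_exploration(state, dynamics_ensemble, reward_fn, action_dim,
                          horizon=10, n_candidates=500, bonus_weight=0.1,
                          rng=None):
    """Return the first action of the candidate action sequence whose
    exploration-augmented trajectory reward estimate is highest."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences uniformly in [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    scores = np.zeros(n_candidates)
    states = np.repeat(state[None, :], n_candidates, axis=0)
    for t in range(horizon):
        # Each ensemble member predicts the next state; their mean drives the
        # rollout, and their spread serves as the exploration bonus.
        preds = np.stack([m(states, actions[:, t]) for m in dynamics_ensemble])
        next_states = preds.mean(axis=0)
        disagreement = preds.std(axis=0).mean(axis=-1)
        scores += reward_fn(states, actions[:, t]) + bonus_weight * disagreement
        states = next_states
    return actions[np.argmax(scores), 0]
```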
