Learning to Plan Optimistically: Uncertainty-Guided Deep Exploration via Latent Model Ensembles

Title

Learning to Plan Optimistically: Uncertainty-Guided Deep Exploration via Latent Model Ensembles

Authors

Tim Seyde, Wilko Schwarting, Sertac Karaman, Daniela Rus

Abstract

Learning complex robot behaviors through interaction requires structured exploration. Planning should target interactions with the potential to optimize long-term performance, while only reducing uncertainty where conducive to this objective. This paper presents Latent Optimistic Value Exploration (LOVE), a strategy that enables deep exploration through optimism in the face of uncertain long-term rewards. We combine latent world models with value function estimation to predict infinite-horizon returns and recover associated uncertainty via ensembling. The policy is then trained on an upper confidence bound (UCB) objective to identify and select the interactions most promising to improve long-term performance. We apply LOVE to visual robot control tasks in continuous action spaces and demonstrate on average more than 20% improved sample efficiency in comparison to state-of-the-art and other exploration objectives. In sparse and hard-to-explore environments, we achieve an average improvement of over 30%.
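The abstract does not spell out the UCB objective, so the following is only a minimal sketch of the general idea, assuming the common form "ensemble mean plus a scaled ensemble standard deviation". The function name `ucb_scores`, the weight `beta`, and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev

def ucb_scores(ensemble_returns, beta=1.0):
    """Score each candidate by mean + beta * std across ensemble members.

    ensemble_returns[k][i] is the return predicted by ensemble member k
    for candidate i. The mean-plus-std form is an assumption, not the
    paper's stated objective.
    """
    per_candidate = list(zip(*ensemble_returns))  # regroup values by candidate
    return [fmean(v) + beta * pstdev(v) for v in per_candidate]

# Toy data (made up): 3 ensemble members, 2 candidate interactions.
# Candidate 0: all members agree. Candidate 1: lower mean, more disagreement.
predictions = [
    [1.0, 0.0],
    [1.0, 0.5],
    [1.0, 2.0],
]

greedy = ucb_scores(predictions, beta=0.0)      # plain ensemble mean
optimistic = ucb_scores(predictions, beta=1.0)  # mean plus uncertainty bonus

best_greedy = greedy.index(max(greedy))
best_optimistic = optimistic.index(max(optimistic))
```

With `beta=0.0` the score reduces to the ensemble mean and the well-understood candidate is preferred. With `beta=1.0` the candidate on which the members disagree is preferred, which is the sense in which such an objective is optimistic under uncertainty.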
