Paper Title
Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL
Paper Authors
Paper Abstract
Model-based reinforcement learning promises to learn an optimal policy from fewer environment interactions than model-free reinforcement learning by learning an intermediate model of the environment to predict future interactions. When predicting a sequence of interactions, the rollout length, which limits the prediction horizon, is a critical hyperparameter, because the accuracy of the predictions diminishes in regions further away from real experience. As a result, a longer rollout length can lead to an overall worse policy in the long run; the hyperparameter thus trades off quality against efficiency. In this work, we frame tuning the rollout length as a meta-level sequential decision-making problem: given a fixed budget of environment interactions, optimize the final policy learned by model-based reinforcement learning by adapting the hyperparameter dynamically based on feedback from the learning process, such as the accuracy of the model and the remaining budget of interactions. We solve this meta-level decision problem with model-free deep reinforcement learning and demonstrate that our approach outperforms common heuristic baselines on two well-known reinforcement learning environments.
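To make the meta-level formulation concrete, the sketch below illustrates one possible shape of the outer loop in Python: a meta-agent observes features such as model accuracy and the remaining interaction budget, selects a rollout length for the next training phase, and is rewarded based on the resulting policy's performance. Every name here (`MetaAgent`, `run_mbrl_phase`, the discretized rollout choices, the phase-return reward) is a hypothetical illustration under assumed simplifications, not the paper's actual implementation.

```python
import random

# Candidate rollout lengths; this discretization is an assumption of the sketch.
ROLLOUT_CHOICES = [1, 2, 4, 8]


class MetaAgent:
    """Stand-in for a model-free deep RL agent (e.g., DQN) acting at the meta level."""

    def act(self, meta_state):
        # A trained meta-agent would map meta-state features to a rollout
        # length; random choice is a placeholder for that learned policy.
        return random.choice(ROLLOUT_CHOICES)

    def observe(self, meta_state, action, reward, next_meta_state):
        # A learning update (e.g., a Q-learning step) would go here.
        pass


def run_mbrl_phase(rollout_length, n_interactions):
    """Dummy inner MBRL phase: collect real data, fit the dynamics model, and
    train the policy on imagined rollouts of the given length (not shown).
    Returns (model accuracy, policy return); the values are placeholders."""
    model_accuracy = random.random()
    policy_return = random.random() - 0.01 * rollout_length * (1 - model_accuracy)
    return model_accuracy, policy_return


def train_with_adaptive_rollout(total_budget, phase_interactions=1000):
    """Outer loop: the meta-agent picks a rollout length for each training
    phase from feedback such as model accuracy and the remaining budget."""
    meta_agent = MetaAgent()
    remaining = total_budget
    model_accuracy = 0.0
    while remaining > 0:
        meta_state = (model_accuracy, remaining / total_budget)
        rollout_length = meta_agent.act(meta_state)
        used = min(phase_interactions, remaining)
        model_accuracy, policy_return = run_mbrl_phase(rollout_length, used)
        remaining -= used
        next_meta_state = (model_accuracy, remaining / total_budget)
        # Rewarding the meta-agent with the phase's policy return is one
        # simple proxy for optimizing the final policy (an assumption here).
        meta_agent.observe(meta_state, rollout_length, policy_return, next_meta_state)


if __name__ == "__main__":
    train_with_adaptive_rollout(total_budget=10_000)
```

One design point this sketch surfaces: because the meta-agent's objective is the final policy under a fixed budget, including the remaining budget in the meta-state lets it trade off short, conservative rollouts early (when the model is inaccurate) against longer rollouts later.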