Paper Title

The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning

Authors

Harm van Seijen, Hadi Nekoei, Evan Racah, Sarath Chandar

Abstract

Deep model-based Reinforcement Learning (RL) has the potential to substantially improve the sample-efficiency of deep RL. While various challenges have long held it back, a number of papers have recently come out reporting success with deep model-based methods. This is a great development, but the lack of a consistent metric to evaluate such methods makes it difficult to compare various approaches. For example, the common single-task sample-efficiency metric conflates improvements due to model-based learning with various other aspects, such as representation learning, making it difficult to assess true progress on model-based RL. To address this, we introduce an experimental setup to evaluate model-based behavior of RL methods, inspired by work from neuroscience on detecting model-based behavior in humans and animals. Our metric based on this setup, the Local Change Adaptation (LoCA) regret, measures how quickly an RL method adapts to a local change in the environment. Our metric can identify model-based behavior even if the method uses a poor representation, and provides insight into how close a method's behavior is to optimal model-based behavior. We use our setup to evaluate the model-based behavior of MuZero on a variation of the classic Mountain Car task.
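The abstract's core idea, measuring how quickly an agent adapts after a local change to the environment, can be sketched as a two-phase evaluation loop. The code below is a hedged illustration only: the agent interface (`train_episode`, `evaluate_episode`) and the exact regret computation are placeholder assumptions of ours, not the paper's definition of the LoCA regret, which is specified precisely for its two-task setup.

```python
# Hypothetical sketch of a LoCA-style evaluation loop (not the authors' code).
# Phase 1: train the agent until (approximate) convergence on the original task.
# Phase 2: apply a local change to the environment, then accumulate the agent's
# shortfall relative to the optimal post-change return while it adapts.

def loca_style_regret(agent, env_original, env_changed, optimal_return,
                      pretrain_episodes=100, adapt_episodes=20):
    """Illustrative only: `agent` is assumed to expose train_episode(env)
    and evaluate_episode(env) -> episodic return; both names are placeholders."""
    for _ in range(pretrain_episodes):
        agent.train_episode(env_original)        # converge on the original task
    regret = 0.0
    for _ in range(adapt_episodes):
        agent.train_episode(env_changed)         # adapt online to the local change
        ret = agent.evaluate_episode(env_changed)
        regret += max(0.0, optimal_return - ret)  # shortfall vs. optimal behavior
    return regret
```

An agent with strong model-based behavior would update its plan almost immediately after the local change, keeping the accumulated shortfall small; a purely model-free agent typically needs many episodes to relearn, accumulating a larger regret.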
