Paper Title

The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning

Authors

Harm van Seijen, Hadi Nekoei, Evan Racah, Sarath Chandar

Abstract

Deep model-based Reinforcement Learning (RL) has the potential to substantially improve the sample-efficiency of deep RL. While various challenges have long held it back, a number of papers have recently come out reporting success with deep model-based methods. This is a great development, but the lack of a consistent metric to evaluate such methods makes it difficult to compare various approaches. For example, the common single-task sample-efficiency metric conflates improvements due to model-based learning with various other aspects, such as representation learning, making it difficult to assess true progress on model-based RL. To address this, we introduce an experimental setup to evaluate model-based behavior of RL methods, inspired by work from neuroscience on detecting model-based behavior in humans and animals. Our metric based on this setup, the Local Change Adaptation (LoCA) regret, measures how quickly an RL method adapts to a local change in the environment. Our metric can identify model-based behavior even if the method uses a poor representation, and provides insight into how close a method's behavior is to optimal model-based behavior. We use our setup to evaluate the model-based behavior of MuZero on a variation of the classic Mountain Car task.
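The abstract's core idea, measuring how quickly an agent adapts after a local change to the environment, can be sketched as a two-phase evaluation loop. The code below is a hedged illustration only: the agent interface (`train_episode`, `evaluate_episode`) and the exact regret computation are placeholder assumptions of ours, not the paper's definition of the LoCA regret, which is specified precisely for its two-task setup.

```python
# Hypothetical sketch of a LoCA-style evaluation loop (not the authors' code).
# Phase 1: train the agent until (approximate) convergence on the original task.
# Phase 2: apply a local change to the environment, then accumulate the agent's
# shortfall relative to the optimal post-change return while it adapts.

def loca_style_regret(agent, env_original, env_changed, optimal_return,
                      pretrain_episodes=100, adapt_episodes=20):
    """Illustrative only: `agent` is assumed to expose train_episode(env)
    and evaluate_episode(env) -> episodic return; both names are placeholders."""
    for _ in range(pretrain_episodes):
        agent.train_episode(env_original)        # converge on the original task
    regret = 0.0
    for _ in range(adapt_episodes):
        agent.train_episode(env_changed)         # adapt online to the local change
        ret = agent.evaluate_episode(env_changed)
        regret += max(0.0, optimal_return - ret)  # shortfall vs. optimal behavior
    return regret
```

An agent with strong model-based behavior would update its plan almost immediately after the local change, keeping the accumulated shortfall small; a purely model-free agent typically needs many episodes to relearn, accumulating a larger regret.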
