Paper Title

FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance Metric Learning and Behavior Regularization

Paper Authors

Lanqing Li, Rui Yang, Dijun Luo

Paper Abstract

We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interactions with the environments, making RL truly practical in many real-world applications. This problem is still not fully understood, and two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors on out-of-distribution state-actions, which lead to divergence of value functions. Second, meta-RL requires efficient and robust task inference learned jointly with the control policy. In this work, we enforce behavior regularization on the learned policy as a general approach to offline RL, combined with a deterministic context encoder for efficient task inference. We propose a novel negative-power distance metric on a bounded context embedding space, whose gradient propagation is detached from the Bellman backup. We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches involving meta-RL and distance metric learning. To the best of our knowledge, our method is the first model-free and end-to-end OMRL algorithm, which is computationally efficient and demonstrated to outperform prior algorithms on several meta-RL benchmarks.
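
To make the "negative-power distance metric" idea concrete, below is a minimal sketch (in PyTorch, which the abstract does not specify) of one plausible form of such a contrastive loss over context embeddings: same-task pairs are pulled together with a squared distance, while different-task pairs are pushed apart with an inverse-power term. The function name, the power n, and the constants eps and beta are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def negative_power_dml_loss(z, task_ids, n=2, eps=1e-1, beta=1.0):
    """Hypothetical sketch of a negative-power distance-metric loss.

    z        : (B, d) batch of context embeddings (assumed bounded, e.g. via tanh)
    task_ids : (B,) integer task label for each embedding
    Same-task pairs are penalized by their squared distance; different-task
    pairs by beta / (||z_i - z_j||^n + eps), whose gradient pushes them apart
    while eps keeps the term finite when embeddings nearly coincide.
    """
    diff = z.unsqueeze(0) - z.unsqueeze(1)            # (B, B, d) pairwise differences
    sq_dist = (diff ** 2).sum(-1)                     # (B, B) squared distances

    same = (task_ids.unsqueeze(0) == task_ids.unsqueeze(1)).float()
    mask = 1.0 - torch.eye(len(z), device=z.device)   # exclude self-pairs

    pull = (same * mask * sq_dist).sum() / (same * mask).sum().clamp(min=1)
    push = ((1 - same) * mask * beta / (sq_dist ** (n / 2) + eps)).sum() \
           / ((1 - same) * mask).sum().clamp(min=1)
    return pull + push

# Illustrative usage: 8 transitions drawn from 4 tasks.
z = torch.tanh(torch.randn(8, 5))                     # bounded context embeddings
task_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = negative_power_dml_loss(z, task_ids)
```

Consistent with the abstract's statement that gradient propagation is detached from the Bellman backup, in a setup like this the context encoder would presumably be trained only through such a metric loss, with detached embeddings (e.g. z.detach()) passed to the actor and critic.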
