Paper Title

Model-based Adversarial Meta-Reinforcement Learning

Paper Authors

Zichuan Lin, Garrett Thomas, Guangwen Yang, Tengyu Ma

Paper Abstract

Meta-reinforcement learning (meta-RL) aims to learn from multiple training tasks the ability to adapt efficiently to unseen test tasks. Despite their success, existing meta-RL algorithms are known to be sensitive to task distribution shift. When the test task distribution is different from the training task distribution, the performance may degrade significantly. To address this issue, this paper proposes Model-based Adversarial Meta-Reinforcement Learning (AdMRL), where we aim to minimize the worst-case sub-optimality gap -- the difference between the optimal return and the return that the algorithm achieves after adaptation -- across all tasks in a family of tasks, with a model-based approach. We propose a minimax objective and optimize it by alternating between learning the dynamics model on a fixed task and finding the adversarial task for the current model -- the task for which the policy induced by the model is maximally suboptimal. Assuming the family of tasks is parameterized, we derive a formula for the gradient of the suboptimality with respect to the task parameters via the implicit function theorem, and show how the gradient estimator can be efficiently implemented by the conjugate gradient method and a novel use of the REINFORCE estimator. We evaluate our approach on several continuous control benchmarks and demonstrate its efficacy in the worst-case performance over all tasks, in the generalization power to out-of-distribution tasks, and in training- and test-time sample efficiency, over existing state-of-the-art meta-RL algorithms.
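For concreteness, the worst-case sub-optimality objective described in the abstract can be written schematically as follows; the notation (task parameter $\psi$, learned dynamics $\hat{T}$, value $V_\psi^\pi$) is chosen here for illustration and may differ from the paper's exact formulation:

\[
\min_{\hat{T}} \; \max_{\psi \in \Psi} \; \Big( V_{\psi}^{\pi^{*}_{\psi}} \;-\; V_{\psi}^{\pi_{\hat{T},\psi}} \Big),
\]

where $\pi^{*}_{\psi}$ denotes the optimal policy for task $\psi$ under the true dynamics and $\pi_{\hat{T},\psi}$ denotes the policy obtained by adapting with the learned dynamics model $\hat{T}$ on task $\psi$. The algorithm alternates between the outer minimization (fitting the dynamics model on data from a fixed task) and the inner maximization (ascending the task-parameter gradient of the sub-optimality, obtained via the implicit function theorem, to find an adversarial task).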
