Paper Title

Transformers are Meta-Reinforcement Learners

Paper Authors

Melo, Luckeciano C.

Paper Abstract

The transformer architecture and variants presented remarkable success across many machine learning tasks in recent years. This success is intrinsically related to the capability of handling long sequences and the presence of context-dependent weights from the attention mechanism. We argue that these capabilities suit the central role of a Meta-Reinforcement Learning algorithm. Indeed, a meta-RL agent needs to infer the task from a sequence of trajectories. Furthermore, it requires a fast adaptation strategy to adapt its policy for a new task -- which can be achieved using the self-attention mechanism. In this work, we present TrMRL (Transformers for Meta-Reinforcement Learning), a meta-RL agent that mimics the memory reinstatement mechanism using the transformer architecture. It associates the recent past of working memories to build an episodic memory recursively through the transformer layers. We show that the self-attention computes a consensus representation that minimizes the Bayes Risk at each layer and provides meaningful features to compute the best actions. We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL presents comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments.
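As a rough illustration of the mechanism the abstract describes (and not the authors' released implementation), the sketch below embeds each recent transition as a "working memory", runs the sequence through a causally masked transformer encoder, and reads the action distribution off the most recent token, mirroring the idea of recursively building an episodic memory from the recent past. All names and hyperparameters here (e.g. `TrMRLPolicySketch`, `context_len`, the Gaussian policy head) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TrMRLPolicySketch(nn.Module):
    """Minimal sketch of a transformer-based meta-RL policy.

    Each timestep's (observation, previous action, previous reward) tuple is
    embedded as a working memory; stacked transformer layers then build an
    episodic representation over the recent context, and the last token's
    output parameterizes the action distribution.
    """

    def __init__(self, obs_dim, act_dim, d_model=128, n_layers=4,
                 n_heads=4, context_len=25):
        super().__init__()
        # Working-memory encoder: embed (obs, prev action, prev reward).
        self.embed = nn.Linear(obs_dim + act_dim + 1, d_model)
        self.pos = nn.Parameter(torch.zeros(context_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mean_head = nn.Linear(d_model, act_dim)        # Gaussian mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent std

    def forward(self, obs, prev_act, prev_rew):
        # obs: (B, T, obs_dim), prev_act: (B, T, act_dim), prev_rew: (B, T, 1)
        x = self.embed(torch.cat([obs, prev_act, prev_rew], dim=-1))
        T = x.size(1)
        x = x + self.pos[:T]
        # Causal mask: each working memory attends only to itself and the past.
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.encoder(x, mask=mask)
        # The most recent token summarizes the inferred task and current state.
        mean = self.mean_head(h[:, -1])
        return torch.distributions.Normal(mean, self.log_std.exp())
```

The causal mask and last-token readout are one plausible way to realize the "infer the task from a sequence of trajectories" behavior: the agent conditions its action on everything it has experienced so far in the current context window, so adaptation to a new task happens purely through attention over recent transitions rather than through gradient updates at test time.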
