Paper Title
Masked Autoencoding for Scalable and Generalizable Decision Making
Paper Authors
Paper Abstract
We are interested in learning scalable agents for reinforcement learning that can learn from large-scale, diverse sequential data, similar to current large vision and language models. To this end, this paper presents masked decision prediction (MaskDP), a simple and scalable self-supervised pretraining method for reinforcement learning (RL) and behavioral cloning (BC). In our MaskDP approach, we apply a masked autoencoder (MAE) to state-action trajectories, wherein we randomly mask state and action tokens and reconstruct the missing data. By doing so, the model is required to infer masked-out states and actions and extract information about dynamics. We find that masking different proportions of the input sequence significantly helps with learning a better model that generalizes well to multiple downstream tasks. In our empirical study, we find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching, and it can zero-shot infer skills from a few example transitions. In addition, MaskDP transfers well to offline RL and shows promising scaling behavior w.r.t. model size. It is amenable to data-efficient finetuning, achieving competitive results with prior methods based on autoregressive pretraining.
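To make the pretraining objective concrete, below is a minimal sketch of MAE-style masking over a state-action trajectory: a fraction of tokens is randomly masked, and a reconstruction loss is computed only on the masked positions. This is an illustrative toy in NumPy, not the authors' implementation; the zeroed rows stand in for a learned mask embedding, and `pred` is a placeholder for the autoencoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_trajectory(tokens, mask_ratio, rng):
    """Randomly mask a fraction of state/action tokens (MaskDP-style).

    tokens: (T, D) array of interleaved state/action embeddings.
    Returns the masked sequence (masked rows zeroed) and the boolean mask.
    """
    T = tokens.shape[0]
    n_mask = int(round(mask_ratio * T))
    idx = rng.choice(T, size=n_mask, replace=False)
    mask = np.zeros(T, dtype=bool)
    mask[idx] = True
    masked = tokens.copy()
    masked[mask] = 0.0  # stand-in for a learned [MASK] embedding
    return masked, mask

def reconstruction_loss(pred, target, mask):
    """MSE computed only on masked positions, as in MAE-style training."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

# Toy trajectory: 10 tokens with 4-dim embeddings.
traj = rng.normal(size=(10, 4))
masked, mask = mask_trajectory(traj, mask_ratio=0.5, rng=rng)
pred = masked  # placeholder for the encoder-decoder's reconstruction
loss = reconstruction_loss(pred, traj, mask)
```

In actual pretraining, varying `mask_ratio` across batches (as the abstract notes) exposes the model to both easy and hard inference problems over the trajectory, which is what drives the generalization to goal reaching and skill inference downstream.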