Paper Title
Masked Autoencoding for Scalable and Generalizable Decision Making
Paper Authors
Paper Abstract
We are interested in learning scalable agents for reinforcement learning that can learn from large-scale, diverse sequential data, similar to current large vision and language models. To this end, this paper presents masked decision prediction (MaskDP), a simple and scalable self-supervised pretraining method for reinforcement learning (RL) and behavioral cloning (BC). In our MaskDP approach, we apply a masked autoencoder (MAE) to state-action trajectories, wherein we randomly mask state and action tokens and reconstruct the missing data. By doing so, the model is required to infer masked-out states and actions and extract information about dynamics. We find that masking different proportions of the input sequence significantly helps with learning a better model that generalizes well to multiple downstream tasks. In our empirical study, we find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching, and it can zero-shot infer skills from a few example transitions. In addition, MaskDP transfers well to offline RL and shows promising scaling behavior w.r.t. model size. It is amenable to data-efficient finetuning, achieving competitive results with prior methods based on autoregressive pretraining.
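To make the pretraining objective concrete, below is a minimal sketch of MAE-style masking over a state-action trajectory: a fraction of tokens is randomly masked, and a reconstruction loss is computed only on the masked positions. This is an illustrative toy in NumPy, not the authors' implementation; the zeroed rows stand in for a learned mask embedding, and `pred` is a placeholder for the autoencoder's output.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_trajectory(tokens, mask_ratio, rng):
    """Randomly mask a fraction of state/action tokens (MaskDP-style).

    tokens: (T, D) array of interleaved state/action embeddings.
    Returns the masked sequence (masked rows zeroed) and the boolean mask.
    """
    T = tokens.shape[0]
    n_mask = int(round(mask_ratio * T))
    idx = rng.choice(T, size=n_mask, replace=False)
    mask = np.zeros(T, dtype=bool)
    mask[idx] = True
    masked = tokens.copy()
    masked[mask] = 0.0  # stand-in for a learned [MASK] embedding
    return masked, mask

def reconstruction_loss(pred, target, mask):
    """MSE computed only on masked positions, as in MAE-style training."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))

# Toy trajectory: 10 tokens with 4-dim embeddings.
traj = rng.normal(size=(10, 4))
masked, mask = mask_trajectory(traj, mask_ratio=0.5, rng=rng)
pred = masked  # placeholder for the encoder-decoder's reconstruction
loss = reconstruction_loss(pred, traj, mask)
```

In actual pretraining, varying `mask_ratio` across batches (as the abstract notes) exposes the model to both easy and hard inference problems over the trajectory, which is what drives the generalization to goal reaching and skill inference downstream.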