Paper Title
Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning
Paper Authors
Paper Abstract
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and may need to be further fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods, such as a behavior cloning loss, prevent this to some extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weight the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q-functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Code is available: \url{https://github.com/zhaoyi11/adaptive_bc}.
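To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a TD3+BC-style actor objective whose behavior cloning term is scaled by an adaptive weight, plus a hypothetical rule that loosens the constraint when recent online returns are stable and tightens it when they drop. The function names, the ensemble averaging, and the specific adaptation rule (`decay`, `boost`, the return comparison) are illustrative assumptions, not the paper's exact formulation; consult the linked repository for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def actor_loss(policy, q_ensemble, batch, bc_weight):
    """TD3+BC-style actor objective: maximize the (ensemble-averaged) Q-value
    while staying close to the dataset actions, with the BC term scaled by an
    adaptive weight. `policy` and each member of `q_ensemble` are assumed to be
    callables returning tensors; this is a sketch, not the paper's exact loss."""
    obs, actions = batch["observations"], batch["actions"]
    pi_actions = policy(obs)
    # Average over a randomized ensemble of Q-functions.
    q_values = torch.stack([q(obs, pi_actions) for q in q_ensemble]).mean(0)
    # Normalize the Q magnitude so the BC weight acts on a consistent scale.
    lam = 1.0 / q_values.abs().mean().detach()
    bc_loss = F.mse_loss(pi_actions, actions)
    return -(lam * q_values).mean() + bc_weight * bc_loss

def update_bc_weight(bc_weight, recent_return, moving_avg_return,
                     decay=0.99, boost=1.05, floor=0.0, cap=1.0):
    """Hypothetical adaptation rule: relax the BC constraint while online
    returns are stable or improving, tighten it when performance degrades."""
    if recent_return >= moving_avg_return:
        bc_weight = max(floor, bc_weight * decay)   # stable: loosen the constraint
    else:
        bc_weight = min(cap, bc_weight * boost)     # degrading: tighten the constraint
    return bc_weight
```

In this sketch, `update_bc_weight` would be called once per evaluation or episode during online fine-tuning, so the agent gradually moves from imitation-constrained updates toward pure value-driven improvement as long as training remains stable.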