Paper Title

Behavior Estimation from Multi-Source Data for Offline Reinforcement Learning

Paper Authors

Guoxi Zhang, Hisashi Kashima

Paper Abstract

Offline reinforcement learning (RL) has received rising interest due to its appealing data efficiency. The present study addresses behavior estimation, a task that lays the foundation of many offline RL algorithms. Behavior estimation aims to estimate the policy with which the training data were generated. In particular, this work considers a scenario where the data are collected from multiple sources. In this case, by neglecting data heterogeneity, existing approaches for behavior estimation suffer from behavior misspecification. To overcome this drawback, the present study proposes a latent variable model to infer a set of policies from the data, which allows an agent to use, as its behavior policy, the policy that best describes a particular trajectory. This model provides an agent with a fine-grained characterization of multi-source data and helps it overcome behavior misspecification. This work also proposes a learning algorithm for this model and illustrates its practical use by extending an existing offline RL algorithm. Lastly, through extensive evaluation, this work confirms the existence of behavior misspecification and the efficacy of the proposed model.
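
The core idea described in the abstract is to treat each trajectory as having been generated by one of several unknown behavior policies and to infer both the policies and the per-trajectory assignments from data. Below is a minimal, hypothetical sketch of that idea, not the authors' actual model or learning algorithm: it fits a mixture of K tabular policies with EM over (state, action) trajectories, where the per-trajectory responsibilities play the role of the latent variable that picks the policy best describing each trajectory. The function name, the tabular parameterization, and all sizes (n_states, n_actions, K) are illustrative assumptions.

# Sketch only: mixture-of-policies behavior estimation with a per-trajectory latent
# variable, fit by EM over tabular policies. Not the paper's implementation.
import numpy as np

def fit_mixture_behavior(trajectories, n_states, n_actions, K=3, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Mixture weights and K tabular policies: policies[k, s, a] = P(a | s, policy k).
    weights = np.full(K, 1.0 / K)
    policies = rng.dirichlet(np.ones(n_actions), size=(K, n_states))

    for _ in range(n_iters):
        # E-step: responsibility of each policy for each trajectory (log domain).
        log_resp = np.zeros((len(trajectories), K))
        for i, traj in enumerate(trajectories):
            for k in range(K):
                log_lik = sum(np.log(policies[k, s, a] + 1e-12) for s, a in traj)
                log_resp[i, k] = np.log(weights[k] + 1e-12) + log_lik
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate mixture weights and per-policy action frequencies.
        weights = resp.mean(axis=0)
        counts = np.full((K, n_states, n_actions), 1e-3)  # small prior for stability
        for i, traj in enumerate(trajectories):
            for s, a in traj:
                counts[:, s, a] += resp[i]
        policies = counts / counts.sum(axis=2, keepdims=True)

    return weights, policies, resp

# Example: trajectories from two synthetic deterministic behavior policies.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    trajs = [[(s, int(s % 2 == j)) for s in rng.integers(0, 5, size=20)]
             for j in (0, 1) for _ in range(10)]
    w, pis, resp = fit_mixture_behavior(trajs, n_states=5, n_actions=2, K=2)
    print("mixture weights:", np.round(w, 3))

In an offline RL pipeline, the policy with the highest responsibility for a trajectory would then serve as that trajectory's behavior policy, for example when computing a behavior-regularization term, instead of a single policy fit to all trajectories jointly.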
