Paper Title
LobsDICE: Offline Learning from Observation via Stationary Distribution Correction Estimation
Paper Authors
Paper Abstract
We consider the problem of learning from observation (LfO), in which the agent aims to mimic the expert's behavior from state-only demonstrations provided by the expert. We additionally assume that the agent cannot interact with the environment but has access to action-labeled transition data collected by agents of unknown quality. This offline setting for LfO is appealing in many real-world scenarios where ground-truth expert actions are inaccessible and arbitrary environment interaction is costly or risky. In this paper, we present LobsDICE, an offline LfO algorithm that learns to imitate the expert policy via optimization in the space of stationary distributions. Our algorithm solves a single convex minimization problem, which minimizes the divergence between the state-transition distributions induced by the expert and by the agent policy. Through an extensive set of offline LfO tasks, we show that LobsDICE outperforms strong baseline methods.
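To make the abstract's "single convex minimization problem over stationary distributions" concrete, below is a minimal sketch of a DICE-style KL-dual objective in PyTorch. It is not the paper's exact formulation: the network sizes, the toy state dimensionality `STATE_DIM`, the helper `dice_dual_loss`, and the placeholder `log_ratio` input (standing in for a learned log-ratio of expert vs. offline state-transition densities, e.g. from a discriminator) are all illustrative assumptions.

```python
# Hedged sketch of a DICE-style convex dual objective: learn a value-like
# function nu whose KL-dual loss corresponds to the divergence between the
# expert's and the agent's state-transition distributions.
# Names, shapes, and the log-ratio stand-in are assumptions for illustration,
# not LobsDICE's exact objective.

import torch
import torch.nn as nn

GAMMA = 0.99
STATE_DIM = 4  # assumed toy dimensionality

# nu: a V-like function over states, the dual variable of the convex program.
nu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(nu.parameters(), lr=3e-4)

def dice_dual_loss(s0, s, s_next, log_ratio):
    """KL-dual DICE loss:
        (1 - gamma) * E[nu(s0)]
        + log E_offline[exp(log_ratio + gamma * nu(s') - nu(s))].
    `log_ratio` approximates log d^E(s, s') / d^O(s, s'), assumed to come
    from a discriminator trained on expert vs. offline state transitions."""
    e = log_ratio + GAMMA * nu(s_next).squeeze(-1) - nu(s).squeeze(-1)
    initial_term = (1.0 - GAMMA) * nu(s0).mean()
    # logsumexp - log(N) equals the log of the empirical mean of exp(e),
    # computed in a numerically stable way.
    residual_term = torch.logsumexp(e, dim=0) - torch.log(
        torch.tensor(float(len(e)))
    )
    return initial_term + residual_term

# One gradient step on random placeholder batches (offline-data stand-ins).
s0 = torch.randn(32, STATE_DIM)
s, s_next = torch.randn(256, STATE_DIM), torch.randn(256, STATE_DIM)
log_ratio = torch.randn(256)  # stand-in for learned discriminator output

loss = dice_dual_loss(s0, s, s_next, log_ratio)
opt.zero_grad()
loss.backward()
opt.step()
```

In DICE-style methods generally, once such a dual problem is solved, `exp(e)` recovers a stationary distribution correction ratio between the expert-induced and offline distributions, which can then reweight the offline, action-labeled data (e.g., for weighted behavior cloning) so the agent can be trained entirely offline.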