Paper Title
Backward Imitation and Forward Reinforcement Learning via Bi-directional Model Rollouts
Paper Authors
Paper Abstract
Traditional model-based reinforcement learning (RL) methods generate forward rollout traces with a learned dynamics model to reduce interactions with the real environment. Recent model-based RL methods additionally learn a backward model, which specifies the conditional probability of the previous state given the previous action and the current state, in order to generate backward rollout trajectories as well. However, in this type of model-based method, the samples derived from backward rollouts and those from forward rollouts are simply aggregated together to optimize the policy via a model-free RL algorithm, which may reduce both sample efficiency and convergence rate. This is because such an approach ignores the fact that backward rollout traces are usually generated starting from high-value states and are therefore more instructive for the agent when improving its behavior. In this paper, we propose the backward imitation and forward reinforcement learning (BIFRL) framework, in which the agent treats backward rollout traces as expert demonstrations for imitating high-performing behaviors, and then collects forward rollout transitions for policy reinforcement. Consequently, BIFRL enables the agent both to reach and to explore from high-value states in a more efficient manner, and further reduces real-environment interactions, making it potentially more suitable for real-robot learning. Moreover, a value-regularized generative adversarial network is introduced to augment the high-value states that the agent rarely encounters. Theoretically, we provide the condition under which BIFRL is superior to the baseline methods. Experimentally, we demonstrate that BIFRL achieves better sample efficiency and competitive asymptotic performance on various MuJoCo locomotion tasks compared with state-of-the-art model-based methods.
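To make the described training loop concrete, below is a minimal sketch of a BIFRL-style policy update: backward rollouts from high-value states are treated as expert demonstrations for a behavior-cloning (imitation) loss, while forward model rollouts drive a reinforcement objective. Everything here is an illustrative assumption rather than the authors' implementation: the MLP models, the backward action generator (bwd_policy), the toy reward, the horizon H, and the pathwise model-rollout surrogate that stands in for the model-free RL update used in the paper.

```python
# bifrl_sketch.py -- minimal, illustrative sketch of a BIFRL-style update.
# Assumptions (not from the paper): toy reward, MLP models, behaviour cloning
# for the backward-imitation step, and a pathwise model-rollout objective in
# place of the model-free RL algorithm used by the authors.
import torch
import torch.nn as nn

S_DIM, A_DIM, H = 4, 2, 5                      # state dim, action dim, rollout horizon

def mlp(i, o, h=64):
    return nn.Sequential(nn.Linear(i, h), nn.Tanh(), nn.Linear(h, o))

policy     = mlp(S_DIM, A_DIM)                 # a_t = pi(s_t)
value_fn   = mlp(S_DIM, 1)                     # V(s), used to rank candidate start states
fwd_model  = mlp(S_DIM + A_DIM, S_DIM)         # predicts s_{t+1} from (s_t, a_t)
bwd_model  = mlp(S_DIM + A_DIM, S_DIM)         # predicts s_{t-1} from (s_t, a_{t-1})
bwd_policy = mlp(S_DIM, A_DIM)                 # proposes a_{t-1} given s_t (assumed component)

opt_policy = torch.optim.Adam(policy.parameters(), lr=3e-4)

def reward(s, a):                              # toy stand-in for the task reward
    return -(s.pow(2).sum(-1) + 0.1 * a.pow(2).sum(-1))

def high_value_states(states, k=8):
    """Pick the top-k states by estimated value as backward-rollout anchors."""
    with torch.no_grad():
        v = value_fn(states).squeeze(-1)
    return states[v.topk(k).indices]

def backward_rollout(anchors):
    """Roll the backward model out of high-value states; the resulting
    (state, action) pairs are treated as expert demonstrations."""
    demo_s, demo_a, s = [], [], anchors
    for _ in range(H):
        a_prev = bwd_policy(s)
        s_prev = bwd_model(torch.cat([s, a_prev], -1))
        demo_s.append(s_prev)
        demo_a.append(a_prev)
        s = s_prev
    return torch.cat(demo_s).detach(), torch.cat(demo_a).detach()

def bifrl_update(replay_states):
    anchors = high_value_states(replay_states)

    # (1) Backward imitation: behaviour-clone the demonstrations obtained
    #     by rolling the backward model out of high-value states.
    d_s, d_a = backward_rollout(anchors)
    imitation_loss = (policy(d_s) - d_a).pow(2).mean()

    # (2) Forward reinforcement: roll the forward model from the same anchors
    #     and maximise predicted return (pathwise surrogate, an assumption).
    s, ret = anchors, 0.0
    for _ in range(H):
        a = policy(s)
        ret = ret + reward(s, a).mean()
        s = fwd_model(torch.cat([s, a], -1))
    rl_loss = -ret

    opt_policy.zero_grad()
    (imitation_loss + rl_loss).backward()
    opt_policy.step()
    return imitation_loss.item(), rl_loss.item()

if __name__ == "__main__":
    replay_states = torch.randn(256, S_DIM)    # stand-in for states from real interaction
    for _ in range(3):
        print(bifrl_update(replay_states))
```

In a full implementation, the dynamics models, value function, and backward action generator would themselves be trained on real transitions, and the value-regularized GAN mentioned in the abstract would generate additional rarely visited high-value anchor states; both steps are omitted here for brevity.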