Paper Title
Discriminator Soft Actor Critic without Extrinsic Rewards
Paper Authors
Paper Abstract
It is difficult to imitate well in unknown states from only a small amount of expert and sampled data. Supervised learning methods such as Behavioral Cloning do not require sampled data, but they usually suffer from distribution shift. Methods based on reinforcement learning, such as inverse reinforcement learning and generative adversarial imitation learning (GAIL), can learn from only a small amount of expert data, but they typically require many interactions with the environment. Soft Q Imitation Learning (SQIL) addressed these problems and was shown to learn efficiently by combining Behavioral Cloning and soft Q-learning with constant rewards. To make this algorithm more robust to distribution shift, we propose Discriminator Soft Actor Critic (DSAC). It uses a reward function based on adversarial inverse reinforcement learning (AIRL) instead of constant rewards. We evaluated it on PyBullet environments with only four expert trajectories.
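To illustrate the key mechanism the abstract describes, the following is a minimal sketch, not the authors' implementation, of replacing SQIL's constant rewards (1 for expert transitions, 0 for sampled ones) with a discriminator-based, AIRL-style learned reward. The network architecture, function names, and sizes are assumptions made for illustration.

```python
# Minimal sketch (assumed names/architecture, not the paper's code): a
# discriminator over (state, action) pairs whose output defines the reward
# used in place of SQIL's constant rewards.
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Scores (state, action) pairs; trained to separate expert from policy data."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Returns the logit of D(s, a).
        return self.net(torch.cat([obs, act], dim=-1))


def discriminator_loss(disc, expert_obs, expert_act, policy_obs, policy_act):
    """Binary cross-entropy: expert transitions labeled 1, policy transitions 0."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(expert_obs, expert_act)
    policy_logits = disc(policy_obs, policy_act)
    return (bce(expert_logits, torch.ones_like(expert_logits)) +
            bce(policy_logits, torch.zeros_like(policy_logits)))


def learned_reward(disc, obs, act):
    """AIRL-style reward log D - log(1 - D); with D = sigmoid(logit),
    this simplifies to the raw discriminator logit."""
    with torch.no_grad():
        return disc(obs, act).squeeze(-1)
```

Compared with SQIL's fixed 1/0 rewards, a learned reward of this form still pushes the agent toward expert-like transitions but remains informative in states not covered by the expert data, which is the robustness-to-distribution-shift motivation stated in the abstract.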