Paper Title
Discovering Intrinsic Reward with Contrastive Random Walk
Paper Authors
Abstract
The aim of this paper is to demonstrate the efficacy of using Contrastive Random Walk as a curiosity method to achieve faster convergence to the optimal policy. Contrastive Random Walk defines the transition matrix of a random walk with the help of neural networks, and it learns a meaningful state representation through a closed loop. The loss of Contrastive Random Walk serves as an intrinsic reward that is added to the environment reward. Our method works well in non-tabular sparse-reward scenarios: compared with other methods, it receives the highest reward within the same number of iterations. Contrastive Random Walk is also more robust, in that its performance does not change much under different random initializations of the environment. We further find that adaptive restart and an appropriate temperature are crucial to the performance of Contrastive Random Walk.
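The idea sketched in the abstract — build a softmax transition matrix over learned state embeddings, walk the closed loop forward and backward, and use the cycle-consistency loss as an intrinsic reward — can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: the embeddings, the L2 normalization, the `temperature` default, and the use of the mean negative log return probability are all stand-ins for whatever the authors' network and hyperparameters actually are.

```python
import numpy as np

def crw_intrinsic_reward(states, temperature=0.1):
    """Sketch of a Contrastive-Random-Walk-style intrinsic reward.

    states: (T, d) array of state embeddings along one trajectory
            (hypothetical stand-in for the paper's learned encoder output).
    Returns a scalar loss; the abstract's method adds this quantity to
    the environment reward as a curiosity bonus.
    """
    # L2-normalize embeddings so similarities are cosine similarities.
    states = states / np.linalg.norm(states, axis=1, keepdims=True)
    # Temperature-scaled pairwise similarities between all states.
    sim = states @ states.T / temperature
    # Row softmax -> stochastic transition matrix of the random walk.
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(sim)
    P = P / P.sum(axis=1, keepdims=True)
    # Closed loop: two steps of the walk; the diagonal of P @ P is the
    # probability of returning to the starting state (cycle consistency).
    round_trip = P @ P
    # Cross-entropy against the "return to start" target serves as the loss,
    # which doubles as the intrinsic reward.
    return -np.log(np.diag(round_trip) + 1e-8).mean()

# Usage: embeddings that are hard to walk back through (poorly separated
# states) yield a higher loss, and hence a larger curiosity bonus.
rng = np.random.default_rng(0)
reward = crw_intrinsic_reward(rng.normal(size=(5, 8)))
```

The temperature controls how peaked the transition matrix is; the abstract's observation that an "appropriate temperature" is crucial corresponds to the fact that a very low temperature makes the walk nearly deterministic (loss near zero everywhere), while a very high one makes it uniform.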