Paper Title
An Evaluation Study of Intrinsic Motivation Techniques applied to Reinforcement Learning over Hard Exploration Environments
Paper Authors
Abstract
In the last few years, research activity around reinforcement learning tasks formulated over environments with sparse rewards has been especially notable. Among the numerous approaches proposed to deal with these hard exploration problems, intrinsic motivation mechanisms are arguably among the most studied alternatives to date. Advances reported in this area over time have tackled the exploration issue by proposing new algorithmic ideas for measuring novelty. However, most efforts in this direction have overlooked the influence of the different design choices and parameter settings that accompany these proposals to improve the effect of the generated intrinsic bonus, neglecting the application of those choices to other intrinsic motivation techniques that might also benefit from them. Furthermore, some of these intrinsic methods are applied with different base reinforcement learning algorithms (e.g., PPO, IMPALA) and neural network architectures, making it hard to fairly compare the reported results and the actual progress contributed by each solution. The goal of this work is to stress this crucial matter in reinforcement learning over hard exploration environments, exposing the variability and susceptibility of state-of-the-art intrinsic motivation techniques to diverse design factors. Ultimately, the experiments reported herein underscore the importance of carefully selecting these design aspects, together with the exploration requirements of the environment and the task in question, under the same setup, so that fair comparisons can be guaranteed.
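To make the generic recipe behind the intrinsic motivation techniques discussed above concrete, the following is a minimal sketch of how an intrinsic bonus is typically combined with the extrinsic reward. The count-based novelty measure, the class name, and the scaling parameter `beta` are illustrative assumptions, not taken from the paper; state-of-the-art methods replace the visit count with learned novelty estimates, but the shaping pattern is the same.

```python
import math
from collections import defaultdict


class CountBasedBonus:
    """Illustrative count-based novelty bonus: r_int = 1 / sqrt(N(s)).

    A hypothetical, minimal instance of the generic intrinsic-motivation
    scheme: the agent's training reward is the extrinsic reward plus a
    scaled novelty term that decays as a state is revisited.
    """

    def __init__(self, beta=0.1):
        self.beta = beta                 # scaling of the intrinsic term
        self.counts = defaultdict(int)   # visit counts N(s)

    def shaped_reward(self, state, extrinsic_reward):
        # Increment the visit count for this state, then compute the
        # novelty bonus, which shrinks with repeated visits.
        self.counts[state] += 1
        r_int = 1.0 / math.sqrt(self.counts[state])
        return extrinsic_reward + self.beta * r_int
```

Note that `beta` is exactly the kind of design choice the abstract warns about: the same novelty mechanism can behave very differently across environments depending on how this scaling (and related settings) is tuned.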