Paper Title
Qualitative Differences Between Evolutionary Strategies and Reinforcement Learning Methods for Control of Autonomous Agents
Paper Authors
Paper Abstract
In this paper we analyze the qualitative differences between evolutionary strategies and reinforcement learning algorithms by focusing on two popular state-of-the-art algorithms: the OpenAI-ES evolutionary strategy and the Proximal Policy Optimization (PPO) reinforcement learning algorithm, the most similar methods of the two families. We analyze how the methods differ with respect to: (i) general efficacy, (ii) ability to cope with sparse rewards, (iii) propensity/capacity to discover minimal solutions, (iv) dependency on reward shaping, and (v) ability to cope with variations of the environmental conditions. The analysis of the performance and of the behavioral strategies displayed by the agents trained with the two methods on benchmark problems enables us to demonstrate qualitative differences not identified in previous studies, to identify the relative weaknesses of the two methods, and to propose ways to ameliorate some of those weaknesses. We show that the characteristics of the reward function have a strong impact that varies qualitatively not only between OpenAI-ES and PPO but also across alternative reinforcement learning algorithms, thus demonstrating the importance of tailoring the characteristics of the reward function to the algorithm used.
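For context on the first of the two algorithms compared, below is a minimal sketch of the perturbation-based gradient estimate at the core of OpenAI-ES (Salimans et al., 2017). The function name, the hyperparameter values, and the `fitness_fn` callback are illustrative placeholders, not the paper's implementation, and refinements such as mirrored sampling and weight decay are omitted.

```python
import numpy as np

def openai_es_step(theta, fitness_fn, pop_size=50, sigma=0.1, alpha=0.01):
    """One parameter update in the style of OpenAI-ES (Salimans et al., 2017).

    `fitness_fn` is a hypothetical callback returning the episodic return
    of the policy parameterized by its argument.
    """
    # Sample Gaussian perturbations of the current parameter vector.
    eps = np.random.randn(pop_size, theta.size)
    # Evaluate the fitness of each perturbed policy.
    returns = np.array([fitness_fn(theta + sigma * e) for e in eps])
    # Centered-rank normalization of the returns, which reduces
    # sensitivity to the scale of the reward.
    ranks = returns.argsort().argsort()
    weights = ranks / (pop_size - 1) - 0.5
    # Estimate the gradient of expected fitness and take an ascent step.
    grad = weights @ eps / (pop_size * sigma)
    return theta + alpha * grad
```

The rank normalization shown here is one reason evolutionary strategies respond differently to the shape and scale of the reward function than gradient-based methods like PPO, which is directly relevant to the reward-shaping comparison summarized in the abstract.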