Paper Title
Evolving Pareto-Optimal Actor-Critic Algorithms for Generalizability and Stability
Paper Authors
Paper Abstract
Generalizability and stability are two key objectives for operating reinforcement learning (RL) agents in the real world. Designing RL algorithms that optimize these objectives can be a costly and painstaking process. This paper presents MetaPG, an evolutionary method for automated design of actor-critic loss functions. MetaPG explicitly optimizes for generalizability and performance, and implicitly optimizes the stability of both metrics. We initialize our loss function population with Soft Actor-Critic (SAC) and perform multi-objective optimization using fitness metrics encoding single-task performance, zero-shot generalizability to unseen environment configurations, and stability across independent runs with different random seeds. On a set of continuous control tasks from the Real-World RL Benchmark Suite, we find that our method, using a single environment during evolution, evolves algorithms that improve upon SAC's performance and generalizability by 4% and 20%, respectively, and reduce instability by up to 67%. Then, we scale up to more complex environments from the Brax physics simulator and replicate generalizability tests encountered in practical settings, such as different friction coefficients. MetaPG evolves algorithms that can obtain 10% better generalizability without loss of performance within the same meta-training environment and obtain similar results to SAC when doing cross-domain evaluations in other Brax environments. The evolution results are interpretable; by analyzing the structure of the best algorithms, we identify elements that help optimize certain objectives, such as regularization terms for the critic loss.
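At a high level, the abstract describes a multi-objective evolutionary loop over loss-function candidates, seeded with SAC and selected by Pareto dominance across three fitness scores. The Python sketch below illustrates the general shape of such a loop; it is not the authors' implementation, and the names evaluate, mutate, and the fitness tuple layout are hypothetical placeholders.

    import random

    def dominates(a, b):
        # True if fitness vector a Pareto-dominates b: at least as good on
        # every objective and strictly better on at least one.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(scored):
        # scored: list of (candidate, fitness) pairs. Keep only candidates
        # whose fitness is not dominated by any other candidate's fitness.
        return [(c, f) for (c, f) in scored
                if not any(dominates(g, f) for (_, g) in scored if g is not f)]

    def evolve(initial_loss, evaluate, mutate, generations=200):
        # Seed the population with a known-good loss (the paper starts from SAC).
        # evaluate(candidate) must return a tuple of fitness scores, e.g.
        # (performance, zero-shot generalizability, cross-seed stability);
        # mutate(candidate) must return a perturbed copy of the candidate.
        scored = [(initial_loss, evaluate(initial_loss))]
        for _ in range(generations):
            parent, _ = random.choice(scored)        # pick a surviving parent
            child = mutate(parent)                   # perturb the loss candidate
            scored.append((child, evaluate(child)))
            scored = pareto_front(scored)            # multi-objective selection
        return scored                                # approximate Pareto-optimal set

    # Toy usage: candidates are coefficient pairs standing in for loss functions.
    front = evolve(
        initial_loss=(1.0, 0.0),
        evaluate=lambda c: (-(c[0] - 1) ** 2, -(c[1] + 1) ** 2, -abs(c[0] * c[1])),
        mutate=lambda c: (c[0] + random.gauss(0, 0.1), c[1] + random.gauss(0, 0.1)),
    )

In a real run, evaluate would be the expensive step (each call trains agents across seeds and perturbed environment configurations to produce the three scores named in the abstract), and the population size would be bounded; both details are omitted here for brevity.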
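For context on the closing remark about critic-loss regularization: SAC's critic minimizes a squared soft-Bellman error, and an evolved variant can be read as that loss plus extra terms. A minimal sketch, where the lambda-weighted penalty is an illustrative placeholder (an assumption, not the specific term the search discovered):

    J_Q(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[ \tfrac{1}{2}\Big( Q_\theta(s,a) - \big( r + \gamma\,(\min_{i=1,2} Q_{\bar\theta_i}(s',a') - \alpha \log \pi_\phi(a' \mid s')) \big) \Big)^{2} \right] + \lambda\, \mathbb{E}\left[ Q_\theta(s,a)^2 \right], \qquad a' \sim \pi_\phi(\cdot \mid s')

The first term is the standard SAC critic objective with clipped double Q-learning and entropy-regularized targets; the second term shows where a regularizer of the kind the abstract mentions would attach to the critic loss.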