Paper Title


SIBRE: Self Improvement Based REwards for Adaptive Feedback in Reinforcement Learning

Authors

Somjit Nath, Richa Verma, Abhik Ray, Harshad Khadilkar

Abstract

We propose a generic reward shaping approach for improving the rate of convergence in reinforcement learning (RL), called Self Improvement Based REwards, or SIBRE. The approach is designed for use in conjunction with any existing RL algorithm, and consists of rewarding improvement over the agent's own past performance. We prove that SIBRE converges in expectation under the same conditions as the original RL algorithm. The reshaped rewards help discriminate between policies when the original rewards are weakly discriminated or sparse. Experiments on several well-known benchmark environments with different RL algorithms show that SIBRE converges to the optimal policy faster and more stably. We also perform sensitivity analysis with respect to hyper-parameters, in comparison with baseline RL algorithms.
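To make the core idea concrete, below is a minimal Python sketch of SIBRE-style reward reshaping: the agent's terminal reward is replaced by its improvement over a running estimate of its own past performance. This is inferred from the abstract only; the wrapper name `SIBREWrapper`, the exponential-moving-average threshold update, and the learning rate `eta` are illustrative assumptions, not the authors' exact formulation.

```python
import gymnasium as gym


class SIBREWrapper(gym.Wrapper):
    """Sketch of SIBRE-style reward reshaping (hypothetical implementation).

    At the end of each episode, the shaped reward is the episode return
    minus a running threshold tracking the agent's past performance.
    """

    def __init__(self, env, eta=0.1):
        super().__init__(env)
        self.eta = eta            # threshold step size (assumed hyper-parameter)
        self.threshold = 0.0      # running estimate of past episode returns
        self.episode_return = 0.0

    def reset(self, **kwargs):
        self.episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.episode_return += reward
        shaped = 0.0
        if terminated or truncated:
            # Reward improvement over the agent's own past performance.
            shaped = self.episode_return - self.threshold
            # Move the threshold toward the latest episode return.
            self.threshold += self.eta * (self.episode_return - self.threshold)
        return obs, shaped, terminated, truncated, info
```

Because the wrapper only intercepts rewards, it can be composed with any existing RL algorithm, e.g. `SIBREWrapper(gym.make("CartPole-v1"))`, consistent with the abstract's claim that SIBRE is designed for use in conjunction with any RL method.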
