Paper Title

Boosting Offline Reinforcement Learning via Data Rebalancing

Authors

Yang Yue, Bingyi Kang, Xiao Ma, Zhongwen Xu, Gao Huang, Shuicheng Yan

Abstract

Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. To address this problem, existing works mainly focus on designing sophisticated algorithms to explicitly or implicitly constrain the learned policy to be close to the behavior policy. The constraint applies not only to well-performing actions but also to inferior ones, which limits the performance upper bound of the learned policy. Instead of aligning the densities of two distributions, aligning the supports gives a relaxed constraint while still being able to avoid out-of-distribution actions. Therefore, we propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged. More specifically, we construct a better behavior policy by resampling each transition in an old dataset according to its episodic return. We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time. Extensive experiments demonstrate that ReD is effective at boosting offline RL performance and orthogonal to decoupling strategies in long-tailed classification. New state-of-the-arts are achieved on the D4RL benchmark.
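The abstract describes ReD as resampling each transition according to the return of the episode it belongs to, which reweights the data while leaving the dataset's support unchanged. Below is a minimal sketch of what such return-based resampling could look like; the function name return_based_resample, the min-max weighting with a small floor eps, and the toy data are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def return_based_resample(returns, num_samples=None, eps=1e-3, rng=None):
    """Resample transition indices with probability increasing in episodic return.

    `returns` holds, for every transition, the return of the episode it came
    from. The weighting scheme (min-max normalization plus a small floor
    `eps`) is an illustrative assumption, not necessarily the paper's.
    """
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(returns, dtype=np.float64)
    # Min-max normalize so weights are non-negative; eps keeps every
    # transition reachable, preserving the dataset's support.
    weights = (returns - returns.min()) / (returns.max() - returns.min() + 1e-8) + eps
    probs = weights / weights.sum()
    n = len(returns) if num_samples is None else num_samples
    # Sample with replacement: the rebalanced dataset has the same support
    # as the original one, but high-return transitions appear more often.
    return rng.choice(len(returns), size=n, replace=True, p=probs)


# Example: rebalance a toy dataset of 5 transitions from 2 episodes.
episodic_returns = [10.0, 10.0, 10.0, 2.0, 2.0]
indices = return_based_resample(episodic_returns, num_samples=8)
print(indices)  # high-return transitions (indices 0-2) appear more often
```

Sampling with replacement keeps the empirical support identical to the original dataset, which is the property the abstract highlights: the constraint is relaxed from density matching to support matching while still avoiding out-of-distribution actions.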
