Paper Title
Improving TD3-BC: Relaxed Policy Constraint for Offline Learning and Stable Online Fine-Tuning
Paper Authors
Paper Abstract
The ability to discover optimal behaviour from fixed datasets has the potential to transfer the successes of reinforcement learning (RL) to domains where data collection is acutely problematic. In this offline setting, a key challenge is overcoming overestimation bias for actions not present in the data, which, without the ability to correct it via interaction with the environment, can propagate and compound during training, leading to highly sub-optimal policies. One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC), which encourages agents to pick actions closer to the source data. By finding the right balance between RL and BC, such approaches have been shown to be surprisingly effective while requiring minimal changes to the underlying algorithms they are based on. To date, this balance has been held constant, but in this work we explore the idea of tipping it towards RL following initial training. Using TD3-BC, we demonstrate that by continuing to train a policy offline while reducing the influence of the BC component, we can produce refined policies that outperform the original baseline, as well as match or exceed the performance of more complex alternatives. Furthermore, we demonstrate that such an approach can be used for stable online fine-tuning, allowing policies to be safely improved during deployment.
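For concreteness, the sketch below illustrates the kind of actor update the abstract describes: a TD3-BC style policy loss that combines an RL (critic-maximisation) term with a behavioural-cloning penalty, together with a schedule that relaxes the BC weight after an initial phase of offline training. This is a minimal illustrative sketch, not the paper's implementation; the module interfaces, schedule shape, and hyperparameter values (alpha, warmup_steps, decay_steps) are assumptions.

```python
# Minimal sketch (illustrative, not the authors' code) of a TD3-BC style
# actor loss with a behavioural-cloning term whose weight can be relaxed
# after initial offline training.
import torch


def actor_loss(actor, critic, states, actions, alpha=2.5, bc_scale=1.0):
    """Policy loss balancing an RL term against a BC (MSE) term.

    bc_scale = 1.0 corresponds to the standard fixed balance; annealing
    bc_scale towards 0 tips the balance towards pure RL.
    """
    pi = actor(states)                      # actions proposed by the policy
    q = critic(states, pi)                  # critic's value of those actions
    lam = alpha / q.abs().mean().detach()   # batch normalisation of the RL term, as in TD3-BC
    rl_term = -lam * q.mean()               # maximise Q (minimise negative Q)
    bc_term = ((pi - actions) ** 2).mean()  # stay close to dataset actions
    return rl_term + bc_scale * bc_term


# Illustrative relaxation schedule (assumed values): keep the full constraint
# for an initial offline training phase, then linearly reduce the BC weight.
def bc_weight(step, warmup_steps=1_000_000, decay_steps=250_000):
    if step < warmup_steps:
        return 1.0
    return max(0.0, 1.0 - (step - warmup_steps) / decay_steps)
```

Passing bc_scale=1.0 throughout recovers the fixed RL/BC balance used as the baseline, while feeding bc_weight(step) into actor_loss realises the idea of tipping the balance towards RL after initial training, whether offline or during online fine-tuning.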