Title

Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data

Authors

Allen Nie, Yannis Flet-Berliac, Deon R. Jordan, William Steenbergen, Emma Brunskill

Abstract

Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.
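The core idea highlighted in the abstract, averaging off-policy value estimates over several random train/validation splits before picking an algorithm-hyperparameter pair, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `train_policy` and `estimate_value` are hypothetical stand-ins for an offline RL learner and an off-policy evaluation (OPE) estimator, and the split scheme is plain random subsampling.

```python
import random

def repeated_split_selection(dataset, candidates, train_policy, estimate_value,
                             n_splits=5, train_frac=0.8, seed=0):
    """Pick the best (name, hyperparameter) candidate by averaging
    validation-set value estimates over several random data splits,
    then retrain the winner on the full dataset for deployment.

    dataset        -- list of logged trajectories (or transitions)
    candidates     -- list of (name, hyperparameter) pairs to compare
    train_policy   -- callable(train_data, hyperparameter) -> policy
    estimate_value -- callable(policy, validation_data) -> float (OPE score)
    """
    rng = random.Random(seed)
    scores = {name: 0.0 for name, _ in candidates}
    for _ in range(n_splits):
        data = dataset[:]
        rng.shuffle(data)                      # fresh random split each round
        cut = int(train_frac * len(data))
        train, valid = data[:cut], data[cut:]
        for name, hyper in candidates:
            policy = train_policy(train, hyper)
            scores[name] += estimate_value(policy, valid) / n_splits
    best = max(scores, key=scores.get)
    # Retrain the selected candidate on all available data before deployment,
    # since the dataset is small and every trajectory matters.
    best_hyper = dict(candidates)[best]
    return best, train_policy(dataset, best_hyper), scores
```

Averaging over `n_splits` splits is what distinguishes this from a single hold-out comparison: with limited data, any one split can make a weak candidate look strong, and the mean score is a more reliable selection signal.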
