You Can't Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments

Authors

Keiran Paster, Sheila McIlraith, Jimmy Ba

Abstract

Recently, methods such as Decision Transformer that reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS) have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks. However, simply conditioning a probabilistic model on a desired return and taking the predicted action can fail dramatically in stochastic environments since trajectories that result in a return may have only achieved that return due to luck. In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution. Rather than simply conditioning on the return of a single trajectory as is standard practice, our proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent from environment stochasticity. Doing so allows ESPER to achieve strong alignment between target return and expected performance in real environments. We demonstrate this in several challenging stochastic offline-RL tasks including the challenging puzzle game 2048, and Connect Four playing against a stochastic opponent. In all tested domains, ESPER achieves significantly better alignment between the target return and achieved return than simply conditioning on returns. ESPER also achieves higher maximum performance than even the value-based baselines.
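The failure mode described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the one-step dataset, the action names, and all function names are hypothetical, and the clusters are simply taken to be the action chosen, whereas ESPER learns its clusters. It only shows why conditioning on a single trajectory's return can select a lucky action, while conditioning on a cluster's average return does not.

```python
from collections import defaultdict

# Hypothetical one-step offline dataset of (action, return) pairs.
# "safe" always pays 1; "risky" pays 5 one time in ten and -5 otherwise.
dataset = [("safe", 1.0)] * 100 + [("risky", 5.0)] * 10 + [("risky", -5.0)] * 90


def return_conditioned_action(data, target):
    """Pick the most frequent action among trajectories whose return equals target.

    Raises ValueError if no trajectory in the data reached the target.
    """
    counts = defaultdict(int)
    for action, ret in data:
        if ret == target:
            counts[action] += 1
    return max(counts, key=counts.get)


def cluster_average_returns(data):
    """Map each cluster to the mean return of its trajectories.

    Assumption: a cluster is just the action taken. ESPER learns its
    clusters instead; this shortcut only works for a toy problem.
    """
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for action, ret in data:
        totals[action] += ret
        sizes[action] += 1
    return {action: totals[action] / sizes[action] for action in totals}


def expected_return(data, action):
    """Empirical mean return of an action in the dataset."""
    returns = [ret for act, ret in data if act == action]
    return sum(returns) / len(returns)


# Conditioning on the best observed return selects the lucky action.
naive_action = return_conditioned_action(dataset, target=5.0)
naive_value = expected_return(dataset, naive_action)

# Conditioning on the best average cluster return selects an action
# whose expected return matches the target.
averages = cluster_average_returns(dataset)
best_target = max(averages.values())
cluster_action = next(a for a, v in averages.items() if v == best_target)
cluster_value = expected_return(dataset, cluster_action)

print(naive_action, naive_value)
print(cluster_action, cluster_value)
```

In this toy setting the return-conditioned choice asks for a return of 5 but the action it selects averages far less, while the cluster-average target and the achieved expected return coincide.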
