Paper Title

Optimizing for the Future in Non-Stationary MDPs

Paper Authors

Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, Philip S. Thomas

Paper Abstract

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag. To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance. This forecast is obtained by fitting a curve to the counter-factual estimates of policy performance over time, without explicitly modeling the underlying non-stationarity. The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance over some of the data from past episodes can be beneficial when searching for a policy that maximizes future performance. We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques, on three simulated problems motivated by real-world applications.
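
The abstract describes the construction only at a high level. As a rough illustration of how "fit a curve to counterfactual performance estimates, then forecast" reduces to a non-uniform reweighting of past data, here is a minimal sketch. It assumes ordinary importance sampling for the counterfactual estimates, an ordinary least-squares polynomial fit for the curve, and a hypothetical policy.prob(s, a) interface; the paper's own estimator and basis functions may differ.

```python
import numpy as np

# Hypothetical episode format: each episode is a tuple
# (states, actions, behavior_probs, rewards), where behavior_probs[t] is the
# probability the data-collecting policy assigned to actions[t] in states[t].

def counterfactual_returns(episodes, policy):
    """Off-policy (counterfactual) estimates of the current policy's return on
    each past episode, via ordinary importance sampling."""
    estimates = []
    for states, actions, behavior_probs, rewards in episodes:
        # Importance weight: product over steps of pi(a|s) / beta(a|s).
        rho = np.prod([policy.prob(s, a) / b
                       for s, a, b in zip(states, actions, behavior_probs)])
        estimates.append(rho * np.sum(rewards))
    return np.array(estimates)

def forecast_weights(num_past, horizon, degree=2):
    """Weights w such that the least-squares polynomial fit to the points
    (1, J_1), ..., (k, J_k), evaluated at episode k + horizon, equals w @ J.
    Because fit-and-extrapolate is linear in J, the forecast is a fixed,
    non-uniform reweighting of the past estimates; some weights come out
    negative, which is consistent with the observation that performance on
    some past episodes is minimized while maximizing the forecast."""
    X = np.vander(np.arange(1, num_past + 1), degree + 1, increasing=True)
    x_future = np.vander(np.array([num_past + horizon]), degree + 1,
                         increasing=True)
    return (x_future @ np.linalg.pinv(X.T @ X) @ X.T).ravel()

# Forecast of future performance as a function of the policy parameters:
#   J_future(theta) ~ sum_i w_i * J_hat_i(theta),
# and its gradient with respect to theta gives the policy update direction.
```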
