Paper Title

Revisiting Maximum Entropy Inverse Reinforcement Learning: New Perspectives and Algorithms

Paper Authors

Aaron J. Snoswell, Surya P. N. Singh, Nan Ye

Abstract

We provide new perspectives and inference algorithms for Maximum Entropy (MaxEnt) Inverse Reinforcement Learning (IRL), which provides a principled method to find a most non-committal reward function consistent with given expert demonstrations, among many consistent reward functions. We first present a generalized MaxEnt formulation based on minimizing a KL-divergence instead of maximizing an entropy. This improves the previous heuristic derivation of the MaxEnt IRL model (for stochastic MDPs), allows a unified view of MaxEnt IRL and Relative Entropy IRL, and leads to a model-free learning algorithm for the MaxEnt IRL model. Second, a careful review of existing inference algorithms and implementations showed that they approximately compute the marginals required for learning the model. We provide examples to illustrate this, and present an efficient and exact inference algorithm. Our algorithm can handle variable length demonstrations; in addition, while a basic version takes time quadratic in the maximum demonstration length L, an improved version of this algorithm reduces this to linear using a padding trick. Experiments show that our exact algorithm improves reward learning as compared to the approximate ones. Furthermore, our algorithm scales up to a large, real-world dataset involving driver behaviour forecasting. We provide an optimized implementation compatible with the OpenAI Gym interface. Our new insight and algorithms could possibly lead to further interest and exploration of the original MaxEnt IRL model.
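
For concreteness, the KL-divergence formulation mentioned above can be sketched in the standard constrained-optimization form. The baseline distribution $Q$ (e.g. the trajectory distribution induced by the dynamics under a uniform policy) and the feature-matching constraint are the usual MaxEnt IRL ingredients and are stated here as assumptions, not as a verbatim reproduction of the paper's derivation:

```latex
\min_{P}\; \mathrm{KL}\!\left(P \,\middle\|\, Q\right)
\quad\text{s.t.}\quad
\mathbb{E}_{\tau \sim P}\!\left[\phi(\tau)\right]
  = \frac{1}{N}\sum_{i=1}^{N} \phi(\tau_i),
\qquad
\sum_{\tau} P(\tau) = 1,
```

whose solution takes the Gibbs form $P_\theta(\tau) \propto Q(\tau)\,\exp\!\big(\theta^\top \phi(\tau)\big)$. Choosing $Q$ uniform over feasible trajectories recovers plain entropy maximization (the deterministic-MDP case), while letting $Q$ encode the transition dynamics or a baseline policy yields the stochastic-MDP model and the connection to Relative Entropy IRL described in the abstract.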
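
The abstract also distinguishes exact from approximate computation of the marginals needed for learning (the likelihood gradient requires expected feature counts under $P_\theta$). As a hedged illustration of what exact inference looks like in the fixed-horizon, finite-state case, the sketch below computes per-timestep state marginals of a Gibbs trajectory distribution by forward-backward message passing; the function name and the state-only linear reward are illustrative assumptions, and the paper's actual algorithm additionally handles variable-length demonstrations, using a padding trick to reduce the cost from quadratic to linear in the maximum demonstration length L.

```python
import numpy as np

def exact_state_marginals(T, r, L, p0):
    """Exact per-timestep state marginals under the Gibbs trajectory
    distribution P(tau) ∝ Q(tau) * exp(sum_t r[s_t]) for a fixed horizon L.

    T  : (S, S) matrix defining Q, T[s, s'] = baseline prob. of s -> s'
    r  : (S,) reward vector (a linear reward theta^T phi(s) collapses to this)
    L  : horizon, i.e. number of transitions per trajectory
    p0 : (S,) initial state distribution
    """
    w = np.exp(r)                          # per-state potential exp(r(s))
    S = len(r)

    # Forward pass: alpha[t, s] = unnormalized mass of length-t prefixes ending in s.
    alpha = np.zeros((L + 1, S))
    alpha[0] = p0 * w
    for t in range(1, L + 1):
        alpha[t] = (alpha[t - 1] @ T) * w

    # Backward pass: beta[t, s] = unnormalized mass of suffixes starting from s at time t.
    beta = np.ones((L + 1, S))
    for t in range(L - 1, -1, -1):
        beta[t] = T @ (w * beta[t + 1])

    Z = alpha[L].sum()                     # partition function over horizon-L trajectories
    return alpha * beta / Z                # (L+1, S); row t is p(s_t = s)
```

Each row of the returned array sums to one, and weighting state features by these marginals gives the expected feature counts needed for the gradient step; repeating this pass once per demonstration length is what the basic quadratic-time variant amounts to.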
