Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous Actions

Paper Title

Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous Actions

Authors

Antonio Terpin, Nicolas Lanzetti, Batuhan Yardim, Florian Dörfler, Giorgia Ramponi

Abstract

Policy Optimization (PO) algorithms have been proven particularly suited to handle the high-dimensionality of real-world continuous control tasks. In this context, Trust Region Policy Optimization methods represent a popular approach to stabilize the policy updates. These usually rely on the Kullback-Leibler (KL) divergence to limit the change in the policy. The Wasserstein distance represents a natural alternative, in place of the KL divergence, to define trust regions or to regularize the objective function. However, state-of-the-art works either resort to its approximations or do not provide an algorithm for continuous state-action spaces, reducing the applicability of the method. In this paper, we explore optimal transport discrepancies (which include the Wasserstein distance) to define trust regions, and we propose a novel algorithm - Optimal Transport Trust Region Policy Optimization (OT-TRPO) - for continuous state-action spaces. We circumvent the infinite-dimensional optimization problem for PO by providing a one-dimensional dual reformulation for which strong duality holds. We then analytically derive the optimal policy update given the solution of the dual problem. This way, we bypass the computation of optimal transport costs and of optimal transport maps, which we implicitly characterize by solving the dual formulation. Finally, we provide an experimental evaluation of our approach across various control tasks. Our results show that optimal transport discrepancies can offer an advantage over state-of-the-art approaches.
