Off-Policy Evaluation via the Regularized Lagrangian

Paper Title

Off-Policy Evaluation via the Regularized Lagrangian

Authors

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans

Abstract

The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data. While these estimators all perform some form of stationary distribution correction, they arise from different derivations and objective functions. In this paper, we unify these estimators as regularized Lagrangians of the same linear program. The unification allows us to expand the space of DICE estimators to new alternatives that demonstrate improved performance. More importantly, by analyzing the expanded space of estimators both mathematically and empirically we find that dual solutions offer greater flexibility in navigating the tradeoff between optimization stability and estimation bias, and generally provide superior estimates in practice.
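
The primal and dual views mentioned in the abstract can be illustrated on a tiny tabular problem. The sketch below is not one of the DICE estimators from the paper, which work from sampled data. It only checks, on a made-up two-state MDP, that the policy value computed from the Q-function (primal side) matches the value computed from the discounted stationary distribution (dual side), and that reweighting a behavior distribution by the correction ratio gives the same number. Every number here (transition probabilities, rewards, policy, behavior distribution) is an assumption chosen for illustration.

```python
# Toy tabular check of the primal/dual identity behind distribution correction.
# All numbers below are hypothetical; nothing here comes from the paper's experiments.

GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2

# P[s][a][s2]: probability of moving from state s to state s2 under action a.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]           # R[s][a]: reward
PI = [[0.6, 0.4], [0.2, 0.8]]          # PI[s][a]: target policy
MU0 = [0.5, 0.5]                       # initial state distribution
D_BEHAVIOR = [[0.25, 0.25], [0.25, 0.25]]  # assumed off-policy data distribution

STATES, ACTIONS = range(N_STATES), range(N_ACTIONS)


def solve_q(iters=2000):
    """Primal side: iterate Q = R + gamma * P^pi Q to its fixed point."""
    q = [[0.0] * N_ACTIONS for _ in STATES]
    for _ in range(iters):
        v = [sum(PI[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def solve_d(iters=2000):
    """Dual side: iterate the flow equation for the normalized discounted
    state-action distribution d^pi."""
    d = [[0.0] * N_ACTIONS for _ in STATES]
    for _ in range(iters):
        inflow = [sum(P[s][a][s2] * d[s][a] for s in STATES for a in ACTIONS)
                  for s2 in STATES]
        d = [[PI[s2][a2] * ((1 - GAMMA) * MU0[s2] + GAMMA * inflow[s2])
              for a2 in ACTIONS] for s2 in STATES]
    return d


q = solve_q()
d = solve_d()

# Normalized policy value, computed three ways.
rho_primal = (1 - GAMMA) * sum(MU0[s] * PI[s][a] * q[s][a]
                               for s in STATES for a in ACTIONS)
rho_dual = sum(d[s][a] * R[s][a] for s in STATES for a in ACTIONS)

# Correction ratio zeta = d^pi / d^D, then reweight rewards under d^D.
zeta = [[d[s][a] / D_BEHAVIOR[s][a] for a in ACTIONS] for s in STATES]
rho_corrected = sum(D_BEHAVIOR[s][a] * zeta[s][a] * R[s][a]
                    for s in STATES for a in ACTIONS)

print(rho_primal, rho_dual, rho_corrected)
```

In this exact tabular setting the three quantities agree up to floating-point error. The trade-offs the abstract describes arise once the Q-function and the ratio have to be estimated from finite data with function approximation.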
