Paper Title

Minimum information divergence of Q-functions for dynamic treatment regimes

Authors

Eguchi, Shinto

Abstract

This paper presents a new application of information geometry to reinforcement learning, focusing on dynamic treatment regimes. In the standard framework of reinforcement learning, the Q-function for a single-stage situation is defined as the conditional expectation of the reward given a state and an action. We introduce an equivalence relation, called policy equivalence, on the space of all Q-functions. A class of information divergences is defined on the Q-function space at every stage. The main objective is to propose an estimator of the optimal policy function via a minimum information divergence method based on a dataset of trajectories. In particular, we discuss the $γ$-power divergence, which is shown to have the advantageous property that the $γ$-power divergence between policy-equivalent Q-functions vanishes. This property is essential for seeking the optimal policy, which is discussed in the framework of a semiparametric model for the Q-function. Specific choices of the power index $γ$ yield interesting relationships among the value function and the geometric and harmonic means of the Q-function. A numerical experiment demonstrates the performance of the minimum $γ$-power divergence method in the context of dynamic treatment regimes.
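As a quick orientation to the objects named in the abstract, here is a minimal sketch in generic notation; the symbols $S$, $A$, $R$, $\pi^{*}$ and the displayed formulas are illustrative assumptions, not taken verbatim from the paper. For a single-stage problem with state $S$, action $A$, and reward $R$,

$$ Q(s,a) = \mathbb{E}\left[\, R \mid S = s,\, A = a \,\right], \qquad \pi^{*}(s) = \operatorname*{arg\,max}_{a}\, Q(s,a), $$

and two Q-functions may be read as policy-equivalent when they induce the same greedy policy $\pi^{*}$. For reference, the $γ$-power (projective) divergence between two nonnegative functions $f$ and $g$, as it appears in earlier work on robust estimation, is commonly written as

$$ D_{γ}(f,g) = \frac{1}{γ(1+γ)} \log \int f^{1+γ} \;-\; \frac{1}{γ} \log \int f\, g^{γ} \;+\; \frac{1}{1+γ} \log \int g^{1+γ}, $$

which vanishes whenever $g$ is a positive rescaling of $f$; the version defined on Q-function spaces in this paper may differ in detail, so this display should be taken only as a sketch of the divergence family being minimized, not as the paper's exact construction.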
