Paper Title

Reanalysis of Variance Reduced Temporal Difference Learning

Authors

Tengyu Xu, Zhe Wang, Yi Zhou, Yingbin Liang

Abstract

Temporal difference (TD) learning is a popular algorithm for policy evaluation in reinforcement learning, but the vanilla TD can substantially suffer from the inherent optimization variance. A variance reduced TD (VRTD) algorithm was proposed by Korda and La (2015), which applies the variance reduction technique directly to online TD learning with Markovian samples. In this work, we first point out the technical errors in the analysis of VRTD in Korda and La (2015), and then provide a mathematically solid analysis of the non-asymptotic convergence of VRTD and its variance reduction performance. We show that VRTD is guaranteed to converge to a neighborhood of the fixed-point solution of TD at a linear convergence rate. Furthermore, the variance error (for both i.i.d. and Markovian sampling) and the bias error (for Markovian sampling) of VRTD are significantly reduced by the batch size of variance reduction in comparison to those of vanilla TD. As a result, the overall computational complexity of VRTD to attain a given accuracy outperforms that of TD under Markov sampling, and outperforms that of TD under i.i.d. sampling for a sufficiently small condition number.
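
To make the variance-reduction idea concrete, below is a minimal sketch of an SVRG-style variance-reduced TD(0) epoch with linear value-function approximation. It is not the paper's exact VRTD algorithm: the helper names (`td_semi_gradient`, `vrtd_epoch`), the batch/epoch structure, and all hyperparameters are illustrative assumptions. Each epoch averages the TD pseudo-gradient over a batch of transitions at a reference point, and every inner update corrects a single-sample TD step with that reference term, which is what shrinks the per-step update variance relative to vanilla TD.

```python
import numpy as np

def td_semi_gradient(theta, phi_s, reward, phi_next, gamma):
    """TD(0) pseudo-gradient for a linear value estimate V(s) = theta @ phi(s)."""
    td_error = reward + gamma * theta @ phi_next - theta @ phi_s
    return td_error * phi_s

def vrtd_epoch(theta, batch, gamma=0.95, alpha=0.1, inner_steps=None, rng=None):
    """One epoch of an SVRG-style variance-reduced TD update (illustrative sketch).

    `batch` is a list of (phi_s, reward, phi_next) transitions collected from the
    Markov chain; the batch size, step size, and epoch length are placeholder choices.
    """
    rng = np.random.default_rng() if rng is None else rng
    inner_steps = len(batch) if inner_steps is None else inner_steps
    theta_ref = theta.copy()
    # Batch-averaged pseudo-gradient at the reference point: the variance-reduction anchor.
    g_ref = np.mean([td_semi_gradient(theta_ref, *tr, gamma) for tr in batch], axis=0)
    for _ in range(inner_steps):
        phi_s, reward, phi_next = batch[rng.integers(len(batch))]
        g = td_semi_gradient(theta, phi_s, reward, phi_next, gamma)
        g_anchor = td_semi_gradient(theta_ref, phi_s, reward, phi_next, gamma)
        # Single-sample update, minus its value at the reference, plus the batch
        # average: the correction terms reduce the variance of each step.
        theta = theta + alpha * (g - g_anchor + g_ref)
    return theta
```

As a usage note under the same assumptions, one would collect a fresh batch of Markovian transitions per epoch and call `theta = vrtd_epoch(theta, batch)` repeatedly; dropping the `g_anchor` and `g_ref` terms recovers the plain single-sample TD(0) update for comparison.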
