Paper Title


Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms

Paper Authors

Tengyu Xu, Yingbin Liang

Paper Abstract


Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. In the policy evaluation setting, it can model the linear and nonlinear temporal difference learning with gradient correction (TDC) algorithms as linear SA and nonlinear SA, respectively. In the policy optimization setting, two timescale nonlinear SA can also model the greedy gradient-Q (Greedy-GQ) algorithm. In previous studies, non-asymptotic analyses of linear TDC and Greedy-GQ were carried out in the Markovian setting with diminishing or accuracy-dependent stepsizes. For the nonlinear TDC algorithm, only asymptotic convergence has been established. In this paper, we study the non-asymptotic convergence rates of two timescale linear and nonlinear TDC and Greedy-GQ under Markovian sampling and with an accuracy-independent constant stepsize. For linear TDC, we provide a novel non-asymptotic analysis and show that it attains an $ε$-accurate solution with the optimal sample complexity of $\mathcal{O}(ε^{-1}\log(1/ε))$ under a constant stepsize. For nonlinear TDC and Greedy-GQ, we show that both algorithms attain an $ε$-accurate stationary solution with sample complexity $\mathcal{O}(ε^{-2})$. This is the first non-asymptotic convergence result established for nonlinear TDC under Markovian sampling, and our result for Greedy-GQ improves upon the previous result by a factor of $\mathcal{O}(ε^{-1}\log(1/ε))$.
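
For context, the two-timescale structure referenced in the abstract can be illustrated with the standard linear TDC update (TD learning with gradient correction): the main parameter is updated on the slow timescale while an auxiliary parameter tracks a correction term on the fast timescale. The sketch below is a minimal, generic rendering of that update under linear function approximation; the function and variable names (e.g. `linear_tdc_step`, `alpha`, `beta`) are illustrative and not taken from the paper, and the constant stepsizes merely mirror the constant-stepsize regime analyzed in the paper.

```python
import numpy as np

def linear_tdc_step(theta, w, phi_s, phi_s_next, reward, gamma, alpha, beta):
    """One two-timescale linear TDC update (illustrative sketch).

    theta: main value-function parameter (slow timescale, stepsize alpha)
    w:     auxiliary correction parameter (fast timescale, stepsize beta)
    phi_s, phi_s_next: feature vectors of the current and next state
    """
    # TD error under the current linear value estimate theta^T phi
    delta = reward + gamma * phi_s_next @ theta - phi_s @ theta
    # Slow update of theta, including the gradient-correction term
    theta_new = theta + alpha * (delta * phi_s - gamma * (phi_s @ w) * phi_s_next)
    # Fast update of w, tracking the TD error projected onto the features
    w_new = w + beta * (delta - phi_s @ w) * phi_s
    return theta_new, w_new

# Example usage with random features (purely illustrative):
rng = np.random.default_rng(0)
d = 4
theta, w = np.zeros(d), np.zeros(d)
phi_s, phi_s_next = rng.normal(size=d), rng.normal(size=d)
theta, w = linear_tdc_step(theta, w, phi_s, phi_s_next,
                           reward=1.0, gamma=0.9, alpha=0.01, beta=0.1)
```

In the constant-stepsize regime studied in the paper, the auxiliary stepsize `beta` is typically chosen larger than `alpha`, so that `w` evolves on the faster timescale while `theta` changes slowly.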
