Paper Title
TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?
Paper Authors
Paper Abstract
We investigate whether Jacobi preconditioning, accounting for the bootstrap term in temporal difference (TD) learning, can help boost the performance of adaptive optimizers. Our method, TDprop, computes a per-parameter learning rate based on the diagonal preconditioning of the TD update rule. We show how this can be used in both $n$-step returns and TD($\lambda$). Our theoretical findings demonstrate that including this additional preconditioning information is, surprisingly, comparable to normal semi-gradient TD if the optimal learning rate is found for both via a hyperparameter search. In Deep RL experiments using Expected SARSA, TDprop meets or exceeds the performance of Adam in all tested games under near-optimal learning rates, but a well-tuned SGD can yield similar improvements -- matching our theory. Our findings suggest that Jacobi preconditioning may improve upon typical adaptive optimization methods in Deep RL but, despite incorporating additional information from the TD bootstrap term, may not always be better than SGD.
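The abstract does not spell out the update rule, so the following is a minimal sketch of what a Jacobi (diagonally) preconditioned TD(0) step could look like for a linear value function. The function name `tdprop_step`, the Adam-style exponential moving average `D` of squared diagonal Jacobian terms, and all hyperparameter defaults are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def tdprop_step(theta, phi_s, phi_s_next, reward, gamma, done, D,
                alpha=1e-3, beta=0.999, eps=1e-8):
    """One diagonally (Jacobi) preconditioned TD(0) step for a linear
    value function V(s) = theta @ phi(s).

    `D` is a caller-maintained EMA of squared diagonal Jacobian terms
    (hypothetical optimizer state, analogous to Adam's second moment);
    all hyperparameter defaults are illustrative.
    """
    v_s = theta @ phi_s
    v_next = 0.0 if done else theta @ phi_s_next
    delta = reward + gamma * v_next - v_s        # TD error

    grad = delta * phi_s                         # semi-gradient TD direction
    # Diagonal of the Jacobian of the TD update w.r.t. theta; unlike
    # Adam's squared-gradient statistics, it retains the bootstrap term
    # gamma * grad V(s') that semi-gradient TD ignores.
    bootstrap = 0.0 if done else gamma * phi_s_next
    jac_diag = (bootstrap - phi_s) * phi_s

    D = beta * D + (1.0 - beta) * jac_diag**2    # EMA of squared diagonal
    theta = theta + alpha * grad / (np.sqrt(D) + eps)
    return theta, D, delta
```

A caller would initialize `D = np.zeros_like(theta)` and thread it through every transition, mirroring how Adam carries its second-moment estimate across steps.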