Thompson Sampling Achieves $\tilde O(\sqrt{T})$ Regret in Linear Quadratic Control

Paper title

Thompson Sampling Achieves $\tilde O(\sqrt{T})$ Regret in Linear Quadratic Control

Authors

Taylan Kargin, Sahin Lale, Kamyar Azizzadenesheli, Anima Anandkumar, Babak Hassibi

Abstract

Thompson Sampling (TS) is an efficient method for decision-making under uncertainty, where an action is sampled from a carefully prescribed distribution that is updated based on the observed data. In this work, we study the problem of adaptive control of stabilizable linear-quadratic regulators (LQRs) using TS, where the system dynamics are unknown. Previous works have established that $\tilde O(\sqrt{T})$ frequentist regret is optimal for the adaptive control of LQRs. However, the existing methods either work only in restrictive settings, require a priori known stabilizing controllers, or utilize computationally intractable approaches. We propose an efficient TS algorithm for the adaptive control of LQRs, TS-based Adaptive Control (TSAC), that attains $\tilde O(\sqrt{T})$ regret, even for multidimensional systems, thereby solving the open problem posed in Abeille and Lazaric (2018). TSAC does not require an a priori known stabilizing controller and achieves fast stabilization of the underlying system by effectively exploring the environment in the early stages. Our result hinges on developing a novel lower bound on the probability that TS provides an optimistic sample. By carefully prescribing an early exploration strategy and a policy update rule, we show that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs. We empirically demonstrate the performance and efficiency of TSAC in several adaptive control tasks.
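To make the abstract's ingredients concrete (a sampled model, a policy update rule, and early exploration), the following is a minimal sketch of Thompson sampling on a scalar system $x_{t+1} = a x_t + b u_t + w_t$ with unit state and input costs. It is an illustration under stated assumptions and not the TSAC algorithm itself: the function names (`scalar_lqr_gain`, `chol2`, `run_ts`) and the parameters `beta`, `warmup`, `update_every`, and `noise` are hypothetical choices, and the sketch omits any step that restricts samples to stabilizable systems.

```python
import math
import random

def scalar_lqr_gain(a, b, q=1.0, r=1.0, iters=200):
    """Riccati fixed-point iteration for x' = a*x + b*u.
    Returns (p, k) where the control law is u = -k*x."""
    p = q
    for _ in range(iters):
        p = q + a * a * p * r / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return p, k

def chol2(v11, v12, v22):
    """Lower Cholesky factor of a 2x2 symmetric positive definite matrix."""
    l11 = math.sqrt(v11)
    l21 = v12 / l11
    l22 = math.sqrt(max(v22 - l21 * l21, 1e-12))
    return l11, l21, l22

def run_ts(a_true, b_true, T=400, update_every=20, warmup=40,
           beta=0.5, noise=0.1, seed=0):
    """Illustrative TS loop; all tuning values are assumptions."""
    rng = random.Random(seed)
    v11, v12, v22 = 1.0, 0.0, 1.0   # V = I (ridge regularizer)
    s1 = s2 = 0.0                   # running sum of z_t * x_{t+1}, z_t = (x_t, u_t)
    x, k, total_cost = 0.0, 0.0, 0.0
    a_hat = b_hat = 0.0
    for t in range(T):
        if t % update_every == 0:   # policy update rule: resample every few steps
            l11, l21, l22 = chol2(v11, v12, v22)
            # Ridge estimate theta_hat = V^{-1} s via two triangular solves.
            w1 = s1 / l11
            w2 = (s2 - l21 * w1) / l22
            b_hat = w2 / l22
            a_hat = (w1 - l21 * b_hat) / l11
            # Gaussian perturbation with covariance V^{-1}: solve L^T y = eta.
            e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
            y2 = e2 / l22
            y1 = (e1 - l21 * y2) / l11
            # Act optimally for the sampled model.
            _, k = scalar_lqr_gain(a_hat + beta * y1, b_hat + beta * y2)
        u = -k * x
        if t < warmup:
            u += rng.gauss(0, 1)    # extra excitation in the early stage
        total_cost += x * x + u * u
        x_next = a_true * x + b_true * u + noise * rng.gauss(0, 1)
        v11 += x * x
        v12 += x * u
        v22 += u * u
        s1 += x * x_next
        s2 += u * x_next
        x = x_next
    return a_hat, b_hat, total_cost / T
```

Calling `run_ts(0.9, 0.5)` returns the last ridge estimates of $(a, b)$ and the average per-step cost. The perturbation shrinks as the design matrix $V$ grows, so sampled controllers concentrate around the certainty-equivalent one as data accumulates.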

