Paper Title
Learning to Control under Time-Varying Environment
Paper Authors
Paper Abstract
This paper investigates the problem of regret minimization in linear time-varying (LTV) dynamical systems. Due to the simultaneous presence of uncertainty and non-stationarity, designing online control algorithms for unknown LTV systems remains a challenging task. At the cost of NP-hard offline planning, prior works have introduced online convex optimization algorithms, although they suffer from a nonparametric regret rate. In this paper, we propose the first computationally tractable online algorithm with regret guarantees that avoids offline planning over state linear feedback policies. Our algorithm is based on the optimism in the face of uncertainty (OFU) principle, in which we optimistically select the best model within a high-confidence region. Our algorithm is therefore more explorative than previous approaches. To overcome non-stationarity, we propose either a restarting strategy (R-OFU) or a sliding-window strategy (SW-OFU). With proper configuration, our algorithm attains sublinear regret $O(T^{2/3})$. These algorithms utilize data from the current phase to track variations in the system dynamics. We corroborate our theoretical findings with numerical experiments, which highlight the effectiveness of our methods. To the best of our knowledge, our study establishes the first model-based online algorithm with regret guarantees for LTV dynamical systems.
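To make the sliding-window idea concrete, below is a minimal NumPy sketch of the estimation step behind an SW-OFU-style method: the LTV dynamics $x_{t+1} = A_t x_t + B_t u_t + w_t$ are fit by regularized least squares over only the most recent $W$ transitions, and an optimistic candidate model is then drawn from the resulting confidence ellipsoid. This is an illustrative sketch under stated assumptions, not the paper's actual algorithm: the toy drifting system, the window size, the radius `beta`, and the `optimistic_perturbation` helper (which samples from the ellipsoid rather than optimizing a cost over it) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sliding_window_estimate(xs, us, W, lam=1.0):
    """Regularized least-squares estimate of Theta = [A, B] using only
    the last W transitions, so stale data from a drifted system is
    forgotten. Returns the estimate and the Gram matrix V, whose inverse
    scales the confidence ellipsoid around Theta."""
    T = len(us)                                   # number of observed transitions
    start = max(0, T - W)
    Z = np.hstack([xs[start:T], us[start:T]])     # regressors z_t = [x_t; u_t]
    Y = xs[start + 1:T + 1]                       # targets x_{t+1}
    V = lam * np.eye(Z.shape[1]) + Z.T @ Z        # regularized design matrix
    Theta = np.linalg.solve(V, Z.T @ Y).T         # ridge solution, shape (n, n + m)
    return Theta, V

def optimistic_perturbation(Theta, V, beta):
    """Crude, hypothetical stand-in for the OFU step: sample a candidate
    model from the confidence ellipsoid around Theta. The paper's methods
    select the optimistic candidate by optimization; here we only
    illustrate that the region shrinks as V accumulates data."""
    L = np.linalg.cholesky(np.linalg.inv(V))      # square root of V^{-1} (up to rotation)
    E = rng.standard_normal(Theta.shape)
    E *= beta / max(np.linalg.norm(E), 1e-12)     # scale noise to the ellipsoid boundary
    return Theta + E @ L.T

# Toy LTV system: A_t drifts slowly, B is fixed (all values hypothetical).
n, m, W, T = 2, 1, 50, 300
B = np.array([[0.0], [1.0]])
xs, us = [np.zeros(n)], []
for t in range(T):
    A_t = np.array([[1.0, 0.1],
                    [0.0, 0.9 + 0.05 * np.sin(2 * np.pi * t / T)]])
    u_t = 0.1 * rng.standard_normal(m)            # pure exploration input for the demo
    us.append(u_t)
    xs.append(A_t @ xs[-1] + B @ u_t + 0.01 * rng.standard_normal(n))

xs, us = np.asarray(xs), np.asarray(us)
Theta_hat, V = sliding_window_estimate(xs, us, W)
Theta_opt = optimistic_perturbation(Theta_hat, V, beta=0.1)
print("windowed estimate of [A, B]:\n", np.round(Theta_hat, 3))
```

A restarting variant in the spirit of R-OFU would instead discard all past data and reset `V` at fixed intervals; either mechanism keeps the estimator anchored to the current phase of the dynamics, which is what the abstract refers to as tracking variations in the system.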