Paper Title
Multi-task Representation Learning with Stochastic Linear Bandits
Paper Authors
Paper Abstract
We study the problem of transfer learning in the setting of stochastic linear bandit tasks. We consider that a low-dimensional linear representation is shared across the tasks, and study the benefit of learning this representation in the multi-task learning setting. Following recent results on the design of stochastic bandit policies, we propose an efficient greedy policy based on trace norm regularization. It implicitly learns a low-dimensional representation by encouraging the matrix formed by the task regression vectors to be of low rank. Unlike previous work in the literature, our policy does not need to know the rank of the underlying matrix. We derive an upper bound on the multi-task regret of our policy, which is, up to logarithmic factors, of order $\sqrt{NdT(T+d)r}$, where $T$ is the number of tasks, $r$ the rank, $d$ the number of variables, and $N$ the number of rounds per task. We show the benefit of our strategy compared to the baseline $Td\sqrt{N}$ obtained by solving each task independently. We also provide a lower bound on the multi-task regret. Finally, we corroborate our theoretical findings with preliminary experiments on synthetic data.
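To make the comparison with the independent-task baseline concrete, one can take the ratio of the two rates quoted above (logarithmic factors dropped):
\[
\frac{\sqrt{N d T (T+d)\, r}}{T d \sqrt{N}}
  = \sqrt{\frac{(T+d)\, r}{T d}}
  = \sqrt{r\left(\frac{1}{d} + \frac{1}{T}\right)},
\]
so the multi-task bound improves on the baseline whenever $r(1/d + 1/T) < 1$, and for many tasks ($T \gg d$) the saving factor approaches $\sqrt{r/d}$, which is substantial when the rank $r$ is much smaller than the dimension $d$.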
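As a rough illustration of the kind of policy the abstract describes, here is a minimal sketch. It assumes a proximal-gradient (singular-value soft-thresholding) solver for the trace-norm-regularized least-squares step and a purely greedy arm choice; the function names (`trace_norm_estimator`, `greedy_action`) and hyperparameters (`lam`, `step`, `n_iters`) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a trace-norm-regularized greedy policy for
# multi-task linear bandits. Names and hyperparameters are assumptions.
import numpy as np

def trace_norm_estimator(X_list, y_list, lam, n_iters=300, step=0.1):
    """Estimate the d x T matrix of task regression vectors by
    trace-norm-regularized least squares, via proximal gradient
    descent (the prox of the trace norm soft-thresholds the
    singular values, which encourages a low-rank solution)."""
    d, T = X_list[0].shape[1], len(X_list)
    Theta = np.zeros((d, T))
    for _ in range(n_iters):
        grad = np.zeros_like(Theta)
        for t in range(T):
            X, y = X_list[t], y_list[t]
            if len(y):  # skip tasks with no observations yet
                grad[:, t] = X.T @ (X @ Theta[:, t] - y) / len(y)
        Theta -= step * grad
        # Proximal step: soft-threshold the singular values.
        U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
        Theta = (U * np.maximum(s - step * lam, 0.0)) @ Vt
    return Theta

def greedy_action(arms, theta_t):
    """Greedy arm choice for one task: maximize estimated reward."""
    return max(arms, key=lambda x: float(x @ theta_t))

# Tiny synthetic check: T tasks whose regression vectors share a
# rank-r structure, matching the setting described in the abstract.
rng = np.random.default_rng(0)
d, T, r, n = 6, 8, 2, 50
B = rng.standard_normal((d, r)) @ rng.standard_normal((r, T))
X_list = [rng.standard_normal((n, d)) for _ in range(T)]
y_list = [X @ B[:, t] + 0.1 * rng.standard_normal(n)
          for t, X in enumerate(X_list)]
Theta_hat = trace_norm_estimator(X_list, y_list, lam=0.5)
# The tail singular values of the estimate should be near zero.
print(np.linalg.svd(Theta_hat, compute_uv=False))
```

In an online run, each round would append the chosen arm and observed reward to the corresponding task's `X_list[t]` and `y_list[t]` before re-estimating; the sketch above shows only the estimation and arm-selection steps.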