Recurrent Neural-Linear Posterior Sampling for Nonstationary Contextual Bandits

Paper Title

Recurrent Neural-Linear Posterior Sampling for Nonstationary Contextual Bandits

Authors

Aditya Ramesh, Paulo Rauber, Michelangelo Conserva, Jürgen Schmidhuber

Abstract

An agent in a nonstationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a nonstationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and noncontextual nonstationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional nonstationary bandit algorithms. Although it is very difficult to provide theoretical performance guarantees for our new approach, we also prove a novel regret bound for linear posterior sampling with measurement error that may serve as a foundation for future theoretical work.
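The combination described in the abstract, recurrent features fed to a linear posterior sampling bandit, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Gaussian prior with known noise variance and an independent Bayesian linear model per action, and it uses a fixed, untrained tanh recurrence as a stand-in for the recurrent network. The names `recurrent_features`, `LinearPosteriorSampler`, `W_h`, and `W_x` are hypothetical.

```python
import numpy as np


def recurrent_features(h_prev, x, W_h, W_x):
    """One step of a plain tanh recurrence (hypothetical stand-in for a
    trained recurrent network): folds the newest observation x into the
    running summary h_prev of the interaction history."""
    return np.tanh(W_h @ h_prev + W_x @ x)


class LinearPosteriorSampler:
    """Linear posterior (Thompson) sampling with one Bayesian linear
    regression per action. Assumes a Gaussian prior N(0, prior_var * I)
    and Gaussian reward noise with known variance noise_var."""

    def __init__(self, n_actions, dim, prior_var=1.0, noise_var=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise_var = noise_var
        # Posterior precision matrix and precision-weighted mean per action.
        self.precision = np.stack([np.eye(dim) / prior_var for _ in range(n_actions)])
        self.b = np.zeros((n_actions, dim))

    def posterior_mean(self, action):
        return np.linalg.inv(self.precision[action]) @ self.b[action]

    def select(self, features):
        """Sample one weight vector per action from its posterior and act
        greedily with respect to the sampled weights."""
        sampled_rewards = []
        for precision, b in zip(self.precision, self.b):
            cov = np.linalg.inv(precision)
            weights = self.rng.multivariate_normal(cov @ b, cov)
            sampled_rewards.append(weights @ features)
        return int(np.argmax(sampled_rewards))

    def update(self, action, features, reward):
        """Conjugate Gaussian update for the chosen action only."""
        self.precision[action] += np.outer(features, features) / self.noise_var
        self.b[action] += features * reward / self.noise_var
```

In an interaction loop, the agent would call `recurrent_features` on each new observation (for example, the previous action and reward), pass the resulting vector to `select`, and then call `update` with the observed reward. The sketch omits training the recurrent network, so the features here carry no learned structure.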
