Paper Title


Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

Authors

Mengfan Xu, Diego Klabjan

Abstract


We study the challenging exploration incentive problem in both bandit and reinforcement learning, where the rewards are scale-free and potentially unbounded, driven by real-world scenarios and differing from existing work. Past works in reinforcement learning either assume costly interactions with an environment or propose algorithms finding potentially low quality local maxima. Motivated by EXP-type methods that integrate multiple agents (experts) for exploration in bandits with the assumption that rewards are bounded, we propose new algorithms, namely EXP4.P and EXP4-RL for exploration in the unbounded reward case, and demonstrate their effectiveness in these new settings. Unbounded rewards introduce challenges as the regret cannot be limited by the number of trials, and selecting suboptimal arms may lead to infinite regret. Specifically, we establish EXP4.P's regret upper bounds in both bounded and unbounded linear and stochastic contextual bandits. Surprisingly, we also find that by including one sufficiently competent expert, EXP4.P can achieve global optimality in the linear case. This unbounded reward result is also applicable to a revised version of EXP3.P in the Multi-armed Bandit scenario. In EXP4-RL, we extend EXP4.P from bandit scenarios to reinforcement learning to incentivize exploration by multiple agents, including one high-performing agent, for both efficiency and excellence. This algorithm has been tested on difficult-to-explore games and shows significant improvements in exploration compared to state-of-the-art.
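For orientation, below is a minimal sketch of the standard bounded-reward EXP4.P update (Beygelzimer et al., 2011) that the paper builds on. The `experts` and `reward_fn` callbacks, the parameter names, and the safety clamp on the exploration floor are illustrative assumptions; the paper's modifications for unbounded rewards and the EXP4-RL extension are not reproduced here.

```python
import numpy as np

def exp4p(T, K, experts, reward_fn, delta=0.1, rng=None):
    """Minimal EXP4.P sketch for K arms and N experts over T rounds,
    assuming rewards lie in [0, 1].

    experts(t)        -> (N, K) array; each row is one expert's probability
                         distribution over arms (hypothetical callback).
    reward_fn(t, arm) -> observed reward for the pulled arm (hypothetical callback).
    """
    rng = rng or np.random.default_rng()
    N = experts(0).shape[0]
    w = np.ones(N)                                        # expert weights
    # Exploration floor; clamped at 1/K so the mixture stays a distribution.
    pmin = min(np.sqrt(np.log(max(N, 2)) / (K * T)), 1.0 / K)
    for t in range(T):
        xi = experts(t)                                   # (N, K) expert advice
        # Weighted mixture of expert advice plus uniform exploration.
        p = (1 - K * pmin) * (w @ xi) / w.sum() + pmin
        arm = rng.choice(K, p=p)
        r = reward_fn(t, arm)                             # reward in [0, 1]
        # Importance-weighted reward estimate for each arm.
        r_hat = np.zeros(K)
        r_hat[arm] = r / p[arm]
        y_hat = xi @ r_hat                                # estimated gain per expert
        v_hat = (xi / p).sum(axis=1)                      # variance proxy per expert
        # High-probability exponential weight update.
        w *= np.exp((pmin / 2) *
                    (y_hat + v_hat * np.sqrt(np.log(N / delta) / (K * T))))
    return w
```

The per-expert variance proxy `v_hat` in the update is what upgrades EXP4's expected-regret guarantee to EXP4.P's high-probability bound; the paper's contribution is to extend this style of analysis to scale-free, potentially unbounded rewards and to a reinforcement learning variant (EXP4-RL).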
