Paper Title
No-regret learning for repeated non-cooperative games with lossy bandits
Paper Authors
Paper Abstract
This paper considers no-regret learning for repeated continuous-kernel games with lossy bandit feedback. Since it is difficult to obtain an explicit model of the utility functions in dynamic environments, the players can only learn from bandit feedback. Moreover, because of unreliable communication channels or privacy protection, the bandit feedback may be lost or dropped at random. Therefore, we study the players' asynchronous online learning strategies for adaptively adjusting their next actions to minimize the long-term regret. The paper provides a novel no-regret learning algorithm, called Online Gradient Descent with lossy bandits (OGD-lb). We first give a regret analysis for concave games with differentiable and Lipschitz utilities. Then we show that the action profile converges to a Nash equilibrium with probability 1 when the game is also strictly monotone. We further provide the mean square convergence rate $\mathcal{O}\left(k^{-2\min\{\beta, 1/6\}}\right)$ when the game is $\beta$-strongly monotone. In addition, we extend the algorithm to the case where the loss probability of the bandit feedback is unknown, and prove its almost sure convergence to a Nash equilibrium for strictly monotone games. Finally, we take resource management in fog computing as an application example and carry out numerical experiments to empirically demonstrate the algorithm's performance.
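The abstract does not spell out the OGD-lb update rule, so the following is only a loose illustration of the general idea it describes: projected online gradient steps driven by one-point bandit utility estimates, with rounds skipped when the feedback is lost. This is a minimal sketch under assumptions; the step size, perturbation radius, projection set, toy utility, and all names below are hypothetical and are not the paper's actual construction or parameter choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_box(x, lo=0.0, hi=1.0):
    """Projection onto a box action set (assumed here for illustration)."""
    return np.clip(x, lo, hi)

def lossy_bandit_ogd(utility, dim, horizon, loss_prob=0.3):
    """Single player's online gradient ascent from lossy one-point bandit feedback.

    utility:   black-box reward u(x), observed only through bandit feedback.
    loss_prob: probability that a round's feedback is dropped (assumed known here).
    """
    x = np.full(dim, 0.5)                 # initial action inside the box
    for k in range(1, horizon + 1):
        eta = 1.0 / k                     # diminishing step size (illustrative)
        delta = k ** (-1.0 / 6.0)         # shrinking perturbation radius (illustrative)
        u = rng.standard_normal(dim)
        u /= np.linalg.norm(u)            # random direction on the unit sphere
        query = project_box(x + delta * u)
        if rng.random() > loss_prob:      # feedback received this round
            reward = utility(query)
            grad_est = (dim / delta) * reward * u   # one-point gradient estimate
            x = project_box(x + eta * grad_est)     # gradient ascent on utility
        # if the feedback is lost, the player simply keeps its current action
    return x

# Toy check: a strongly concave utility maximized at 0.7 in each coordinate.
x_final = lossy_bandit_ogd(lambda a: -np.sum((a - 0.7) ** 2), dim=2, horizon=20000)
print(x_final)
```

In a game setting, each player would run such an update on its own action while the other players' actions enter only through the observed utility value; the unknown-loss-probability extension mentioned in the abstract would additionally require estimating `loss_prob` online.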