Paper Title


Knowledge-Assisted Deep Reinforcement Learning in 5G Scheduler Design: From Theoretical Framework to Implementation

Authors

Zhouyou Gu, Changyang She, Wibowo Hardjawana, Simon Lumb, David McKechnie, Todd Essery, Branka Vucetic

Abstract


In this paper, we develop a knowledge-assisted deep reinforcement learning (DRL) algorithm to design wireless schedulers in fifth-generation (5G) cellular networks with time-sensitive traffic. Since the scheduling policy is a deterministic mapping from channel and queue states to scheduling actions, it can be optimized by using deep deterministic policy gradient (DDPG). We show that a straightforward implementation of DDPG converges slowly, has poor quality-of-service (QoS) performance, and cannot be implemented in real-world 5G systems, which are non-stationary in general. To address these issues, we propose a theoretical DRL framework, where theoretical models from wireless communications are used to formulate a Markov decision process in DRL. To reduce the convergence time and improve the QoS of each user, we design a knowledge-assisted DDPG (K-DDPG) that exploits expert knowledge of the scheduler design problem, such as the knowledge of the QoS, the target scheduling policy, and the importance of each training sample, determined by the approximation error of the value function and the number of packet losses. Furthermore, we develop an architecture for online training and inference, where K-DDPG initializes the scheduler off-line and then fine-tunes the scheduler online to handle the mismatch between off-line simulations and non-stationary real-world systems. Simulation results show that our approach reduces the convergence time of DDPG significantly and achieves better QoS than existing schedulers (reducing packet losses by 30%~50%). Experimental results show that with off-line initialization, our approach achieves better initial QoS than random initialization, and the online fine-tuning converges in a few minutes.
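The abstract states that K-DDPG weights the importance of each training sample by the approximation error of the value function and the number of packet losses. The sketch below shows one way such a priority could drive sampling from a replay buffer; it is a minimal illustration based on the abstract only, and the class name, the `alpha` and `loss_weight` parameters, and the buffer interface are assumptions, not the authors' implementation.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Illustrative replay buffer whose sample priorities combine the two
    signals named in the abstract: value-function approximation error
    (TD error) and the number of packet losses in a transition."""

    def __init__(self, capacity, alpha=0.6, loss_weight=1.0):
        self.capacity = capacity
        self.alpha = alpha              # how strongly priorities shape sampling (assumed)
        self.loss_weight = loss_weight  # assumed trade-off between the two signals
        self.storage, self.priorities = [], []

    def add(self, transition, td_error, packet_losses):
        # Importance of a sample = |value-function approximation error|
        # plus a weighted count of packet losses observed in that transition.
        priority = abs(td_error) + self.loss_weight * packet_losses
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
            self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        # Sample transitions with probability proportional to priority^alpha.
        p = np.array(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.storage), size=batch_size, p=p)
        return [self.storage[i] for i in idx]
```

Under this sketch, transitions with large value-function error or many packet losses are replayed more often, which is one plausible reading of how the expert knowledge of sample importance could speed up DDPG training.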
