Paper Title
Efficient Parallelization of 5G-PUSCH on a Scalable RISC-V Many-core Processor
Paper Authors
Paper Abstract
The disaggregation and softwarization of 5G radio access networks place demanding computational-performance requirements on the processing units. At the physical layer, baseband processing is typically offloaded to specialized hardware accelerators. However, the trend toward software-defined radio access networks demands flexible, programmable architectures. In this paper, we explore the software design, parallelization, and optimization of the key kernels of the lower physical layer (PHY) for physical uplink shared channel (PUSCH) reception on MemPool and TeraPool, two many-core systems with 256 and 1024 small, efficient RISC-V cores, respectively, and a large shared L1 data memory. PUSCH processing is demanding and strictly time-constrained, which makes it a challenge for baseband processors, and it is common to most uplink channels. Our analysis thus generalizes to the entire lower PHY of the uplink receiver at the gNodeB (gNB). Based on an evaluation of the computational effort (in multiply-accumulate operations) required by the PUSCH algorithmic stages, we focus on the parallel implementation of the dominant kernels: the fast Fourier transform, matrix-matrix multiplication, and matrix decomposition kernels for the solution of linear systems. Relative to single-core serial execution, our optimized parallel kernels achieve speedups of 211, 225, and 158 on MemPool and 762, 880, and 722 on TeraPool, at high utilization (0.81, 0.89, and 0.71 on MemPool; 0.74, 0.88, and 0.71 on TeraPool), moving a step closer to a full-software PUSCH implementation.
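To put the reported speedups in context, the short Python sketch below divides each one by the cluster's core count, giving the conventional parallel efficiency. It is an illustration, not part of the paper. The names `CORES`, `KERNELS`, `SPEEDUPS`, `parallel_efficiency`, and `efficiency_table` are hypothetical. The pairing of each number with a kernel assumes the abstract lists values in the same order as it names the kernels. The utilization figures quoted in the abstract are a separate metric whose definition is not given here, so the sketch does not try to reproduce them.

```python
# Core counts and speedups as quoted in the abstract.
CORES = {"MemPool": 256, "TeraPool": 1024}

# Assumption: speedups are listed in the order the kernels are named.
KERNELS = ("FFT", "matrix-matrix multiplication", "matrix decomposition")
SPEEDUPS = {
    "MemPool": (211, 225, 158),
    "TeraPool": (762, 880, 722),
}


def parallel_efficiency(speedup, cores):
    """Return speedup over single-core execution divided by the core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


def efficiency_table():
    """Map each cluster to a {kernel: parallel efficiency} dictionary."""
    return {
        cluster: {
            kernel: parallel_efficiency(speedup, CORES[cluster])
            for kernel, speedup in zip(KERNELS, SPEEDUPS[cluster])
        }
        for cluster in CORES
    }


if __name__ == "__main__":
    for cluster, row in efficiency_table().items():
        for kernel, eff in row.items():
            print(f"{cluster:8s} {kernel:30s} {eff:.2f}")
```

An efficiency of 1.0 would mean the speedup equals the number of cores. Values below 1.0 reflect overheads such as synchronization or contention on the shared memory, although the abstract does not break these down.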