Paper Title
Exploration of Systolic-Vector Architecture with Resource Scheduling for Dynamic ML Workloads
Paper Authors
Paper Abstract
As artificial intelligence (AI) and machine learning (ML) technologies disrupt a wide range of industries, cloud datacenters face ever-increasing demand for inference workloads. However, conventional CPU-based servers cannot handle the excessive computational requirements of deep neural network (DNN) models, while GPU-based servers suffer from huge power consumption and high operating cost. In this paper, we present a scalable systolic-vector architecture that can cope with dynamically changing DNN workloads in cloud datacenters. We first devise a lightweight DNN model description format called unified model format (UMF) that enables general model representation and fast decoding in hardware accelerators. Based on this model format, we propose a heterogeneous architecture that features a load balancer performing high-level workload distribution across multiple systolic-vector clusters, in which each cluster consists of a programmable scheduler, throughput-oriented systolic arrays, and function-oriented vector processors. We also propose a heterogeneity-aware scheduling algorithm that enables concurrent execution of multiple DNN workloads while maximizing heterogeneous hardware utilization based on computation and memory access time estimation. Finally, we build an architecture simulation framework based on actual synthesis and place-and-route implementation results and conduct design space exploration for the proposed architecture. As a result, the proposed systolic-vector architecture achieves 10.9x higher throughput performance and 30.17x higher energy efficiency than a comparable GPU on realistic ML workloads. The proposed heterogeneity-aware scheduling algorithm improves throughput and energy efficiency by 81% and 20%, respectively, compared to standard round-robin scheduling.
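The abstract contrasts heterogeneity-aware scheduling (driven by computation and memory access time estimates) with a round-robin baseline. The minimal sketch below illustrates that idea in spirit only; the `Cluster` class, the roofline-style `est_time` estimator, and the greedy earliest-finish assignment are illustrative assumptions, not the paper's actual algorithm or hardware parameters.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Cluster:
    ops_per_s: float     # peak compute throughput (hypothetical)
    bytes_per_s: float   # memory bandwidth (hypothetical)
    busy_until: float = 0.0

def est_time(cluster, ops, nbytes):
    # Roofline-style estimate: a task is bounded by either its
    # compute time or its memory access time, whichever is larger.
    return max(ops / cluster.ops_per_s, nbytes / cluster.bytes_per_s)

def schedule_heterogeneity_aware(tasks, clusters):
    # Greedy: send each task to the cluster with the earliest
    # estimated finish time, so compute-heavy tasks land on
    # compute-strong clusters and memory-heavy tasks on
    # bandwidth-strong ones.
    for ops, nbytes in tasks:
        best = min(clusters,
                   key=lambda c: c.busy_until + est_time(c, ops, nbytes))
        best.busy_until += est_time(best, ops, nbytes)
    return max(c.busy_until for c in clusters)  # makespan

def schedule_round_robin(tasks, clusters):
    # Baseline: ignore heterogeneity, assign tasks cyclically.
    for (ops, nbytes), c in zip(tasks, cycle(clusters)):
        c.busy_until += est_time(c, ops, nbytes)
    return max(c.busy_until for c in clusters)  # makespan
```

With one compute-strong and one bandwidth-strong cluster and a mix of compute-bound and memory-bound tasks, the estimate-driven scheduler avoids the pathological placements that round-robin can produce, which is the intuition behind the reported throughput gain.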