Paper Title

Disaggregated Accelerator Management System for Cloud Data Centers

Authors

Takano, Ryousei; Suzaki, Kuniyasu

Abstract

A conventional data center built from monolithic servers faces limitations including lack of operational flexibility, low resource utilization, and low maintainability. Resource disaggregation is a promising way to address these issues. We propose the concept of a disaggregated cloud data center architecture called Flow-in-Cloud (FiC), which enables an existing cluster computer system to expand its accelerator pool through a high-speed network. FlowOS-RM manages the entire resource pool and deploys each user job on a slice that is dynamically constructed according to the user's request. A slice consists of compute nodes and accelerators, with each accelerator attached to its corresponding compute node. This paper demonstrates the feasibility of FiC in a proof-of-concept experiment that runs a distributed deep learning application on a prototype system. The results confirm the applicability of the proposed system.
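The abstract describes slice construction only at a conceptual level: a resource manager holds a shared pool of compute nodes and network-attached accelerators and, for each user request, builds a slice in which every accelerator is attached to one compute node. The Python sketch below models that bookkeeping under stated assumptions. The class and method names (`ResourcePool`, `Slice`, `allocate_slice`, `release_slice`) and the first-fit allocation policy are hypothetical illustrations, not FlowOS-RM's actual interface or scheduling policy, and the network attachment itself is represented only as a mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Slice:
    """Hypothetical slice: maps each compute node to the accelerators attached to it."""
    job_id: str
    attachments: dict[str, list[str]] = field(default_factory=dict)


class ResourcePool:
    """Hypothetical disaggregated pool of compute nodes and network-attached accelerators.

    Assumption: a simple first-fit policy; the real FlowOS-RM policy is not described here.
    """

    def __init__(self, nodes: list[str], accelerators: list[str]) -> None:
        self.free_nodes = list(nodes)
        self.free_accels = list(accelerators)
        self.slices: dict[str, Slice] = {}

    def allocate_slice(self, job_id: str, num_nodes: int, accels_per_node: int) -> Slice:
        """Build a slice for a user request, or raise if the pool cannot satisfy it."""
        needed = num_nodes * accels_per_node
        if num_nodes > len(self.free_nodes) or needed > len(self.free_accels):
            raise RuntimeError(f"insufficient resources for job {job_id}")
        s = Slice(job_id)
        for _ in range(num_nodes):
            node = self.free_nodes.pop(0)
            # Attach accels_per_node accelerators from the shared pool to this node.
            s.attachments[node] = [self.free_accels.pop(0) for _ in range(accels_per_node)]
        self.slices[job_id] = s
        return s

    def release_slice(self, job_id: str) -> None:
        """Detach every node and accelerator in the slice and return them to the pool."""
        s = self.slices.pop(job_id)
        for node, accels in s.attachments.items():
            self.free_nodes.append(node)
            self.free_accels.extend(accels)


pool = ResourcePool(["node0", "node1", "node2"], ["acc0", "acc1", "acc2", "acc3"])
dl_slice = pool.allocate_slice("dl-job", num_nodes=2, accels_per_node=2)
print(dl_slice.attachments)  # {'node0': ['acc0', 'acc1'], 'node1': ['acc2', 'acc3']}
pool.release_slice("dl-job")
```

Keeping attachments as a per-node mapping mirrors the abstract's statement that each accelerator belongs to one compute node, and releasing a slice returns its resources to the shared pool so that later requests can reuse them.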