Paper Title
SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training
Paper Authors
Paper Abstract
在当今的生产机器学习(ML)系统中,模型经过不断训练,改进和部署。 ML设计和培训正成为具有动态资源需求的各种任务的连续工作流程。无服务器计算是一种新兴的云范式,可为用户提供透明的资源管理和扩展,并有可能彻底改变ML设计和培训的常规。但是,由于其内在的设计局限性,例如无状态性质,跨功能实例的有限的沟通支持以及功能执行持续时间有限,因此在现有无服务器平台上托管现代ML工作流都有非平凡的挑战。这些局限性导致缺乏用于训练动力学的总体视图和适应机制,并放大了ML工作流中现有问题。 为了应对上述挑战,我们提出了SMLT,这是一个自动化,可扩展和自适应的无服务器框架,以实现高效且以用户为中心的ML设计和培训。 SMLT采用自动化和自适应调度机制来动态优化培训期间ML任务的部署和资源扩展。 SMLT通过支持用户指定的培训截止日期和预算限制,进一步启用了以用户为中心的ML工作流执行。此外,通过提供端到端设计,SMLT解决了无服务器平台中的内在问题,例如通信开销,有限的功能执行持续时间,重复初始化的需求,并为ML培训提供明确的容错公差。 SMLT是开源的,与所有主要ML框架兼容。我们对大型,复杂的现代ML模型的实验评估表明,SMLT的表现优于最先进的VM系统和现有的无服务器ML培训框架,培训速度(最高8倍)和货币成本(最高3倍)
In today's production machine learning (ML) systems, models are continuously trained, improved, and deployed. ML design and training are becoming a continuous workflow of various tasks that have dynamic resource demands. Serverless computing is an emerging cloud paradigm that provides transparent resource management and scaling for users and has the potential to revolutionize the routine of ML design and training. However, hosting modern ML workflows on existing serverless platforms has non-trivial challenges due to their intrinsic design limitations such as stateless nature, limited communication support across function instances, and limited function execution duration. These limitations result in a lack of an overarching view and adaptation mechanism for training dynamics and an amplification of existing problems in ML workflows. To address the above challenges, we propose SMLT, an automated, scalable, and adaptive serverless framework to enable efficient and user-centric ML design and training. SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling for ML tasks during training. SMLT further enables user-centric ML workflow execution by supporting user-specified training deadlines and budget limits. In addition, by providing an end-to-end design, SMLT solves the intrinsic problems in serverless platforms such as the communication overhead, limited function execution duration, need for repeated initialization, and also provides explicit fault tolerance for ML training. SMLT is open-sourced and compatible with all major ML frameworks. Our experimental evaluation with large, sophisticated modern ML models demonstrate that SMLT outperforms the state-of-the-art VM based systems and existing serverless ML training frameworks in both training speed (up to 8X) and monetary cost (up to 3X)
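The abstract highlights user-specified training deadlines and budget limits as inputs to SMLT's adaptive scheduler. The sketch below is a hypothetical illustration only, not SMLT's actual API: the names TrainingJobSpec and plan_parallelism are invented here, and the planner assumes simplistic near-linear speedup, whereas a real adaptive scheduler would also account for communication overhead and observed training dynamics.

```python
# Hypothetical sketch of a user-centric serverless training job
# declared with a deadline and a budget cap; not SMLT's real interface.
import math
from dataclasses import dataclass


@dataclass
class TrainingJobSpec:
    """User-specified constraints that guide adaptive scheduling."""
    deadline_minutes: float    # wall-clock target for the training job
    budget_usd: float          # monetary cap across all function invocations
    max_parallelism: int = 64  # upper bound on concurrent function instances


def plan_parallelism(spec: TrainingJobSpec,
                     est_single_worker_minutes: float,
                     cost_per_worker_minute: float) -> int:
    """Pick a worker count that meets the deadline without exceeding the budget.

    Assumes near-linear speedup for simplicity, so total worker-minutes
    (and hence cost) stay roughly constant as parallelism increases.
    """
    # Workers needed to finish by the deadline under linear scaling.
    needed = math.ceil(est_single_worker_minutes / spec.deadline_minutes)
    # Check that the total estimated work fits within the budget.
    total_cost = est_single_worker_minutes * cost_per_worker_minute
    if total_cost > spec.budget_usd:
        raise ValueError("Budget too small for the estimated training work.")
    return min(needed, spec.max_parallelism)


if __name__ == "__main__":
    spec = TrainingJobSpec(deadline_minutes=30, budget_usd=5.0)
    workers = plan_parallelism(spec, est_single_worker_minutes=240,
                               cost_per_worker_minute=0.01)
    print(f"Launch {workers} serverless workers")  # -> Launch 8 serverless workers
```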