Paper Title
Machine Learning on Volatile Instances
Paper Authors
Paper Abstract
Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple worker nodes. However, running distributed SGD can be prohibitively expensive because it may require specialized computing resources such as GPUs for extended periods of time. We propose cost-effective strategies to exploit volatile cloud instances that are cheaper than standard instances but may be interrupted by higher-priority workloads. To the best of our knowledge, this work is the first to quantify how variations in the number of active worker nodes (as a result of preemption) affect SGD convergence and the time to train the model. By understanding these trade-offs between the preemption probability of the instances, accuracy, and training time, we are able to derive practical strategies for configuring distributed SGD jobs on volatile instances such as Amazon EC2 spot instances and other preemptible cloud instances. Experimental results show that our strategies achieve good training performance at substantially lower cost.
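To make the core phenomenon concrete, below is a minimal sketch (not the paper's implementation) of distributed SGD in which each worker can be preempted independently at every step. When workers drop out, fewer mini-batch gradients are averaged, so the effective batch size shrinks and each update becomes noisier; this is the preemption/convergence trade-off the abstract refers to. All names (`num_workers`, `preempt_prob`, the least-squares objective) are illustrative assumptions, not the paper's setup.

```python
# Sketch: distributed SGD on a synthetic least-squares problem where each
# worker is independently preempted at every step with some probability.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic problem: minimize f(w) = (1/2n) * ||X w - y||^2
n, d = 10_000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def minibatch_grad(w, batch_size):
    """Stochastic gradient computed from one worker's mini-batch."""
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

def distributed_sgd(num_workers=8, preempt_prob=0.3, steps=500,
                    lr=0.05, batch_size=32):
    """SGD where each of `num_workers` workers is unavailable
    (preempted) at each step with probability `preempt_prob`."""
    w = np.zeros(d)
    for _ in range(steps):
        # Which workers survive this step?
        active = rng.random(num_workers) > preempt_prob
        k = int(active.sum())
        if k == 0:
            continue  # all workers preempted: no update this step
        # Average the gradients of the k surviving workers; the
        # effective batch size k * batch_size varies step to step.
        g = sum(minibatch_grad(w, batch_size) for _ in range(k)) / k
        w -= lr * g
    return w

# Higher preemption probability -> noisier updates, worse final error
# for the same number of steps.
for p in (0.0, 0.3, 0.6):
    w_hat = distributed_sgd(preempt_prob=p)
    print(f"preempt_prob={p:.1f}  ||w - w*|| = "
          f"{np.linalg.norm(w_hat - w_true):.4f}")
```

Running the loop at the bottom shows the trend the paper quantifies: for a fixed step budget, raising the preemption probability degrades convergence, which is what the proposed configuration strategies trade off against the lower price of preemptible instances.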