Paper Title
Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models
Paper Authors
Paper Abstract
Checkpoints play an important role in training long-running machine learning (ML) models. Checkpoints take a snapshot of an ML model and store it in non-volatile memory so that it can be used to recover from failures and ensure rapid training progress. In addition, they are used for online training to improve inference prediction accuracy with continuous learning. Given the large and ever-increasing model sizes, checkpoint frequency is often bottlenecked by storage write bandwidth and capacity. When checkpoints are maintained on remote storage, as is the case in many industrial settings, they are also bottlenecked by network bandwidth. We present Check-N-Run, a scalable checkpointing system for training large ML models at Facebook. While Check-N-Run is applicable to long-running ML jobs, we focus on checkpointing recommendation models, which are currently the largest ML models, with model sizes in the terabytes. Check-N-Run uses two primary techniques to address the size and bandwidth challenges. First, it applies incremental checkpointing, which tracks and checkpoints only the modified part of the model. Incremental checkpointing is particularly valuable in the context of recommendation models, where only a fraction of the model (stored as embedding tables) is updated on each iteration. Second, Check-N-Run leverages quantization techniques to significantly reduce the checkpoint size without degrading training accuracy. These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models at Facebook, thereby significantly improving checkpoint capabilities while reducing the total cost of ownership.
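The abstract describes two ideas: checkpointing only the embedding rows modified since the last checkpoint, and quantizing what gets written. The sketch below illustrates both under simplifying assumptions; the class name, file format, and per-row int8 quantization scheme are hypothetical and are not Check-N-Run's actual API or design, which the paper describes in detail.

```python
# Minimal sketch: incremental checkpointing of dirty embedding rows plus
# uniform int8 quantization of the delta. Names here (EmbeddingCheckpointer,
# save_delta, apply_delta) are illustrative only, not Check-N-Run's interface.
import numpy as np


class EmbeddingCheckpointer:
    def __init__(self, table: np.ndarray):
        self.table = table          # full embedding table (num_rows x dim)
        self.dirty_rows = set()     # rows updated since the last checkpoint

    def record_update(self, row_id: int):
        """Called whenever a training step touches an embedding row."""
        self.dirty_rows.add(row_id)

    def save_delta(self, path: str):
        """Write only the modified rows, quantized to int8 with per-row scales."""
        rows = sorted(self.dirty_rows)
        delta = self.table[rows]                                   # (n, dim) float32
        scale = np.abs(delta).max(axis=1, keepdims=True) / 127.0   # per-row scale
        scale[scale == 0] = 1.0                                    # avoid divide-by-zero
        q = np.clip(np.round(delta / scale), -127, 127).astype(np.int8)
        np.savez(path, rows=np.array(rows), q=q, scale=scale)
        self.dirty_rows.clear()

    @staticmethod
    def apply_delta(table: np.ndarray, path: str):
        """Replay a quantized delta on top of a base snapshot to recover the table."""
        d = np.load(path)
        table[d["rows"]] = d["q"].astype(np.float32) * d["scale"]


# Usage: checkpoint only what changed, then recover by replaying the delta.
ckpt = EmbeddingCheckpointer(np.random.randn(1000, 64).astype(np.float32))
for row in [3, 42, 7]:
    ckpt.table[row] += 0.01        # stand-in for an optimizer update
    ckpt.record_update(row)
ckpt.save_delta("delta_0.npz")

restored = np.zeros((1000, 64), dtype=np.float32)
EmbeddingCheckpointer.apply_delta(restored, "delta_0.npz")
```

Because only the touched rows and their int8 codes are written, the bytes per checkpoint scale with the number of rows updated since the last checkpoint rather than with the full table size, which is the source of the bandwidth and capacity savings the abstract reports.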